---
title: "Standard regression functions in R enabled for parallel processing over large data-frames"
author: "Kevin Blighe"
date: "`r Sys.Date()`"
package: "`r packageVersion('RegParallel')`"
output:
  html_document:
    toc: true
    toc_depth: 3
    number_sections: true
    theme: united
    highlight: tango
fig_width: 7
bibliography: library.bib
vignette: >
    %\VignetteIndexEntry{Standard regression functions in R enabled for parallel processing over large data-frames}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\usepackage[utf8]{inputenc}
---

# Introduction

In many analyses, a large number of variables have to be tested independently
against the trait/endpoint of interest, while also adjusting for covariates
and confounding factors. The major bottleneck in these analyses is the time
that it takes to complete them.

With <i>RegParallel</i>, a large number of tests can be performed
simultaneously. On a 12-core system, 144 variables can be tested
simultaneously, with thousands of variables processed in a matter of seconds
via 'nested' parallel processing.
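
As a rough illustration of that arithmetic (a minimal sketch only, using the
base R <i>parallel</i> package to count cores; scheduling itself is handled
internally by <i>RegParallel</i>):

```{r, eval=FALSE}

    # a minimal sketch, not run: with 'nested' parallel processing,
    # a second layer of parallelisation is added, so the number of
    # simultaneous model fits is roughly the square of the core count
    cores <- parallel::detectCores() # e.g. 12 on a 12-core system
    cores ^ 2                        # e.g. 144 simultaneous tests

```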

<i>RegParallel</i> works for logistic regression, linear regression,
conditional logistic regression, Cox proportional hazards and survival models,
Bayesian logistic regression, and negative binomial regression.


```{r, echo=FALSE}

    library(knitr)

    opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE)

```

# Installation

## 1. Download the package from Bioconductor

```{r getPackage, eval=FALSE}

    if (!requireNamespace('BiocManager', quietly = TRUE))
        install.packages('BiocManager')
    BiocManager::install('RegParallel')

```

Note: to install the development version:

```{r getPackageDevel, eval=FALSE}

    devtools::install_github('kevinblighe/RegParallel')

```

## 2. Load the package into R session

```{r Load, message=FALSE}

    library(RegParallel)

```


# Quick start

For this quick start, we will follow the tutorial (from Section 3.1) of
[RNA-seq workflow: gene-level exploratory analysis and differential expression](http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html). Specifically, we will load the
'airway' data, where different airway smooth muscle cells were treated
with dexamethasone.

```{r}

    library(airway)
    library(magrittr)

    data('airway')
    airway$dex %<>% relevel('untrt')

```

Normalise the raw counts in <i>DESeq2</i> and produce regularised log counts:

```{r}

    library(DESeq2)

    dds <- DESeqDataSet(airway, design = ~ dex + cell)
    dds <- DESeq(dds, betaPrior = FALSE)
    rlogcounts <- assay(rlog(dds, blind = FALSE))
    rlogdata <- data.frame(colData(airway), t(rlogcounts))

```

## Perform the most basic logistic regression analysis

Here, we fit a binomial logistic regression model to the data via <i>glmParallel</i>,
with dexamethasone as the dependent variable.

```{r glmParallel1}

    res1 <- RegParallel(
      data = rlogdata[ ,1:3000],
      formula = 'dex ~ [*]',
      FUN = function(formula, data)
        glm(formula = formula,
          data = data,
          family = binomial(link = 'logit')),
      FUNtype = 'glm',
      variables = colnames(rlogdata)[10:3000])

    res1[order(res1$P, decreasing=FALSE),]

```

## Perform a basic linear regression

Here, we will perform the linear regression using both <i>glmParallel</i> and
<i>lmParallel</i>. Note that, with the default settings, both functions
produce identical linear regression results.

Regularised log counts from our <i>DESeq2</i> data will be used.

```{r lmParallel1}

  rlogdata <- rlogdata[ ,1:2000]

  res2 <- RegParallel(
    data = rlogdata,
    formula = '[*] ~ cell',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = colnames(rlogdata)[10:ncol(rlogdata)],
    p.adjust = "none")

  res3 <- RegParallel(
    data = rlogdata,
    formula = '[*] ~ cell',
    FUN = function(formula, data)
      lm(formula = formula,
        data = data),
    FUNtype = 'lm',
    variables = colnames(rlogdata)[10:ncol(rlogdata)],
    p.adjust = "none")

  subset(res2, P<0.05)
  subset(res3, P<0.05)

```



## Perform the most basic negative binomial regression analysis

Here, we will utilise normalised, unlogged counts from <i>DESeq2</i>. Unlogged
counts in RNA-seq naturally follow a negative binomial / Poisson-like
distribution. <i>glm.nbParallel</i> will be used.

```{r glm.nbParallel1}

    nbcounts <- round(counts(dds, normalized = TRUE), 0)
    nbdata <- data.frame(colData(airway), t(nbcounts))

    res4 <- RegParallel(
      data = nbdata[ ,1:3000],
      formula = '[*] ~ dex',
      FUN = function(formula, data)
        glm.nb(formula = formula,
          data = data),
      FUNtype = 'glm.nb',
      variables = colnames(nbdata)[10:3000],
    p.adjust = "fdr")

    res4[order(res4$Theta, decreasing = TRUE),]

```

```{r, echo=FALSE}

  rm(dds, nbdata, nbcounts, rlogdata, rlogcounts, airway)

```

## Survival analysis via Cox Proportional Hazards regression

For this example, we will load breast cancer gene expression data with
recurrence free survival (RFS) from
[Gene Expression Profiling in Breast Cancer: Understanding the Molecular Basis of Histologic Grade To Improve Prognosis](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2990).
Specifically, we will test each gene's expression (as Z-scores) against RFS
in a Cox Proportional Hazards model, and then encode the expression of the
top hits as Low|Mid|High for survival plotting.

First, let's read in and prepare the data:

```{r coxphParallel1}
  library(Biobase)
  library(GEOquery)
  # load series and platform data from GEO
  gset <- getGEO('GSE2990', GSEMatrix =TRUE, getGPL=FALSE)
  x <- exprs(gset[[1]])
  # remove Affymetrix control probes
  x <- x[-grep('^AFFX', rownames(x)),]
  # transform the expression data to Z scores
  x <- t(scale(t(x)))
  # extract information of interest from the phenotype data (pdata)
  idx <- which(colnames(pData(gset[[1]])) %in%
    c('age:ch1', 'distant rfs:ch1', 'er:ch1',
      'ggi:ch1', 'grade:ch1', 'node:ch1',
      'size:ch1', 'time rfs:ch1'))
  metadata <- data.frame(pData(gset[[1]])[,idx],
    row.names = rownames(pData(gset[[1]])))
  # remove samples from the pdata that have any NA value
  discard <- apply(metadata, 1, function(x) any(is.na(x)))
  metadata <- metadata[!discard,]
  # filter the Z-scores expression data to match the samples in our pdata
  x <- x[,which(colnames(x) %in% rownames(metadata))]
  # check that sample names match exactly between pdata and Z-scores 
  all(colnames(x) == rownames(metadata))
  # create a merged pdata and Z-scores object
  coxdata <- data.frame(metadata, t(x))
  # tidy column names
  colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
    'GGI', 'Grade', 'Node',
    'Size', 'Time.RFS')
  # prepare certain phenotypes
  coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
  coxdata$Distant.RFS <- as.numeric(coxdata$Distant.RFS)
  coxdata$ER <- factor(coxdata$ER, levels = c(0, 1))
  coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
  coxdata$Time.RFS <- as.numeric(gsub('^KJX|^KJ', '', coxdata$Time.RFS))
```

With the data prepared, we can now apply a Cox Proportional Hazards model
independently for each probe in the dataset against RFS.

Here, we also increase the blocksize from its default value to 2000 in order
to speed up the analysis.

```{r coxphParallel2}
  library(survival)
  res5 <- RegParallel(
    data = coxdata,
    formula = 'Surv(Time.RFS, Distant.RFS) ~ [*]',
    FUN = function(formula, data)
      coxph(formula = formula,
        data = data,
        ties = 'breslow',
        singular.ok = TRUE),
    FUNtype = 'coxph',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000,
    p.adjust = "BH")
  res5 <- res5[!is.na(res5$P),]
  res5
```

We now take the top probes from the model by Log Rank p-value and use
<i>biomaRt</i> to look up the corresponding gene symbols.

```{r coxphParallel3}
  res5 <- res5[order(res5$LogRank, decreasing = FALSE),]
  final <- subset(res5, LogRank < 0.01)
  probes <- gsub('^X', '', final$Variable)
  library(biomaRt)
  mart <- useMart('ENSEMBL_MART_ENSEMBL')
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  annotLookup <- getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = probes,
    uniqueRows = TRUE)
```

Two of the top hits include <i>CXCL12</i> and <i>MMP10</i>. High expression of
<i>CXCL12</i> was previously associated with good progression-free and overall
survival in breast cancer in
[doi: 10.1016/j.cca.2018.05.041](https://www.ncbi.nlm.nih.gov/pubmed/29800557),
whilst high expression of <i>MMP10</i> was associated with poor prognosis in
colon cancer in
[doi: 10.1186/s12885-016-2515-7](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4950722/).

We can further explore the relationship of these genes with RFS by dividing
their gene expression Z-scores into low, mid, and high categories based on
Z-score cut-offs:

```{r coxphParallel4}
  # extract RFS and probe data for downstream analysis
  survplotdata <- coxdata[,c('Time.RFS', 'Distant.RFS',
    'X203666_at', 'X205680_at')]
  colnames(survplotdata) <- c('Time.RFS', 'Distant.RFS',
    'CXCL12', 'MMP10')
  # set Z-scale cut-offs for high and low expression
  highExpr <- 1.0
  lowExpr <- -1.0
  # encode the expression for CXCL12 and MMP10 as low, mid, and high
  survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
    ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
  survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
    ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))
  # relevel the factors to have mid as the reference level
  survplotdata$CXCL12 <- factor(survplotdata$CXCL12,
    levels = c('Mid', 'Low', 'High'))
  survplotdata$MMP10 <- factor(survplotdata$MMP10,
    levels = c('Mid', 'Low', 'High'))
```

Plot the survival curves and place Log Rank p-value in the plots:

```{r coxphParallel5, fig.height = 6, fig.width = 6, fig.cap = "Survival analysis via Cox Proportional Hazards regression."}
  library(survminer)
  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ CXCL12,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)
  ggsurvplot(survfit(Surv(Time.RFS, Distant.RFS) ~ MMP10,
    data = survplotdata),
    data = survplotdata,
    risk.table = TRUE,
    pval = TRUE,
    break.time.by = 500,
    ggtheme = theme_minimal(),
    risk.table.y.text.col = TRUE,
    risk.table.y.text = FALSE)
```


## Perform a conditional logistic regression

In this example, we will re-use the Cox data for the purpose of performing
conditional logistic regression with tumour grade as our grouping / matching
factor. For this example, we will use ER status as the dependent variable and
also adjust for age.

```{r clogitParallel}

  x <- exprs(gset[[1]])
  x <- x[-grep('^AFFX', rownames(x)),]
  x <- scale(x)
  x <- x[,which(colnames(x) %in% rownames(metadata))]

  coxdata <- data.frame(metadata, t(x))

  colnames(coxdata)[1:8] <- c('Age', 'Distant.RFS', 'ER',
    'GGI', 'Grade', 'Node',
    'Size', 'Time.RFS')

  coxdata$Age <- as.numeric(gsub('^KJ', '', coxdata$Age))
  coxdata$Grade <- factor(coxdata$Grade, levels = c(1, 2, 3))
  coxdata$ER <- as.numeric(coxdata$ER)
  coxdata <- coxdata[!is.na(coxdata$ER),]

  res6 <- RegParallel(
    data = coxdata,
    formula = 'ER ~ [*] + Age + strata(Grade)',
    FUN = function(formula, data)
      clogit(formula = formula,
        data = data,
        method = 'breslow'),
    FUNtype = 'clogit',
    variables = colnames(coxdata)[9:ncol(coxdata)],
    blocksize = 2000)

  subset(res6, P < 0.01)

  getBM(mart = mart,
    attributes = c('affy_hg_u133a',
      'ensembl_gene_id',
      'gene_biotype',
      'external_gene_name'),
    filter = 'affy_hg_u133a',
    values = c('204667_at',
      '205225_at',
      '207813_s_at',
      '212108_at',
      '219497_s_at'),
    uniqueRows=TRUE)

```

Oestrogen receptor (<i>ESR1</i>) comes out top - makes sense! Also, although
204667_at is not listed in <i>biomaRt</i>, it overlaps an exon of <i>FOXA1</i>,
which also makes sense in relation to oestrogen signalling.


```{r, echo=FALSE}
  rm(coxdata, x, gset, survplotdata, highExpr, lowExpr,
    annotLookup, mart, final, probes, idx)
```


# Advanced features

Advanced features include the ability to modify block size, choose different
numbers of cores, enable 'nested' parallel processing, modify limits for
confidence intervals, and exclude certain model terms from output.
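
These options are all arguments to the <i>RegParallel</i> call itself. As a
quick reference, here is a not-run sketch that simply collects, in one call,
the argument values demonstrated in the sub-sections that follow:

```{r advancedsketch, eval=FALSE}

  # not run; argument values are those used in the examples below
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 500,            # number of variables processed in each block
    cores = 2,                  # number of cores (default, 3)
    nestedParallel = TRUE,      # 'nested' processing; threads used = cores x cores
    conflevel = 95,             # confidence interval level (default, 95)
    excludeTerms = c('cell', 'dosage'),  # model terms to exclude from the output
    excludeIntercept = TRUE,    # exclude the intercept term from the output
    p.adjust = 'BY')            # multiple-testing adjustment method for p-values

```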


## Speed up processing

First create some test data for the purpose of benchmarking:

```{r speedup1}

  options(scipen=10)
  options(digits=6)

  # create a data-matrix of 20 x 60000 (rows x cols) random numbers
  col <- 60000
  row <- 20
  mat <- matrix(
    rexp(col*row, rate = .1),
    ncol = col)

  # add fake gene and sample names
  colnames(mat) <- paste0('gene', 1:ncol(mat))

  rownames(mat) <- paste0('sample', 1:nrow(mat))

  # add some fake metadata
  modelling <- data.frame(
    cell = rep(c('B', 'T'), nrow(mat) / 2),
    group = c(rep(c('treatment'), nrow(mat) / 2), rep(c('control'), nrow(mat) / 2)),
    dosage = t(data.frame(matrix(rexp(row, rate = 1), ncol = row))),
    mat,
    row.names = rownames(mat))

```

### ~2000 tests; blocksize, 500; cores, 2; nestedParallel, TRUE

With 2 cores instead of the default of 3, and with nestedParallel enabled, a
total of 2 x 2 = 4 threads will be used.

```{r speedup2}

  df <- modelling[ ,1:2000]
  variables <- colnames(df)[4:ncol(df)]

  ptm <- proc.time()

  res <- RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 500,
    cores = 2,
    nestedParallel = TRUE,
    p.adjust = "BY")

  proc.time() - ptm

```

### ~2000 tests; blocksize, 500; cores, 2; nestedParallel, FALSE

```{r speedup3}

  df <- modelling[ ,1:2000]
  variables <- colnames(df)[4:ncol(df)]

  ptm <- proc.time()

  res <- RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 500,
    cores = 2,
    nestedParallel = FALSE,
    p.adjust = "BY")

  proc.time() - ptm

```

Focusing on the elapsed time (as system time only reports time from the last
core that finished), we can see that nested processing brings negligible
improvement, and may even be slower, when tested over a small number of
variables. This is likely because the system is slowed by simply managing the
larger number of threads. The benefits of nested processing are only realised
when processing a large number of variables:

### ~40000 tests; blocksize, 2000; cores, 2; nestedParallel, TRUE

```{r speedup4}

  df <- modelling[ ,1:40000]
  variables <- colnames(df)[4:ncol(df)]

  system.time(RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 2000,
    cores = 2,
    nestedParallel = TRUE))

```

### ~40000 tests; blocksize, 2000; cores, 2; nestedParallel, FALSE

```{r speedup5}

  df <- modelling[,1:40000]
  variables <- colnames(df)[4:ncol(df)]

  system.time(RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 2000,
    cores = 2,
    nestedParallel = FALSE))

```

Performance is system-dependent, and even increasing the number of cores may
not result in large gains in time. Performance is a trade-off between cores,
forked threads, blocksize, and the number of terms in each model.
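
As a rough illustration of one side of that trade-off (a sketch only, assuming
the variables are simply divided into blocks of size `blocksize`), a larger
blocksize means fewer blocks to dispatch for the ~40000 tests above:

```{r blocksizesketch, eval=FALSE}

  # a minimal sketch of the block arithmetic for ~40000 variables
  nvar <- 40000
  ceiling(nvar / 2000)   # blocksize = 2000 -> 20 blocks
  ceiling(nvar / 5000)   # blocksize = 5000 -> 8 blocks

```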

### ~40000 tests; blocksize, 5000; cores, 3; nestedParallel, TRUE

In this example, we choose a large blocksize and 3 cores. With nestedParallel
enabled, this translates to 9 simultaneous threads.

```{r speedup6}

  df <- modelling[,1:40000]
  variables <- colnames(df)[4:ncol(df)]

  system.time(RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 5000,
    cores = 3,
    nestedParallel = TRUE))

```

## Modify confidence intervals

```{r confint}

  df <- modelling[ ,1:500]
  variables <- colnames(df)[4:ncol(df)]

  # 99% confidence intervals
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 99)

  # 95% confidence intervals (default)
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 95)

```


## Remove some terms from output / include the intercept

```{r removeterms}

  # remove terms but keep Intercept
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 95,
    excludeTerms = c('cell', 'dosage'),
    excludeIntercept = FALSE)

  # remove everything but the variable being tested
  RegParallel(
    data = df,
    formula = 'factor(group) ~ [*] + (cell:dosage) ^ 2',
    FUN = function(formula, data)
      glm(formula = formula,
        data = data,
        family = binomial(link = 'logit'),
        method = 'glm.fit'),
    FUNtype = 'glm',
    variables = variables,
    blocksize = 150,
    cores = 3,
    nestedParallel = TRUE,
    conflevel = 95,
    excludeTerms = c('cell', 'dosage'),
    excludeIntercept = TRUE)

```


# Acknowledgments

*RegParallel* would not exist were it not for initial
contributions from:

[Jessica Lasky-Su](https://connects.catalyst.harvard.edu/Profiles/display/Person/79780),
[Myles Lewis](https://www.qmul.ac.uk/whri/people/academic-staff/items/lewismyles.html),
[Michael Barnes](https://www.qmul.ac.uk/whri/people/academic-staff/items/barnesmichael.html)

Thanks also to Horacio Montenegro and GenoMax for testing.

# Session info

```{r}

sessionInfo()

```

# References

@RegParallel