Simulate RNA-seq Data from Real Data

David Gerard

Abstract

We demonstrate how one may use seqgendiff in differential expression simulation studies using the airway data from Himes et al (2014). We use seqgendiff to simulate one dataset which we then analyze with two pipelines: the sva-voom-limma-eBayes-qvalue pipeline, and the sva-DESeq2-qvalue pipeline. In practice, you would simulate many datasets and compare average performance. But playing with a single dataset can help you gain intuition with various methods.

The methods used here are described in Gerard (2020).

Analysis

Set the seed for reproducibility:

set.seed(1)

The airway data are available in the airway R package from Bioconductor. If you don’t have this package, you can install it with the BiocManager package:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("airway")

Now we can load these data into R.

library(airway)
data("airway")

You can read about these data here.

These are the other packages that we’ll need for this vignette:

library(seqgendiff)
library(SummarizedExperiment)
library(DESeq2)
library(limma)
library(sva)
library(qvalue)

The last five packages are all from Bioconductor, which you need to install using BiocManager::install().

The airway dataset comes with a few observed variables which affect gene expression. We’ll consider these “surrogate variables”. That is, we’ll add signal to the data and try to account for these surrogate variables.

coldat <- colData(airway)[, c("cell", "dex")]
true_sv <- model.matrix(~cell + dex, data = coldat)[, -1]
true_sv
#>            cellN061011 cellN080611 cellN61311 dexuntrt
#> SRR1039508           0           0          1        1
#> SRR1039509           0           0          1        0
#> SRR1039512           0           0          0        1
#> SRR1039513           0           0          0        0
#> SRR1039516           0           1          0        1
#> SRR1039517           0           1          0        0
#> SRR1039520           1           0          0        1
#> SRR1039521           1           0          0        0

Suppose we want to add a treatment effect to the RNA-seq data. We’ll first remove all genes with no counts:

allzero <- rowSums(assay(airway)) < 10^-6
airway <- airway[!allzero, ]

Then we’ll use the convenient thin_2group() function from the seqgendiff package to add a \(N(0, 0.8^2)\) log2-effect size to 10% of the genes:

thout <- thin_2group(mat = assay(airway), 
                     prop_null = 0.9, 
                     signal_fun = stats::rnorm,
                     signal_params = list(mean = 0, sd = 0.8))

Let’s see how well the sva-voom-limma-ebayes-qvalue pipeline does estimating these effects. First, we’ll fit this pipeline.

X <- cbind(thout$design_obs, thout$designmat)
Y <- log2(thout$mat + 0.5)
n_sv <- num.sv(dat = Y, mod = X)
svout <- sva(dat = Y, mod = X, n.sv = n_sv)
#> Number of significant surrogate variables is:  2 
#> Iteration (out of 5 ):1  2  3  4  5
vout <- voom(counts = thout$mat, design = cbind(X, svout$sv))
lout <- lmFit(vout)
eout <- eBayes(lout)
qout <- qvalue(p = eout$p.value[, 2])
bhat <- eout$coefficients[, 2]

We plot the coefficient estimates against their true values.

plot(thout$coefmat, 
     bhat, 
     xlab = "True Coefficients", 
     ylab = "Estimated Coefficients")
abline(0, 1, col = 2, lwd = 2)

We plot the q-values against the null status of each gene.

is_null_gene <- abs(thout$coefmat) < 10^-6
boxplot(qout$qvalues ~ is_null_gene,
        xlab = "Null Gene",
        ylab = "q-value")

Does the sva-DESeq2 pipeline do any better? We’ll first use seqgendiff::ThinDataToDESeqDataSet() to convert the ThinData object thout into a DESeqDataSet object.

thout$design_obs <- cbind(thout$design_obs, svout$sv)
dds <- ThinDataToDESeqDataSet(obj = thout)
colData(dds)
#> DataFrame with 8 rows and 3 columns
#>                 V1         V2        P1
#>          <numeric>  <numeric> <numeric>
#> sample1  0.0484600 -0.4083886         1
#> sample2 -0.2708872  0.0966676         0
#> sample3  0.2242304 -0.3340213         0
#> sample4 -0.5995604  0.1820917         1
#> sample5  0.4572990 -0.0470245         0
#> sample6  0.4676437  0.6664217         1
#> sample7 -0.0347875 -0.4100546         0
#> sample8 -0.2923981  0.2543079         1
design(dds)
#> ~V1 + V2 + P1
#> <environment: 0x5626752f14d8>

Note that we added the estimated surrogate variables from sva to the ThinData object before converting to a DESeqDataSet object.

Now we will fit DESeq2.

dds <- DESeq(object = dds)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
pval_dds <- rowData(dds)[, "WaldPvalue_P1"]
qval_dds <- qvalue(p = pval_dds)

We plot the coefficient estimates against their true values.

coefmat <- rowData(dds)[, c("true_P1", "P1")]
plot(coefmat$true_P1, 
     coefmat$P1,
     xlab = "True Coefficients", 
     ylab = "Estimated Coefficients")
abline(0, 1, col = 2, lwd = 2)

We also plot the q-values against the null status of each gene.

boxplot(qval_dds$qvalues ~ is_null_gene,
        xlab = "Null Gene",
        ylab = "q-value")

The MSE is worse for DESeq2.

mse_limma <- mean((bhat - thout$coefmat) ^ 2)
mse_deseq2 <- mean((coefmat$true_P1 - coefmat$P1) ^ 2, na.rm = TRUE)
mse_deseq2
#> [1] 1.095151
mse_limma
#> [1] 0.2910537

The FDP is about the same for the two pipelines. Both are OK, but conservative.

mean(is_null_gene[qval_dds$qvalues < 0.1], na.rm = TRUE)
#> [1] 0.05263158
mean(is_null_gene[qout$qvalues < 0.1], na.rm = TRUE)
#> [1] 0.01336898

DESeq2 has more discoveries

sum(qval_dds$qvalues < 0.1, na.rm = TRUE)
#> [1] 513
sum(qout$qvalues < 0.1, na.rm = TRUE)
#> [1] 374

As a side-note, it seems that the second surrogate variable is getting at the dexamethasone variable.

boxplot(svout$sv[, 2] ~ true_sv[, 4], 
        xlab = "Dexamethasone Treatment", 
        ylab = "2nd Surrogate Variable")
points(jitter(true_sv[, 4] + 1), svout$sv[, 2])

References