--- title: "RUVSeq: Remove Unwanted Variation from RNA-Seq Data" author: name: Davide Risso affiliation: Department of Statistical Sciences, University of Padova date: "Last modified: November 22, 2022; Compiled: `r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: toc: true bibliography: biblio.bib --- ```{r, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Overview In this document, we show how to conduct a differential expression (DE) analysis that controls for "unwanted variation", e.g., batch, library preparation, and other nuisance effects, using the between-sample normalization methods proposed in @risso2013ruv. We call this approach _RUVSeq_ for _remove unwanted variation from RNA-Seq data_. Briefly, _RUVSeq_ works as follows. For $n$ samples and $J$ genes, consider the following generalized linear model (GLM), where the RNA-Seq read counts are regressed on both the known covariates of interest and unknown factors of unwanted variation, \begin{equation} \log E[Y | W, X, O] = W \alpha + X \beta + O. \end{equation} Here, $Y$ is the $n \times J$ matrix of observed gene-level read counts, $W$ is an $n \times k$ matrix corresponding to the factors of "unwanted variation" and $\alpha$ its associated $k \times J$ matrix of nuisance parameters, $X$ is an $n \times p$ matrix corresponding to the $p$ covariates of interest/factors of "wanted variation" (e.g., treatment effect) and $\beta$ its associated $p \times J$ matrix of parameters of interest, and $O$ is an $n \times J$ matrix of offsets that can either be set to zero or estimated with some other normalization procedure (such as upper-quartile normalization). The matrix $X$ is a random variable, assumed to be known a priori. For instance, in the usual two-class comparison setting (e.g., treated vs. control samples), $X$ is an $n \times 2$ design matrix with a column of ones corresponding to an intercept and a column of indicator variables for the class of each sample (e.g., 0 for control and 1 for treated) [@mccullough1989generalized]. The matrix $W$ is an unobserved random variable and $\alpha$, $\beta$, and $k$ are unknown parameters. The simultaneous estimation of $W$, $\alpha$, $\beta$, and $k$ is infeasible. For a given $k$, we consider instead the following three approaches to estimate the factors of unwanted variation $W$: - `RUVg` uses negative control genes, assumed to have constant expression across samples; - `RUVs` uses centered (technical) replicate/negative control samples for which the covariates of interest are constant; - `RUVr` uses residuals, e.g., from a first-pass GLM regression of the counts on the covariates of interest. The resulting estimate of $W$ can then be plugged into Equation \eqref{eq1}, for the full set of genes and samples, and $\alpha$ and $\beta$ estimated by GLM regression. Normalized read counts can be obtained as residuals from ordinary least squares (OLS) regression of $\log Y - O$ on the estimated $W$. Note that although here we illustrate the RUV approach using the GLM implementation of _edgeR_ and _DESeq2_, all three RUV versions can be readily adapted to work with any DE method formulated within a GLM framework. See @risso2013ruv for full details and algorithms for each of the three RUV procedures. # A typical differential expression analysis workflow In this section, we consider the `RUVg` function to estimate the factors of unwanted variation using control genes. See the Sections below for examples using the `RUVs` and `RUVr` approaches. We consider the zebrafish dataset of @ferreira2013silencing, available through the Bioconductor package _zebrafishRNASeq_. The data correspond to RNA libraries for three pairs of gallein-treated and control embryonic zebrafish cell pools. For each of the 6 samples, we have RNA-Seq read counts for $32{,}469$ Ensembl genes and $92$ ERCC spike-in sequences. See @risso2013ruv and the _zebrafishRNASeq_ package vignette for details. ```{r data, warning=FALSE, message=FALSE} library(RUVSeq) library(zebrafishRNASeq) data(zfGenes) head(zfGenes) tail(zfGenes) ``` ## Filtering and exploratory data analysis We filter out non-expressed genes, by requiring more than 5 reads in at least two samples for each gene. ```{r filter} filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2) filtered <- zfGenes[filter,] genes <- rownames(filtered)[grep("^ENS", rownames(filtered))] spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))] ``` After the filtering, we are left with `r length(genes)` genes and `r length(spikes)` spike-ins. We store the data in an object of S4 class _SeqExpressionSet_ from the _EDASeq_ package. This allows us to make full use of the plotting and normalization functionality of _EDASeq_. Note, however, that all the methods in _RUVSeq_ are implemented for both _SeqExpressionSet_ and _matrix_ objects. See the help pages for details. ```{r store_data} x <- as.factor(rep(c("Ctl", "Trt"), each=3)) set <- newSeqExpressionSet(as.matrix(filtered), phenoData = data.frame(x, row.names=colnames(filtered))) set ``` The boxplots of relative log expression (RLE = log-ratio of read count to median read count across sample) and plots of principal components (PC) reveal a clear need for betwen-sample normalization. ```{r rle} library(RColorBrewer) colors <- brewer.pal(3, "Set2") plotRLE(set, outline=FALSE, ylim=c(-4, 4), col=colors[x]) plotPCA(set, col=colors[x], cex=1.2) ``` We can use the _betweenLaneNormalization_ function of _EDASeq_ to normalize the data using upper-quartile (UQ) normalization [@bullard2010evaluation]. ```{r uq} set <- betweenLaneNormalization(set, which="upper") plotRLE(set, outline=FALSE, ylim=c(-4, 4), col=colors[x]) plotPCA(set, col=colors[x], cex=1.2) ``` After upper-quartile normalization, treated sample _Trt11_ still shows extra variability when compared to the rest of the samples. This is reflected by the first principal component, that is driven by the difference between _Trt11_ and the other samples. ## RUVg: Estimating the factors of unwanted variation using control genes To estimate the factors of unwanted variation, we need a set of _negative control genes_, i.e., genes that can be assumed not to be influenced by the covariates of interest (in the case of the zebrafish dataset, the Gallein treatment). In many cases, such a set can be identified, e.g., housekeeping genes or spike-in controls. If a good set of negative controls is not readily available, one can define a set of "in-silico empirical" controls. Here, we use the ERCC spike-ins as controls and we consider $k=1$ factors of unwanted variation. See @risso2013ruv and @gagnon2012 for a discussion on the choice of $k$. ```{r ruv_spikes} set1 <- RUVg(set, spikes, k=1) pData(set1) plotRLE(set1, outline=FALSE, ylim=c(-4, 4), col=colors[x]) plotPCA(set1, col=colors[x], cex=1.2) ``` The _RUVg_ function returns two pieces of information: the estimated factors of unwanted variation (added as columns to the _phenoData_ slot of _set_) and the normalized counts obtained by regressing the original counts on the unwanted factors. The normalized values are stored in the _normalizedCounts_ slot of _set_ and can be accessed with the _normCounts_ method. These counts should be used only for exploration. It is important that subsequent DE analysis be done on the _original counts_ (accessible through the _counts_ method), as removing the unwanted factors from the counts can also remove part of a factor of interest [@ruv4]. Note that one can relax the negative control gene assumption by requiring instead the identification of a set of positive or negative controls, with a priori known expression fold-changes between samples, i.e., known $\beta$. One can then use the centered counts for these genes ($\log Y - X\beta$) for normalization purposes. ## Differential expression analysis Now, we are ready to look for differentially expressed genes, using the negative binomial GLM approach implemented in _edgeR_ (see the _edgeR_ package vignette for details). This is done by considering a design matrix that includes both the covariates of interest (here, the treatment status) and the factors of unwanted variation. ```{r edger} design <- model.matrix(~x + W_1, data=pData(set1)) y <- DGEList(counts=counts(set1), group=x) y <- calcNormFactors(y, method="upperquartile") y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit, coef=2) topTags(lrt) ``` ## Empirical control genes If no genes are known _a priori_ not to be influenced by the covariates of interest, one can obtain a set of "in-silico empirical" negative controls, e.g., least significantly DE genes based on a first-pass DE analysis performed prior to RUVg normalization. ```{r empirical} design <- model.matrix(~x, data=pData(set)) y <- DGEList(counts=counts(set), group=x) y <- calcNormFactors(y, method="upperquartile") y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit, coef=2) top <- topTags(lrt, n=nrow(set))$table empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))] ``` Here, we consider all but the top $5{,}000$ genes as ranked by _edgeR_ $p$-values. ```{r emp_ruvg} set2 <- RUVg(set, empirical, k=1) pData(set2) plotRLE(set2, outline=FALSE, ylim=c(-4, 4), col=colors[x]) plotPCA(set2, col=colors[x], cex=1.2) ``` ## Differential expression analysis with _DESeq2_ In alternative to _edgeR_, one can perform differential expression analysis with _DESeq2_. The approach is very similar, namely, we will use the same design matrix, but we need to specify it within the _DESeqDataSet_ object. ```{r deseq2} library(DESeq2) dds <- DESeqDataSetFromMatrix(countData = counts(set1), colData = pData(set1), design = ~ W_1 + x) dds <- DESeq(dds) res <- results(dds) res ``` Note that this will perform by default a Wald test of significance of the last variable in the design formula, in this case $x$. If one wants to perform a likelihood ratio test, she needs to specify a reduced model that includes $W$ (see the _DESeq2_ vignette for more details on the test statistics). ```{r deseq2lrt, eval=FALSE} dds <- DESeq(dds, test="LRT", reduced=as.formula("~ W_1")) res <- results(dds) ``` # RUVs: Estimating the factors of unwanted variation using replicate samples} As an alternative approach, one can use the _RUVs_ method to estimate the factors of unwanted variation using replicate/negative control samples for which the covariates of interest are constant. First, we need to construct a matrix specifying the replicates. In the case of the zebrafish dataset, we can consider the three treated and the three control samples as replicate groups. The function _makeGroups_ can be used. ```{r diff} differences <- makeGroups(x) differences ``` Although in principle one still needs control genes for the estimation of the factors of unwanted variation, we found that _RUVs_ is robust to that choice and that using all the genes works well in practice [@risso2013ruv]. ```{r ruvs} set3 <- RUVs(set, genes, k=1, differences) pData(set3) ``` # RUVr: Estimating the factors of unwanted variation using residuals Finally, a third approach is to consider the residuals (e.g., deviance residuals) from a first-pass GLM regression of the counts on the covariates of interest. This can be achieved with the _RUVr_ method. First, we need to compute the residuals from the GLM fit, without RUVg normalization, but possibly after normalization using a method such as upper-quartile normalization. ```{r res, eval=FALSE} design <- model.matrix(~x, data=pData(set)) y <- DGEList(counts=counts(set), group=x) y <- calcNormFactors(y, method="upperquartile") y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) res <- residuals(fit, type="deviance") ``` Again, we can use all the genes to estimate the factors of unwanted variation. ```{r ruvr, eval=FALSE} set4 <- RUVr(set, genes, k=1, res) ``` # Session info ```{r sessionInfo} sessionInfo() ``` # References