---
title: "RUVSeq: Remove Unwanted Variation from RNA-Seq Data"
author: 
    name: Davide Risso
    affiliation: Department of Statistical Sciences, University of Padova 
date: "Last modified: November 22, 2022; Compiled: `r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEncoding{UTF-8}
output:
      BiocStyle::html_document:
        toc: true
bibliography: biblio.bib  
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{RUVSeq: Remove Unwanted Variation from RNA-Seq Data}
-->

```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Overview

In this document, we show how to conduct a differential expression (DE) analysis that controls for "unwanted variation", e.g., batch, library preparation, and other nuisance effects, using the between-sample normalization methods proposed in @risso2013ruv. We call this approach _RUVSeq_ for _remove unwanted variation from RNA-Seq data_.

Briefly, _RUVSeq_ works as follows. For $n$ samples and $J$ genes, consider the following generalized linear model (GLM), where the RNA-Seq read counts are regressed on both the known covariates of interest and unknown factors of unwanted variation, 

\begin{equation}
\log E[Y | W, X, O] = W \alpha + X \beta + O.
\end{equation}

Here, $Y$ is the $n \times J$ matrix of observed gene-level read counts, $W$ is an $n \times k$ matrix corresponding to the factors of "unwanted variation" and $\alpha$ its associated $k \times J$ matrix of nuisance parameters, $X$ is an $n \times p$ matrix corresponding to the $p$ covariates of interest/factors of "wanted variation" (e.g., treatment effect) and $\beta$ its associated $p \times J$ matrix of parameters of interest, and $O$ is an $n \times J$ matrix of offsets that can either be set to zero or estimated with some other normalization procedure (such as upper-quartile normalization).  

The matrix $X$ is a random variable, assumed to be known a priori. For instance, in the usual two-class comparison setting (e.g., treated vs. control samples), $X$ is an $n \times 2$ design matrix with a column of ones corresponding to an intercept and a column of indicator variables for the class of each sample (e.g., 0 for control and 1 for treated) [@mccullough1989generalized]. The matrix $W$ is an unobserved random variable and $\alpha$, $\beta$, and $k$ are unknown
parameters.  

The simultaneous estimation of $W$, $\alpha$, $\beta$, and $k$ is infeasible. For a given $k$, we consider instead the following three approaches to estimate the factors of unwanted variation $W$:

- `RUVg` uses negative control genes, assumed to have constant expression across samples;
- `RUVs` uses centered (technical) replicate/negative control samples for which the covariates of interest are constant; 
- `RUVr` uses residuals, e.g., from a first-pass GLM regression of the counts on the covariates of interest.


The resulting estimate of $W$ can then be plugged into Equation \eqref{eq1}, for the full set of genes and samples, and $\alpha$ and $\beta$ estimated by GLM regression. Normalized read counts can be obtained as residuals from
  ordinary least squares (OLS) regression of $\log Y - O$ on the estimated $W$.
    
Note that although here we illustrate the RUV approach using the GLM implementation of _edgeR_ and _DESeq2_, all three RUV versions can be readily adapted to work with any DE method formulated within a GLM framework.

See @risso2013ruv for full details and algorithms for each of the
three RUV procedures.

# A typical differential expression analysis workflow

In this section, we consider the `RUVg` function to estimate the factors of unwanted variation using control genes. See the Sections below for examples using the `RUVs` and `RUVr` approaches.

We consider the zebrafish dataset of @ferreira2013silencing, available through the Bioconductor package _zebrafishRNASeq_. The data correspond to RNA libraries for three pairs of gallein-treated and control embryonic zebrafish cell pools. For each of the 6 samples, we have RNA-Seq read counts for $32{,}469$ Ensembl genes and $92$ ERCC spike-in sequences. See
@risso2013ruv and the _zebrafishRNASeq_ package vignette
for details.

```{r data, warning=FALSE, message=FALSE}
library(RUVSeq)
library(zebrafishRNASeq)
data(zfGenes)
head(zfGenes)
tail(zfGenes)
``` 

## Filtering and exploratory data analysis

We filter out non-expressed genes, by requiring more than 5 reads
in at least two samples for each gene.

```{r filter}
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
genes <- rownames(filtered)[grep("^ENS", rownames(filtered))]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]
```

After the filtering, we are left with `r length(genes)` genes and
`r length(spikes)` spike-ins.

We store the data in an object of S4 class _SeqExpressionSet_ from  the _EDASeq_ package. This allows us to make full use of
the plotting and normalization functionality of _EDASeq_. Note,
however, that all the methods in _RUVSeq_ are implemented for both
_SeqExpressionSet_ and _matrix_ objects. See the
help pages for details.

```{r store_data}
x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
                           phenoData = data.frame(x, row.names=colnames(filtered)))
set
```

The boxplots of relative log expression (RLE = log-ratio of read count to median read count across sample) and plots of principal components (PC) reveal a clear need for betwen-sample normalization. 

```{r rle}
library(RColorBrewer)
colors <- brewer.pal(3, "Set2")
plotRLE(set, outline=FALSE, ylim=c(-4, 4), col=colors[x])
plotPCA(set, col=colors[x], cex=1.2)
```

We can use the _betweenLaneNormalization_ function of
_EDASeq_ to normalize the data using upper-quartile (UQ)
normalization [@bullard2010evaluation].

```{r uq}
set <- betweenLaneNormalization(set, which="upper")
plotRLE(set, outline=FALSE, ylim=c(-4, 4), col=colors[x])
plotPCA(set, col=colors[x], cex=1.2)
```

After upper-quartile normalization, treated sample _Trt11_ still
shows extra variability when compared to the rest of the samples.
This is reflected by the first principal
component, that is driven by the difference
between _Trt11_ and the other samples.

## RUVg: Estimating the factors of unwanted variation using control genes

To estimate the factors of unwanted variation, we need a set of _negative
  control genes_, i.e., genes that can be assumed not to be influenced by the
covariates of interest (in the case of the zebrafish dataset, the Gallein
treatment). In many cases, such a set can be identified, e.g., housekeeping genes or spike-in controls. If a good set of
negative controls is not readily available, one can define a set of "in-silico empirical"
controls.

Here, we use the ERCC spike-ins as controls and we consider $k=1$
factors of unwanted variation. See @risso2013ruv and
@gagnon2012 for a discussion on the choice of $k$.

```{r ruv_spikes}
set1 <- RUVg(set, spikes, k=1)
pData(set1)
plotRLE(set1, outline=FALSE, ylim=c(-4, 4), col=colors[x])
plotPCA(set1, col=colors[x], cex=1.2)
```

The _RUVg_ function returns two pieces of information: the
estimated factors of unwanted variation (added as columns to the _phenoData_ slot of _set_) and
the normalized counts obtained by regressing the original counts on
the unwanted factors. The normalized values are stored in the
_normalizedCounts_ slot of _set_ and can be accessed
with the _normCounts_ method. These counts should be used only
for exploration. It is important that subsequent DE analysis be
done on the _original counts_ (accessible through the
_counts_ method), as removing the unwanted factors
from the counts can also remove part of a factor of interest [@ruv4].

Note that one can relax the negative control
  gene assumption by requiring instead the identification of a set of
  positive or negative controls, with a priori known
  expression fold-changes between samples, i.e., known $\beta$.  
One can then use the centered counts for these genes ($\log Y - X\beta$) for normalization purposes. 

## Differential expression analysis

Now, we are ready to look for differentially expressed genes, using
the negative binomial GLM approach implemented in _edgeR_ (see the
_edgeR_ package vignette for details).
This is done by considering a design matrix that includes both the
covariates of interest (here, the treatment status) and the factors of
unwanted variation.

```{r edger}
design <- model.matrix(~x + W_1, data=pData(set1))
y <- DGEList(counts=counts(set1), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
```

## Empirical control genes

If no genes are known _a priori_ not to be influenced by the covariates of interest, one can obtain a set of "in-silico empirical" negative controls, e.g., least significantly  DE genes based on a first-pass DE analysis performed prior to RUVg normalization.

```{r empirical}
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)

top <- topTags(lrt, n=nrow(set))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]
```

Here, we consider all but the top $5{,}000$ genes as ranked by
_edgeR_ $p$-values.

```{r emp_ruvg}
set2 <- RUVg(set, empirical, k=1)
pData(set2)
plotRLE(set2, outline=FALSE, ylim=c(-4, 4), col=colors[x])
plotPCA(set2, col=colors[x], cex=1.2)
```

## Differential expression analysis with _DESeq2_

In alternative to _edgeR_, one can perform differential expression analysis with _DESeq2_. The approach is very similar, namely, we will use the same design matrix, but we need to specify it within the _DESeqDataSet_ object.

```{r deseq2}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts(set1),
                              colData = pData(set1),
                              design = ~ W_1 + x)
dds <- DESeq(dds)
res <- results(dds)
res
```

Note that this will perform by default a Wald test of significance of the last variable in the design formula, in this case $x$. If one wants to perform a likelihood ratio test, she needs to specify a reduced model that includes $W$ (see the _DESeq2_ vignette for more details on the test statistics).

```{r deseq2lrt, eval=FALSE}
dds <- DESeq(dds, test="LRT", reduced=as.formula("~ W_1"))
res <- results(dds)
```

# RUVs: Estimating the factors of unwanted variation using replicate samples}

As an alternative approach, one can use the _RUVs_ method to
estimate the factors of unwanted variation using replicate/negative control samples for which the covariates of interest are constant.

First, we need to construct a matrix specifying the replicates. In
the case of the zebrafish dataset, we can consider the three treated
and the three control samples as replicate groups. The function _makeGroups_ can be used.

```{r diff}
differences <- makeGroups(x)
differences
```

Although in principle one still needs control genes for the estimation
of the factors of unwanted variation, we found that _RUVs_ is robust to that choice and that using all the genes works
well in practice [@risso2013ruv].

```{r ruvs}
set3 <- RUVs(set, genes, k=1, differences)
pData(set3)
```

# RUVr: Estimating the factors of unwanted variation using residuals

Finally, a third approach is to consider the residuals (e.g., deviance residuals) from a first-pass GLM regression of the counts on the covariates of interest. This can be achieved with the _RUVr_ method.

First, we need to compute the residuals from the GLM fit, without RUVg normalization, but possibly after normalization using a method such as upper-quartile normalization.

```{r res, eval=FALSE}
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

fit <- glmFit(y, design)
res <- residuals(fit, type="deviance")
```

Again, we can use all the genes to estimate the factors of unwanted
variation.

```{r ruvr, eval=FALSE}
set4 <- RUVr(set, genes, k=1, res)
```

# Session info

```{r sessionInfo}
sessionInfo()
```

# References