useR! 2014
Author: Martin Morgan (mtmorgan@fhcrc.org), Sonali
Arora
Date: 30 June, 2014
RNASeq
ChIPSeq – Chromatin immunopreciptation
Variants
Structural
Work flow
Statistical challenges
This work flow derives from a more comprehensive example developed in
the parathyroidSE package and the
CSAMA 2014
workshop. It uses DESeq2 to estimate gene-level differential
expression. Analysis picks up after counting reads aligning to
Ensembl gene regions; counting can be accomplished using
summarizeOverlaps()
, as described in the parathyroidSE vignette.
The data is from Haglund et al., Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas, J Clin Endocrin Metab, Sep 2012. The experiment investigated the role of the estrogen receptor in parathyroid tumors. The investigators derived primary cultures of parathyroid adenoma cells from 4 patients. These primary cultures were treated with diarylpropionitrile (DPN), an estrogen receptor β agonist, or with 4-hydroxytamoxifen (OHT). RNA was extracted at 24 hours and 48 hours from cultures under treatment and control.
Start by loading the count data, which has been carefully curatd as a
SummarizedExperiment
object. Explore the object with accessors such
as rowData(se)
, colData(se)
, exptData(se)$MIAME
, and
head(assay(se))
require("DESeq2")
## Loading required package: DESeq2
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
require("parathyroidSE")
## Loading required package: parathyroidSE
data("parathyroidGenesSE")
se <- parathyroidGenesSE
colnames(se) <- se$run
The first step is to add an experimental design and perform additional preparatory steps. The experiment contains several technical replicates, and while these could be incorporated in the model, here we simply pool them (pooling technical replicates is often appropriate for RNASeq data).
ddsFull <- DESeqDataSet(se, design = ~ patient + treatment)
ddsCollapsed <-
collapseReplicates(ddsFull, groupby = ddsFull$sample,
run = ddsFull$run)
dds <- ddsCollapsed[, ddsCollapsed$time == "48h"]
dds$time <- droplevels(dds$time)
dds$treatment <- relevel(dds$treatment, "Control")
The entire work flow is executed with a call to the DESeq()
function. This includes library size factor estimation, dispersion
estimates, model fitting, independent filtering, and formulating test
statistics based on the specified design. Here we generate the results
for a comparison between DPN and control.
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds, contrast = c("treatment", "DPN", "Control"))
The remaining steps provide some basic checks and visualizations of the data. We identify genes for which the adjusted (for multiple comparison) p value is less than 0.1, and order these by their log2 fold change.
resSig <- res[which(res$padj < 0.1 ),]
head(resSig[order(resSig$log2FoldChange),])
## log2 fold change (MAP): treatment DPN vs Control
## Wald test p-value: treatment DPN vs Control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000163631 233.3 -0.9307 0.2842 -3.275 1.057e-03
## ENSG00000119946 151.6 -0.6902 0.1564 -4.412 1.022e-05
## ENSG00000041982 1377.0 -0.6859 0.1845 -3.717 2.018e-04
## ENSG00000155111 530.9 -0.6756 0.2109 -3.203 1.359e-03
## ENSG00000233705 198.6 -0.6727 0.1446 -4.651 3.301e-06
## ENSG00000091137 1143.5 -0.6544 0.1046 -6.254 3.991e-10
## padj
## <numeric>
## ENSG00000163631 5.685e-02
## ENSG00000119946 2.448e-03
## ENSG00000041982 1.986e-02
## ENSG00000155111 6.687e-02
## ENSG00000233705 1.053e-03
## ENSG00000091137 9.228e-07
An 'MA' plot represents each gene as a point, with average expression over all samples on the x-axis, and log2 fold change between treatment groups on the y-axis; highlighted values are genes with adjusted p values less than 0.1.
plotMA(res, ylim = c(-1, 1))
The strategy taken by DESeq2 for dispersion estimation is
summarized by the plotDispEsts()
function. It shows (a) black
per-gene dispersion estimates, (b) a red trend line representing the
global relationship between dispersion and normalized count, © blue
'shrunken' values moderating individual dispersion estimates by the
global relationship, and (d) blue-circled dispersion outliers with
high gene-wise dispersion that were not adjusted.
plotDispEsts(dds, ylim = c(1e-6, 1e1))
A final diagnostic is a plot of the histogram of p values, which should be uniform under the null hypothesis; skew to the right may indicate batch or other effects not accommodated in the model
hist(res$pvalue, breaks=40, col="grey")