---
author: "Belinda Phipson and Jovana Maksimovic"
title: "missMethyl: Analysing Illumina HumanMethylation BeadChip Data"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('BiocStyle')`"
vignette: >
  %\VignetteIndexEntry{missMethyl: Analysing Illumina HumanMethylation BeadChip Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: >
  BiocStyle::html_document
html_document:
  fig_caption: yes
  fig_retina: FALSE
  keep_md: FALSE
bibliography: bibliography.bib
---

# Introduction

The `r BiocStyle::Biocpkg("missMethyl")` package contains functions to analyse methylation 
data from Illumina's HumanMethylation450 and MethylationEPIC beadchip. 
These arrays are a cost-effective alternative to whole genome bisulphite 
sequencing, and as such are widely used to profile DNA methylation. 
Specifically, `r BiocStyle::Biocpkg("missMethyl")` contains functions to perform SWAN 
normalisation [@Maksimovic2012],
perform differential methylation analysis using **RUVm** [@Maksimovic2015],
differential variability analysis [@Phipson2014] and gene set analysis
[@Phipson2016]. As our lab's research into specialised analyses of
these arrays continues we anticipate that the package will be
continuously updated with new functions.

Raw data files are in IDAT format, which can be read into R using the
`r BiocStyle::Biocpkg("minfi")` package [@Aryee2014]. Statistical analyses are usually 
performed on M-values, and $\beta$ values are used for visualisation, 
both of which can be extracted from objects, which is a class of object created 
by `r BiocStyle::Biocpkg("minfi")`. For detecting differentially variable CpGs we recommend that the
analysis is performed on M-values. All analyses described here are
performed at the CpG site level.

# Reading data into R

We will use the data in the `r BiocStyle::Biocpkg("minfiData")` package to demonstrate the 
functions in `r BiocStyle::Biocpkg("missMethyl")`.
The example dataset has 6 samples across two slides. The sample
information is in the targets file. An essential column in the targets
file is the `Basename` column which tells where the idat files to be
read in are located. The R commands to read in the data are taken from
the `r BiocStyle::Biocpkg("minfi")` User's Guide. For additional details on how to read 
the IDAT files into R, as well as information regarding quality control please 
refer to the `r BiocStyle::Biocpkg("minfi")` User's Guide.

```{r load-libs, message=FALSE}
library(missMethyl)
library(limma)
library(minfi)
```

```{r reading-data, message=FALSE}
library(minfiData)
baseDir <- system.file("extdata", package = "minfiData")
targets <- read.metharray.sheet(baseDir)
targets[,1:9]
targets[,10:12]
rgSet <- read.metharray.exp(targets = targets)
```

The data is now an `RGChannelSet` object and needs to be normalised and 
converted to a `MethylSet` object.

# Subset-quantile within array normalization (SWAN)

SWAN (subset-quantile within array normalization) is a within-array
normalization method for Illumina 450k & EPIC BeadChips. Technical differencs
have been demonstrated to exist between the Infinium I and Infinium II
assays on a single Illumina HumanMethylation array [@Bibikova2011, @Dedeurwaerder2011]. Using the 
SWAN method
substantially reduces the technical variability between the assay
designs whilst maintaining important biological differences. The SWAN
method makes the assumption that the number of CpGs within the 50bp
probe sequence reflects the underlying biology of the region being
interrogated. Hence, the overall distribution of intensities of probes
with the same number of CpGs in the probe body should be the same
regardless of assay type. The method then uses a subset quantile
normalization approach to adjust the intensities of each array
[@Maksimovic2012].

`SWAN` can take a `MethylSet`, `RGChannelSet` or `MethyLumiSet` as input. It 
should be noted that, in order to create the normalization subset, `SWAN` 
randomly selects Infinium I and II probes that have one, two and three 
underlying CpGs; as such, we recommend using `set.seed` before to ensure that 
the normalized intensities will be
identical, if the normalization is repeated.

The technical differences between Infinium I and II assay designs can
result in aberrant beta value distributions (Figure 1, panel "Raw"). Using SWAN 
corrects for the technical differences between the
Infinium I and II assay designs and produces a smoother overall $\beta$
value distribution (Figure 1, panel "SWAN").

```{r ppraw}
mSet <- preprocessRaw(rgSet)
```

```{r swan}
mSetSw <- SWAN(mSet,verbose=TRUE)
```

```{r betasByType,fig.height=5,fig.width=10,fig.retina=1,fig.cap="Density distributions of $\beta$ values before and after using SWAN."}
par(mfrow=c(1,2), cex=1.25)
densityByProbeType(mSet[,1], main = "Raw")
densityByProbeType(mSetSw[,1], main = "SWAN")
```

# Filter out poor quality probes

Poor quality probes can be filtered out based on the detection p-value.
For this example, to retain a CpG for further analysis, we require that
the detection p-value is less than 0.01 in all samples.

```{r filtering}
detP <- detectionP(rgSet)
keep <- rowSums(detP < 0.01) == ncol(rgSet)
mSetSw <- mSetSw[keep,]
```

# Extracting Beta and M-values

Now that the data has been `SWAN` normalised we can extract $\beta$ and
M-values from the object. We prefer to add an offset to the methylated
and unmethylated intensities when calculating M-values, hence we extract
the methylated and unmethylated channels separately and perform our own
calculation. For all subsequent analysis we use a random selection of
20000 CpGs to reduce computation time.

```{r extraction}
mset_reduced <- mSetSw[sample(1:nrow(mSetSw), 20000),]
meth <- getMeth(mset_reduced)
unmeth <- getUnmeth(mset_reduced)
Mval <- log2((meth + 100)/(unmeth + 100))
beta <- getBeta(mset_reduced)
dim(Mval)
```

```{r mdsplot,fig.height=5,fig.width=5,fig.cap="A multi-dimensional scaling (MDS) plot of cancer and normal samples."}
par(mfrow=c(1,1))
plotMDS(Mval, labels=targets$Sample_Name, col=as.integer(factor(targets$status)))
legend("topleft",legend=c("Cancer","Normal"),pch=16,cex=1.2,col=1:2)
```

An MDS plot (Figure 2) is a good sanity check to make sure
samples cluster together according to the main factor of interest, in
this case, cancer and normal.

# Testing for differential methylation using 

To test for differential methylation we use the `r BiocStyle::Biocpkg("limma")` 
package [@Smyth2005], which employs an empirical Bayes framework based on 
Guassian model theory. First we need to set up the design matrix. 
There are a number of
ways to do this, the most straightforward is directly from the targets
file. There are a number of variables, with the `status` column indicating
**cancer/normal** samples. From the `person` column of the targets file, we
see that the **cancer/normal** samples are matched, with 3 individuals each
contributing both a **cancer** and **normal** sample. Since the 
`r BiocStyle::Biocpkg("limma")` model framework can handle any experimental design which 
can be summarised by
a design matrix, we can take into account the paired nature of the data
in the analysis. For more complicated experimental designs, please refer
to the `r BiocStyle::Biocpkg("limma")` User's Guide.

```{r design}
group <- factor(targets$status,levels=c("normal","cancer"))
id <- factor(targets$person)
design <- model.matrix(~id + group)
design
```

Now we can test for differential methylation using the `lmFit` and `eBayes` 
functions from `r BiocStyle::Biocpkg("limma")`. As input data we use the matrix of 
M-values.

```{r diffmeth}
fit.reduced <- lmFit(Mval,design)
fit.reduced <- eBayes(fit.reduced)
```

The numbers of hyper-methylated (1) and hypo-methylated (-1) can be
displayed using the `decideTests` function in `r BiocStyle::Biocpkg("limma")` and the top 
10 differentially methylated CpGs for *cancer* versus *normal* extracted using 
`topTable`.

```{r diffmeth-results}
summary(decideTests(fit.reduced))
top<-topTable(fit.reduced,coef=4)
top
```

Note that since we performed our analysis on M-values, the `logFC` and
`AveExpr` columns are computed on the M-value scale. For interpretability
and visualisation we can look at the $\beta$ values. The beta values for
the top 4 differentially methylated CpGs shown in Figure 3.

```{r top4,fig.width=10,fig.height=10,fig.cap="The $\beta$ values for the top 4 differentially methylated CpGs."}
cpgs <- rownames(top)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgs[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgs[i],cex.main=1.5)
}
```

# Removing unwanted variation when testing for differential methylation

Like other platforms, 450k array studies are subject to unwanted
technical variation such as batch effects and other, often unknown,
sources of variation. The adverse effects of unwanted variation have
been extensively documented in gene expression array studies and have
been shown to be able to both reduce power to detect true differences
and to increase the number of false discoveries. As such, when it is
apparent that data is significantly affected by unwanted variation, it
is advisable to perform an adjustment to mitigate its effects.

`r BiocStyle::Biocpkg("missMethyl")` provides a `r BiocStyle::Biocpkg("limma")`-like interface to 
functions from the CRAN package `r BiocStyle::CRANpkg("ruv")` that
enables the removal of unwanted variation when performing a differential
analysis [@Maksimovic2015]. All of the methods rely on negative control features
to accurately estimate the components of unwanted variation. Negative
control features are probes/genes/etc. that are known *a priori* to not
truly be associated with the biological factor of interest, but are
affected by unwanted variation. For example, in a microarray gene
expression study, these could be house-keeping genes or a set of
spike-in controls. Negative control features are extensively discussed
in Gagnon-Bartsch and Speed [-@Gagnon-Bartsch2012] and Gagnon-Bartsch et al.
[-@Gagnon-Bartsch2013]. Once the unwanted factors are accurately estimated from
the data, they are adjusted for in the linear model that describes the
differential analysis.

If the negative control features are not known *a priori*, they can be
identified empirically. This can be achieved using a 2-stage approach,
**RUVm**, based on `RUV-inverse`. Stage 1 involves performing a
differential methylation analysis using `RUV-inverse` and the 613
Illumina negative controls (INCs) as negative control features. This
will produce a list of CpGs ranked by p-value according to their level
of association with the factor of interest. This list can then be used
to identify a set of empirical control probes (ECPs), which will capture
more of the unwanted variation than using the INCs alone. ECPs are
selected by designating a proportion of the CpGs least associated with
the factor of interest as negative control features; this can be done
based on either an FDR cut-off or by taking a fixed percentage of probes
from the bottom of the ranked list. Stage 2 involves performing a second
differential methylation analysis on the original data using
`RUV-inverse` and the ECPs. For simplicity, we are ignoring the paired
nature of the **cancer** and **normal** samples in this example.

```{r diffmeth2}
# get M-values for ALL probes
meth <- getMeth(mSet)
unmeth <- getUnmeth(mSet)
M <- log2((meth + 100)/(unmeth + 100))

grp <- factor(targets$status,levels=c("normal","cancer"))
des <- model.matrix(~grp)
des

INCs <- getINCs(rgSet)
head(INCs)

Mc <- rbind(M,INCs)
ctl <- rownames(Mc) %in% rownames(INCs)
table(ctl)

rfit1 <- RUVfit(data=Mc, design=des, coef=2, ctl=ctl) # Stage 1 analysis
rfit2 <- RUVadj(rfit1)
```

Now that we have performed an initial differential methylation analysis
to rank the CpGs with respect to their association with the factor of
interest, we can designate the CpGs that are least associated with the
factor of interest based on FDR-adjusted p-value as ECPs.

```{r ruv1}
top1 <- topRUV(rfit2, num=Inf)
head(top1)

ctl <- rownames(M) %in% rownames(top1[top1$p.ebayes.BH > 0.5,])
table(ctl)
```

We can then use the ECPs to perform a second differential methylation
with `RUV-inverse`, which is adjusted for the unwanted variation
estimated from the data.

```{r ruv2}
# Perform RUV adjustment and fit
rfit1 <- RUVfit(data=M, design=des, coef=2, ctl=ctl) # Stage 2 analysis
rfit2 <- RUVadj(rfit1)

# Look at table of top results
topRUV(rfit2)
```

Note, at present does not support contrasts, so only one factor of
interest can be interrogated at a time using a design matrix with an
intercept term.

# Testing for differential variability (DiffVar)

## Methylation data

Rather than testing for differences in mean methylation, we may be
interested in testing for differences between group variances. For
example, it has been hypothesised that highly variable CpGs in cancer
are important for tumour progression [@Hansen2011]. Hence we may be
interested in CpG sites that are consistently methylated in the normal
samples, but variably methylated in the cancer samples.

In general we recommend at least 10 samples in each group for accurate
variance estimation, however for the purpose of this vignette we perform
the analysis on 3 vs 3. In this example, we are interested in testing
for differential variability in the cancer versus normal group. Note
that when we specify the `coef` parameter, which corresponds to the
columns of the design matrix to be used for testing differential
variability, we need to specify both the intercept and the fourth
column. The ID variable is a nuisance parameter and not used when
obtaining the absolute deviations, however it can be included in the
linear modelling step. For methylation data, the function will take
either a matrix of M-values, $\beta$ values or a object as input. If
$\beta$ values are supplied, a logit transformation is performed. Note
that as a default, `varFit` uses the robust setting in the `r BiocStyle::Biocpkg("limma")` 
framework, which requires the use of the `r BiocStyle::CRANpkg("statmod")` package.

```{r diffvar}
fitvar <- varFit(Mval, design = design, coef = c(1,4))
```

The numbers of hyper-variable (1) and hypo-variable (-1) genes in **cancer**
vs **normal** can be obtained using `decideTests`.

```{r diffvar-results}
summary(decideTests(fitvar))
topDV <- topVar(fitvar, coef=4)
topDV
```

An alternate parameterisation of the design matrix that does not include
an intercept term can also be used, and specific contrasts tested with 
`contrasts.varFit`.
Here we specify the design matrix such that the first two columns
correspond to the **normal** and **cancer** groups, respectively.

```{r alternative}
design2 <- model.matrix(~0+group+id)
fitvar.contr <- varFit(Mval, design=design2, coef=c(1,2))
contr <- makeContrasts(groupcancer-groupnormal,levels=colnames(design2))
fitvar.contr <- contrasts.varFit(fitvar.contr,contrasts=contr)
```

The results are identical to before.

```{r altresults}
summary(decideTests(fitvar.contr))
topVar(fitvar.contr,coef=1)
```

The $\beta$ values for the top 4 differentially variable CpGs can be
seen in Figure 4.

```{r top4DV,fig.width=10,fig.height=10,fig.cap="The $\beta$ values for the top 4 differentially variable CpGs."}
cpgsDV <- rownames(topDV)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(beta[rownames(beta)==cpgsDV[i],]~design[,4],method="jitter",
group.names=c("Normal","Cancer"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(cpgsDV[i],cex.main=1.5)
}
```

## RNA-Seq expression data

Testing for differential variability in expression data is
straightforward if the technology is gene expression microarrays. The
matrix of expression values can be supplied directly to the `varFit` function.
For RNA-Seq data, the mean-variance relationship that occurs in count
data needs to be taken into account. In order to deal with this issue,
we apply a `voom` transformation [@Law2014] to obtain observation weights, which
are then used in the linear modelling step. For RNA-Seq data, the `varFit`
function will take a `DGElist` object as input.

To demonstrate this, we use data from the `r BiocStyle::Biocpkg("tweeDEseqCountData")` 
package. This data is part of the International HapMap project, consisting of 
RNA-Seq profiles from 69 unrelated Nigerian individuals [@Pickrell2010]. The only 
covariate is gender, so we can look at differentially variable expression between
males and females. We follow the code from the `r BiocStyle::Biocpkg("limma")` vignette to 
read in and process the data before testing for differential variability.

First we load up the data and extract the relevant information.

```{r loadingdata}
library(tweeDEseqCountData)
data(pickrell1)
counts<-exprs(pickrell1.eset)
dim(counts)
gender <- pickrell1.eset$gender
table(gender)
rm(pickrell1.eset)
data(genderGenes)
data(annotEnsembl63)
annot <- annotEnsembl63[,c("Symbol","Chr")]
rm(annotEnsembl63)
```

We now have the counts, gender of each sample and annotation (gene
symbol and chromosome) for each Ensemble gene. We can form a `DGElist` object
using the `r BiocStyle::Biocpkg("edgeR")` package.

```{r dgelist}
library(edgeR)
y <- DGEList(counts=counts, genes=annot[rownames(counts),])
```

We filter out lowly expressed genes by keeping genes with at least 1
count per million reads in at least 20 samples, as well as genes that
have defined annotation. Finally we perform scaling normalisation.

```{r dgelist-filtering}
isexpr <- rowSums(cpm(y)>1) >= 20
hasannot <- rowSums(is.na(y$genes))==0
y <- y[isexpr & hasannot,,keep.lib.sizes=FALSE]
dim(y)
y <- calcNormFactors(y)
```

We set up the design matrix and test for differential variability. In
this case there are no nuisance parameters, so `coef` does not need to
be explicitly specified.

```{r testhapmap}
design.hapmap <- model.matrix(~gender)
fitvar.hapmap <- varFit(y, design = design.hapmap)
fitvar.hapmap$genes <- y$genes
```

We can display the results of the test:

```{r resultshapmap}
summary(decideTests(fitvar.hapmap))
topDV.hapmap <- topVar(fitvar.hapmap,coef=ncol(design.hapmap))
topDV.hapmap
```

The log counts per million for the top 4 differentially variable genes
can be seen in Figure 5.

```{r top4DVhapmap,fig.width=10,fig.height=10,fig.cap="The log counts per million for the top 4 differentially variably expressed genes."}
genesDV <- rownames(topDV.hapmap)
par(mfrow=c(2,2))
for(i in 1:4){
stripchart(cpm(y,log=TRUE)[rownames(y)==genesDV[i],]~design.hapmap[,ncol(design.hapmap)],method="jitter",
group.names=c("Female","Male"),pch=16,cex=1.5,col=c(4,2),ylab="Log counts per million",
vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
title(genesDV[i],cex.main=1.5)
}
```

# Gene ontology analysis

Once a differential methylation or differential variability analysis has
been performed, it may be of interest to know which gene pathways are
targeted by the significant CpG sites. It is not entirely clear from the
literature how best to perform such an analysis, however Geeleher et al.
[@Geeleher2013] showed there is a severe bias when performing gene
ontology analysis with methylation data. This is due to the fact that
there are differing numbers of probes per gene on several different
array technologies. For the Illumina Infinium HumanMethylation450 array
the number of probes per gene ranges from 1 to 1299, with a median of 15
probes per gene. For the EPIC array, the range is 1 to 1487, with a
median of 20 probes per gene. This means that when mapping CpG sites to
genes, a gene is more likely to be *selected* if there are many CpG
sites associated with the gene.

One way to take into account this selection bias is to model the
relationship between the number of probes per gene and the probability
of being selected. This can be performed by adapting the 
`r BiocStyle::Biocpkg("goseq")` method of Young et al. [@Young2010]. Each gene then has a 
prior probability associated
with it, and a modified version of a hypergeometric test can be
performed, testing for over-representation of the selected genes in each
gene set.

The `gometh` function performs gene set testing on GO categories or KEGG pathways
[@Phipson2016]. The `gsameth` function is a more generalised gene set testing
function which can take as input a list of user specified gene sets.
Note that for `gsameth`, the format for the gene ids for each gene in the gene
set needs to be **Entrez Gene IDs**. For example, the entire curated gene
set list (C2) from the Broad's Molecular Signatures Database can be
specified as input. The R version of these lists can be downloaded from
[http://bioinf.wehi.edu.au/software/MSigDB/index.html](here). Both functions
take a vector of significant CpG probe names as input.

To illustrate how to use `gometh`, consider the results from the differential
methylation analysis with **RUVm**.

```{r gometh1}
topRUV(rfit2)
table(rfit2$p.ebayes.BH < 0.01)
```

At a 1% false discovery rate cut-off, there are still tens of thousands
of CpG sites differentially methylated. These will undoubtably map to
almost all the genes in the genome, making a gene ontology analysis
irrelevant. One option for selecting CpGs in this context is to apply
not only a false discovery rate cut-off, but also a $\Delta\beta$
cut-off. However, for this dataset, taking a relatively large
$\Delta\beta$ cut-off of 0.25 still leaves more than 30000 CpGs
differentially methylated.

```{r gometh2}
beta <- getBeta(mSet)
beta_norm <- rowMeans(beta[,des[,2]==0])
beta_can <- rowMeans(beta[,des[,2]==1])
Delta_beta <- beta_can - beta_norm
sigDM <- rfit2$p.ebayes.BH < 0.01 & abs(Delta_beta) > 0.25
table(sigDM)
```

Instead, we take the top 10000 CpG sites as input to `gometh`.

```{r gometh3}
topCpGs<-topRUV(rfit2,number=10000)
sigCpGs <- rownames(topCpGs)
sigCpGs[1:10]
```

The takes as input a character vector of CpG names, and optionally, a
character vector of all CpG sites tested. If the `all.cpg` argument is
omitted, all the CpGs on the array are used as background. To change the
array type, the `array.type` argument can be specified as either
"450K" or "EPIC". The default is "450K".

If the `plot.bias` argument is `TRUE`, a figure showing the relationship
between the probability of being selected and the number of probes per
gene will be displayed.

For testing of GO terms, the `collection` argument takes the value
"GO", which is the default setting. For KEGG pathway analysis, set
`collection` to "KEGG". The function `topGO` shows the top enriched GO
categories. For KEGG testing, use the `topKEGG` function. The functions 
`r BiocStyle::Biocpkg("goana")` and `r BiocStyle::Biocpkg("kegga")` are
called by for GO and KEGG pathway analysis respectively.

For GO testing on our example dataset:

```{r gometh4}
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
gst <- gometh(sig.cpg=sigCpGs, all.cpg=rownames(rfit2), collection="GO")
topGO(gst)
```

For a more generalised version of gene set testing in methylation data
where the user can specify the gene set to be tested, the function `gsameth` can
be used. To display the top 20 pathways, `topGSA` can be called. `gsameth` can 
take a single gene set, or a list of gene sets. The gene identifiers in the
gene set must be **Entrez Gene IDs**. To demonstrate `gsameth`, a toy example is 
shown below, with gene sets made up of randomly selected genes from the
`r BiocStyle::Biocpkg("org.Hs.eg.db")` package.

```{r gsameth}
library(org.Hs.eg.db)
genes <- toTable(org.Hs.egSYMBOL2EG)
set1 <- sample(genes$gene_id,size=80)
set2 <- sample(genes$gene_id,size=100)
set3 <- sample(genes$gene_id,size=30)
genesets <- list(set1,set2,set3)
gsa <- gsameth(sig.cpg=sigCpGs, all.cpg=rownames(rfit2), collection=genesets)
topGSA(gsa)
```

Note that if it is of interest to obtain the **Entrez Gene IDs** that the
significant CpGs are mapped to, the `getMappedEntrezIDs` can be called.

# Session information

```{r sessionInfo, eval=TRUE, results='asis'}
sessionInfo()
```

# References