---
title: "Modeling"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: BiocStyle::html_document
fig_width: 6
fig_height: 5
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{3. Modeling}
  %\VignetteEncoding{UTF-8}
---

# Differential Methylation

Here we demonstrate the analysis of DNA methylation dependent on one or more
predictors. The predictors might be tissue, cell type, sex, age, tumor/normal,
case/control or a combination of these factors.  The `DML` (Differential
Methylation Locus) function models $\beta$ values (DNA methylation levels)
using mixed linear models. This general supervised learning framework
identifies CpG loci whose differential methylation is associated with known
co-variates and can be used to perform epigenome-wide association studies
(EWAS). Let's first load the needed packages:

```{r model0, include=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

```{r model1, message=FALSE, warning=FALSE}
library(sesame)
library(SummarizedExperiment)
sesameDataCache() # required at new sesame installation
```

In the following, we will use an MM285 dataset of 10 mouse samples.  This
dataset contains mouse samples from different tissues and mice of different
ages and sexes.  The dataset is stored in a `SummarizedExperiment` object,
which contains a data matrix combined with column-wise metadata accessible with
`colData`:

```{r model2, message=FALSE}
se = sesameDataGet("MM285.10.SE.tissue")[1:1000,] # an arbitrary 1000 CpGs
cd = as.data.frame(colData(se)); rownames(cd) = NULL
cd
```

**CRITICAL:** If your data contains `NA`, it is required that you exclude CpGs
with missing levels. For example, one cannot assess sex-specific DNA
methylation for a CpG that only has non-NA measurement on one sex. Exclusion
of such CpGs for differential methylation modeling can be done using the
`checkLevels` function. Here, we will check this for both sex and tissue:

```{r model3}
se_ok = (checkLevels(assay(se), colData(se)$sex) &
    checkLevels(assay(se), colData(se)$tissue))
sum(se_ok)                      # the number of CpGs that passes
se = se[se_ok,]
```

**NOTE:** If your model include discrete contrast variables like tissue and
sex as in the current example, you should consider explicitly turning it into a
factor variable with a reference level (we use `treatment coding`, see
[different coding systems](
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-
for-categorical-variables/)).

For example, to use `Colon` as the
reference tissue and `Female` as the reference sex, one can do the following

```{r model4}
colData(se)$tissue <- relevel(factor(colData(se)$tissue), "Colon")
colData(se)$sex <- relevel(factor(colData(se)$sex), "Female")
```

Then we will model DNA methylation variation treating tissue and sex as
covariates. To do that we will call the `DML` function and specify the R
formula `~tissue + sex`. This function fits DNA methylation reading to a linear
model and perform the corresponding slope test and goodness-of-fit test (F-test
holding out each contrast variable). See also
[formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html)
for how to specify lm/glm-like symbolic form for regression. All these results
are returned in an object of class `DMLSummary`:

```{r model5}
smry = DML(se, ~tissue + sex)
smry
```

You can use `DML(..., BPPARAM=BiocParallel::MulticoreParam(8))` argument to
parallelize the computing.

# Test Interpretation

The `DMLSummary` object is a list of slightly expanded `summary.lm` objects as
is typically returned by `summary(lm())`. The `summaryExtractTest` function
extracts some key test statistics from the `DMLSummary` object and store them
in a data frame. Rows of the data frame correspond to CpGs/loci and columns
contain the slopes and p-values of each variable.

```{r model6}
test_result = summaryExtractTest(smry)
colnames(test_result) # the column names, show four groups of statistics
head(test_result)
```

With the exception of the `Intercept`, there are four groups of columns, each
starting with "Est_", "Pval_", "FPval_", and "Eff_" as prefix. Here are what
they represent:

* **Est_\*** : The slope estimate (aka the $\beta$ coefficient, not to be
confused with the DNA methylation $\beta$-value though) for continuous
variable. DNA methylation difference of the current level with respect to the
reference level for nominal contrast variables. Each suffix is concatenated
from the contrast variable name (e.g., tissue, sex) and the level name if the
contrast variable is discrete (e.g, Cecum, Esophagus, Fat).  For example,
`Est_tissueFat` should be interpreted as the estimated methylation level
difference of Fat compared to the reference tissue (which is `Colon`, as set
above). If reference is not set, the first level in the alphabetic order is
used as the reference level.  There is a special column named
``Est_`(Intercept)` ``.  It corresponds to the base-level methylation of the
reference (in this case a Female Colon sample).

* **Pval_\*** : The unadjusted p-values of t-testing the slope. This
represents the statistical significance of the methylation difference. For
example, `Pval_tissueFat` tests whether `Fat` is significantly different from
`Colon` (the reference level) in DNA methylation. The ``Pval_`(Intercept)` ``
tests whether the reference level is significantly different from zero.

* **FPval_\*** : The unadjusted p-value of the F-test contrasting the full
model against a reduced model with the labeled contrast variable held out. Note
that "Pval_" and "FPval_" are equivalent when the contrast variable is a
2-level factor, i.e., in the case of a pairwise comparison.

* **Eff_\*** : The effect size of each normial contrast variable. This is
equivalent to the maximum slope subtracted by the minimum level including the
reference level (0).

Multiple-testing adjustment can be done afterwards using R's `p.adjust`
function. It is integrated to the `DMR` function by default (see below).

# Goodness of Fit

One may want to ask a question like

> Is the CpG methylation tissue-specific?

rather than

> Is the CpG more methylated in fat compared to liver?

The first question ask about the contrast variable as a whole while the second
question concerns only a specific level in the contrast variable. To answer
this question, we can use an F-test contasting the full model with a reduced
model with the target contrast held out. By default, this statistics is already
computed in the `DML` function. The test result is recorded in the **FPval_**
columns. For example, to get all CpGs that are methylated specific to sex,

```{r model7, message = FALSE}
library(dplyr)
library(tidyr)
test_result %>% dplyr::filter(FPval_sex < 0.05, Eff_sex > 0.1) %>%
    select(FPval_sex, Eff_sex)
```

Here we used 0.1 as the effect size threshold. This means DNA methylation
difference under 0.1 (10%) is considered not biologically meaningful. This can
be a valid assumption for homogenous cell composition as most cells would be
either biallelically methylated, unmethylated or monoallelically
methylated. But different threshold can be used in different analysis
scenarios.

We can define CpG methylation as sex-specific, tissue-specific or both, by:

```{r model8}
test_result %>%
    mutate(sex_specific =
        ifelse(FPval_sex < 0.05 & Eff_sex > 0.1, TRUE, FALSE)) %>%
    mutate(tissue_specific =
        ifelse(FPval_tissue < 0.05 & Eff_tissue > 0.1, TRUE, FALSE)) %>%
    select(sex_specific, tissue_specific) %>% table
```

As you can see from the result, some probes are sex-specific and others are
tissue-specific. There is no overlap between probes whose methylation reading
is differential along both contrasts.

# Pairwise Comparison

From the test result, we can also ask whether the DNA methylation is different
between two sexes or between two specific tissues. For example, `Est_sexMale`
compares male from females. The following code creates a volcano plot.

```{r model9}
library(ggplot2)
ggplot(test_result) + geom_point(aes(Est_sexMale, -log10(Pval_sexMale)))
```

Likewise, we can ask whether DNA methylation might be different between fat
and colon. We can do

```{r model10}
ggplot(test_result) + geom_point(aes(Est_tissueFat, -log10(Pval_tissueFat)))
```

# Continuous Predictors

The variable tested in the `DML` function can be continuous.
Suppose we are interested in `age` besides `sex`. We will call the
same function but with the following formula:

```{r model11}
smry2 = DML(se, ~ age + sex)
test_result2 = summaryExtractTest(smry2) %>% arrange(Est_age)
```

Let's verify the CpGs positively associated with age.

```{r model12}
test_result2 %>% dplyr::select(Probe_ID, Est_age, Pval_age) %>% tail
df = data.frame(Age = colData(se)$age,
    BetaValue = assay(se)[test_result2$Probe_ID[nrow(test_result2)],])
ggplot(df, aes(Age, BetaValue)) + geom_smooth(method="lm") + geom_point()
```

# DMR

For a given contrast, one can merge neighboring CpGs that show consistent
methylation variation into differentially methylated regions (DMRs).
For example, we can merge sex-specific differential methylation identified
above to chromosome X regions that show X-inactivation-related methylation
difference. To do this, we need to pick a contrast:

```{r model13, eval=TRUE}
dmContrasts(smry)                       # pick a contrast from below
merged = DMR(se, smry, "sexMale", platform="MM285") # merge CpGs to regions
merged %>% dplyr::filter(Seg_Pval_adj < 0.01)
```

See [Supplemental
Vignette](https://zhou-lab.github.io/sesame/v1.16/supplemental.html#track) for
track-view visualization of the data.

# Session Info

```{r}
sessionInfo()
```