---
title: "Basic Usage & Preprocessing"
date: "`r BiocStyle::doc_date()`"
package: sesame
output: BiocStyle::html_document
fig_width: 6
fig_height: 5
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{"0. Basic Usage"}
  %\VignetteEncoding{UTF-8}
---

```{r message=FALSE, warning=FALSE, include=FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Preparation

> As sesame and sesameData are under active development, this documentation is
  specific to the following version of R, sesame, sesameData and ExperimentHub:

```{r message=FALSE, warning=FALSE}
library(sesame)
sesame_checkVersion()
```

We recommend updating your R, ExperimentHub, sesameData and sesame to use this
documentation consistently. If you have installed directly from github, please
make sure the compatible ExperimentHub is installed.

If you use a previous version, please checkout the vignette that corresponds to
the right version here
[https://zhou-lab.github.io/sesame/dev/supplemental.html#Versions](https://zhou-lab.github.io/sesame/dev/supplemental.html#Versions)

**CRITICAL:** After a new installation, one must cache the associated
annotation data using the following command. This needs to be done only once
per SeSAMe installation/update. Caching data to local guarantees proper data
retrieval and saves internet traffic.

```{r message=FALSE}
sesameDataCache()
```

This function caches the needed SeSAMe annotations.
SeSAMe annotation data is managed by the
[sesameData](https://tinyurl.com/58ny3rrt) package which uses the
[ExperimentHub](https://tinyurl.com/2p873tez) infrastructure.  You can find the
location of the cached annotation data on your local computer using:

```{r}
tools::R_user_dir("ExperimentHub", which="cache")
```

# The openSesame Pipeline

The `openSesame` function provides end-to-end processing that converts IDATs to
DNA methylation level (aka $\beta$ value) matrices in R. The function can take
either one of the following input:

- A path to a directory where the IDAT files live
- Specific path(s) of IDAT file prefix, one for each IDAT pair
- One or a list of SigDF objects

The following code uses a directory that contains built-in two HM27 IDAT pairs
to demonstrates the use of `openSesame`:

```{r base1, eval=FALSE}
## The following use the idat_dir in system.file,
## Replace it with your actual IDAT path
idat_dir = system.file("extdata/", package = "sesameData")
betas = openSesame(idat_dir, BPPARAM = BiocParallel::MulticoreParam(2))
```

The `BPPARAM=` option is from the `BiocParallel` package and controls parallel
processing (in this case, we are using two cores). Under the hood, the function
performs a series of tasks including: searching IDAT files from the directory
(the `searchIDATprefixes` function), reading IDAT data in as `SigDF` objects
(the `readIDATpair` function), preprocessing the signals (the `prepSesame`
function), and finally converting them to DNA methylation levels ($\beta$
values, the `getBetas` function). Alternatively, one can run the following
command to get the same results, while gaining more refined control:

```{r base2, eval=FALSE}
##  The above openSesame call is equivalent to:
betas = do.call(cbind, BiocParallel::bplapply(
    searchIDATprefixes(idat_dir), function(pfx) {
        getBetas(prepSesame(readIDATpair(pfx), "QCDPB"))
}, BPPARAM = BiocParallel::MulticoreParam(2)))

## or even more explicitly (if one needs to control argument passed
## to a specific preprocessing function)
betas = do.call(cbind, BiocParallel::bplapply(
    searchIDATprefixes(idat_dir), function(pfx) {
        getBetas(noob(pOOBAH(dyeBiasNL(inferInfiniumIChannel(qualityMask(
            readIDATpair(pfx)))))))
}, BPPARAM = BiocParallel::MulticoreParam(2)))
```

The `openSesame` function is highly customizable. The `prep=` argument is the
same argument one gives to the `prepSesame` function (see [Data Preprocessing]
for detail) which `openSesame` calls internally. The argument uniquely
specifies a preprocessing procedure. The `func=` option specifies the signal
extraction function. It can be either be `getBetas` (DNA methylation) or
`getAFs` (allele frequencies of SNP probes) or NULL (returns `SigDF`). The
`manifest=` option allows one to provide an array manifest when handling data
from platform not supported natively. Finally, the `BPPARAM=` argument is the
same argument taken by `BiocParallel::bplapply` to allow parallel processing.
See [Supplemental
Vignette](https://zhou-lab.github.io/sesame/v1.18/supplemental.html#openSesame)
for details of these component functions of openSesame.

The output of `openSesame` can also be customized. It can either be beta
values, which are the end DNA methylation readings, as shown above. It can also
be a list of `SigDF`s which stores the signal intensities and can be further
put back to openSesame for more processing. The `openSesame(func=)` argument
specifies whether the output is a SigDF list or beta values. The following
shows some usage:

```{r base12, eval=FALSE}
betas = openSesame(idat_dir, func = getBetas) # getBetas is the default
sdfs = openSesame(idat_dir, func = NULL) # return SigDF list
allele_freqs = openSesame(idat_dir, func = getAFs) # SNP allele frequencies
sdfs = openSesame(sdfs, prep = "Q", func = NULL)   # take and return SigDFs
```

# Data Preprocessing {#dataprep}

The `prep=` argument instructs the `openSesame` function to call the
`prepSesame` function to preprocess signal intensity under the hood. This can
be skipped by using `prep=""`. The `prepSesame` function takes a single `SigDF`
as input and returns a processed `SigDF`. When `prep=` is non-empty, it selects
the preprocessing functions (see [Preprocessing Function Code](#prep)) and
specifies the order of their execution. For example,

```{r base9}
sdf = sesameDataGet('EPIC.1.SigDF')
sdf_preped = openSesame(sdf, prep="DB", func=NULL)
```

performs dye bias correction (`D`) followed by background subtraction (`B`). In
other words, `prepSesame(sdf, "DB")` is equivalent to
`noob(dyeBiasNL(sdf))`. All the preprocessing functions take a `SigDF` as input
and return an updated `SigDF`. Therefore, these functions can be chained
together. The choice of preprocessing functions and the order of their chaining
is important (see [Supplemental
Vignette](https://zhou-lab.github.io/sesame/v1.18/supplemental.html#prepfuns))
for detailed discussions of these functions). The following table lists the
best preprocessing strategy based on our experience.

```{r base10, echo=FALSE, result="asis"}
library(knitr)
df <- data.frame(rbind(
   c("EPICv2/EPIC/HM450", "human", "QCDPB"),
   c("EPICv2/EPIC/HM450", "non-human organism", "SQCDPB"),
   c("MM285", "mouse", "TQCDPB"),
   c("MM285", "non-mouse organism", "SQCDPB"),
   c("Mammal40", "human", "HCDPB"),
   c("Mammal40", "non-human organism", "SHCDPB")))
colnames(df) <- c("Platform", "Sample Organism", "Prep Code")
kable(df, caption="Recommended Preprocessing")
```

The optimal strategy of preprocessing depends on:

1) **The array platform**. For example, certain array platforms (e.g., the
Mammal40) do not have enough Infinium-I probes for background estimation and
dye bias correction, therefore background subtraction (where the out-of-band
signals are from) might not work most optimally;

2) **The expected sample property**. For example, some samples have the
signature bimodal distribution of methylation of most mammalian cells. Others
may undergo global loss of methylation (germ cells, tumors etc). Other
important factors include high-input vs low-input, tumor vs normal, somatic vs
germ cells, human vs model organisms, mouse strains etc. Some platforms (e.g.,
Mammal40 and MM285) are designed for multiple species and strains. Therefore
`S` and `T` would be important when those arrays are used on non-reference
organisms (see [Working with Nonhuman Arrays](nonhuman.html)).

# Preprocessing Function Code {#prep}

The `prepSesameList` function lists all the available codes and the associated
preprocessing functions.

```{r base11}
prepSesameList()
```

Here are some consideration when determining the preprocessing order. Species
`(S)` and strain `(T)` inference resets the mask and color channels based on
probe alignment and presence of genetic variants. Therefore when they are used,
they need to be called first. `Q` masks non-uniquely mapped probes which may
inflate the out-of-band signal for background estimation. Therefore `Q` should
be used before detection p-value calculation (`P`) and background subtraction
(`B`) when necessary. Channel inference (`C`) and dye bias correction (`D`)
should take place early since dye bias effect is global. `C` should be placed
before `D` because dye bias correction uses in-band signal the identification
of which relies on correct channel designation. Detection p-value (`P`) should
happen before background subtraction (`B`) since background subtraction
modifies signal and may affect out-of-band signal assumption used in
`P`. Lastly, functions that explicitly normalizes $\beta$ value distribution
(`M`) should happen last if they even need to be used.

See [Supplemental
Vignette](https://zhou-lab.github.io/sesame/v1.18/supplemental.html#prepfuns)
for details of preprocessing functions.

# Session Info

```{r}
sessionInfo()
```