---
title: "SingleMoleculeFootprinting"
output: 
  html_document:
    toc: true
    toc_float: true
    toc_depth: 4
    highlight: tango
vignette: >
  %\VignetteIndexEntry{SingleMoleculeFootprinting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  # comment = "#>", 
  tidy = FALSE, 
  cache = FALSE, 
  results = 'markup'
)
```

## Introduction
*SingleMoleculeFootprinting* is an R package providing functions to analyze Single Molecule Footprinting (SMF) data. Following the workflow exemplified in this vignette, the user will be able to perform basic data analysis of SMF data with minimal coding effort.

Starting from an aligned bam file, we show how to

* perform quality controls over sequencing libraries
* extract methylation information at the single molecule level accounting for the two possible kind of SMF experiments (single enzyme or double enzyme)
* classify single molecules based on their patterns of molecular occupancy
* plot SMF information at a given genomic location 

For installation, the user can use the following:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SingleMoleculeFootprinting")
```

For compatibility with our analysis tools, we recommend the user to perform genomic alignments using the [`qAlign`](https://www.rdocumentation.org/packages/QuasR/versions/1.12.0/topics/qAlign) function from QuasR as exemplified in the chunk below.
```{r, eval=FALSE}
library(parallel)
library(QuasR)
library(BSgenome.Mmusculus.UCSC.mm10) 

cl <- makeCluster(40)
prj <- QuasR::qAlign(sampleFile = Qinput,
                     genome = "BSgenome.Mmusculus.UCSC.mm10",
                     aligner = "Rbowtie",
                     projectName = "prj", 
                     paired = "fr",
                     bisulfite = "undir", 
                     alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
                     alignmentsDir = "./", 
                     cacheDir = tempdir(),
                     clObj = cl)
```

## Loading
```{r setup, message=FALSE, eval=TRUE}
library(BSgenome.Mmusculus.UCSC.mm10)
library(SingleMoleculeFootprinting)
library(SingleMoleculeFootprintingData)
library(parallel)
```

## Define arguments
*SingleMoleculeFootprinting* inherits *QuasR*'s philosophy of working with pointer files. Briefly, a pointer  is a tab-delimited file with two or three fields indicating the location of the input data files. For more details, please check the `qAlign` [documentation](https://www.rdocumentation.org/packages/QuasR/versions/1.12.0/topics/qAlign). Here we show how our pointer file looks like.

N.b.: The execution of the example code in this vignette, as well as some functions of this software package, depend on accessory data that can be downloaded and cached through the data package *SingleMoleculeFootprintingData*. Under the hood, we download this data for the user and create the QuasR sampleSheet file for the example data at `tempdir()` under the name `NRF1Pair_Qinput.txt`.
We just ask the user to make sure that their default cache directory has enough storage capacity (~1 Gb). The user can check this via `ExperimentHub::getExperimentHubOption(arg = "CACHE")` and eventually change this value via `ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location")`.
```{r, include=FALSE}
# Under the hood:
#       download and cache SingleMoleculeFootprintingData data
#       locate bam file and write QuasR sampleSheet

# Download and cache
CachedBamPath = SingleMoleculeFootprintingData::NRF1pair.bam()[[1]]
CachedBaiPath = SingleMoleculeFootprintingData::NRF1pair.bam.bai()[[1]]
SingleMoleculeFootprintingData::AllCs.rds()
SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
SingleMoleculeFootprintingData::ReferenceMethylation.rds()

# Create copy of bam and bai files with relevant names
CacheDir <- ExperimentHub::getExperimentHubOption(arg = "CACHE")
file.copy(from = CachedBamPath, to = paste0(CacheDir, "/", "NRF1pair.bam"))
file.copy(from = CachedBaiPath, to = paste0(CacheDir, "/", "NRF1pair.bam.bai"))

# Write QuasR input
data.frame(
  FileName = paste0(CacheDir, "/", "NRF1pair.bam"),
  SampleName = "NRF1pair_DE_"
  ) -> df
readr::write_delim(df, paste0(CacheDir, "/NRF1Pair_Qinput.txt"), delim = "\t")
```

```{r, eval=TRUE}
Qinput = paste0(CacheDir, "/NRF1Pair_Qinput.txt")
suppressMessages(readr::read_delim(Qinput, delim = "\t"))
```

## Library QCs
Before investing in a deep sequencing run for a SMF experiment, it is advisable to first perform a shallow sequencing run and to perform quality controls on the sequencing libraries.

### QuasR QC report
It is always a good idea to examine canonical quality measures after aligning. We advice the user to employ the [`qQCreport`](https://www.rdocumentation.org/packages/QuasR/versions/1.12.0/topics/qQCReport) function from *QuasR*.

### Bait capture efficiency
If SMF was performed on an large genome (e.g Mouse) it is possible that bait capture was performed to focus the sequencing efforts: here we check how efficient that process was by essentially computing the ratio of genomic alignments inside bait regions over the total ones.
```{r, eval=TRUE}
BaitRegions <- SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
clObj = makeCluster(5)
BaitCaptureEfficiency <- BaitCapture(sampleSheet = Qinput, 
                                     genome = BSgenome.Mmusculus.UCSC.mm10, 
                                     baits = BaitRegions, clObj = clObj)
stopCluster(clObj)
BaitCaptureEfficiency
```
In this case the capture efficiency equals to `r toString(BaitCaptureEfficiency)` because the example data was purposefully subset for an interesting region which falls entirely within baits. Under normal circumstances, one expects this value to be lower than 1 for the mouse genome for example we observe values around 0.7.

### Conversion rate precision
Here we ask how much of the observed Cytosine methylation falls in the expected contexts (*CpG* or *GpC*). Beware, this is a much slower computation than the above.

```{r conversion-rate-precision, eval=FALSE, echo = TRUE}
ConversionRateValue <- ConversionRate(sampleSheet = Qinput, 
                                          genome = BSgenome.Mmusculus.UCSC.mm10, 
                                          chr = 6)
```

```{r load-conversion-rate-precision, eval=TRUE, echo = FALSE}
## the code above takes > 10 minutes to run, so we load a pre-computed value here
ConversionRateValue <- readRDS(system.file("extdata",
                                           "vignette_ConversionRatePrecision.rds",
                                           package = "SingleMoleculeFootprinting"))
```

For this sample, the observed conversion rate is `r toString(ConversionRateValue)`%. This value happens to be slightly below the expected value of ~95%

## Analysis of single site
### Methylation extraction
Methylation values at the single molecule level can be extracted for the region of interest from aligned data using the `CallContextMethylation` function. 
Under the hood, the function performs the following operations: 

* Identification of the methylation status for the Cytosines in the genomic context relevant for the experiment (Single enzyme, double enzyme, etc.). The type of the experiment is inferred based on which user-provided substring is found in the SampleName field of the QuasR pointer file: 
```{r, echo = FALSE}
knitr::kable(data.frame(ExperimentType = c("Single enzyme SMF", "Double enzyme SMF", "No enzyme (endogenous mCpG only)"), 
                        substring = c("\\_NO_", "\\_DE_", "\\_SS_"), 
                        RelevanContext = c("DGCHN & NWCGW", "GCH + HCG", "CG"), 
                        Notes = c("Methylation info is reported separately for each context", "Methylation information is aggregated across the contexts", "-")))
```

* Filtering reads based on conversion rate
* Collapsing strands to "-"
* Filtering Cytosines for coverage
```{r, eval=TRUE}
MySample <- suppressMessages(readr::read_delim(Qinput, delim = "\t")[[2]])
Region_of_interest <- GRanges(seqnames = "chr6", 
                              ranges = IRanges(start = 88106000, 
                                               end = 88106500), 
                              strand = "*")

Methylation <- CallContextMethylation(sampleSheet = Qinput, 
                                      sample = MySample, 
                                      genome = BSgenome.Mmusculus.UCSC.mm10, 
                                      range = Region_of_interest, 
                                      coverage = 20, 
                                      ConvRate.thr = 0.2)
Methylation[[1]]
Methylation[[2]][1:10, 1:10]
```
The following chunk has the sole scope of downsampling reads to make the vignette lighter. The user should skip this.

```{r}
n = 1000
readsSubset <- sample(seq(nrow(Methylation[[2]])), n)
Methylation[[2]] <- Methylation[[2]][readsSubset,]
```


### Plotting single sites
Already at this stage it is possible to visually inspect results. To that end, we provide the function `PlotAvgSMF` to plot the average SMF signal, defined as 1 - **average methylation**, at a genomic locus of interest. Optionally, the user can pass a GRanges object of genomic coordinates for the transcription factor binding sites (**TFBSs**) of interest to include in the plot, we show an example of such an object. 
```{r, eval=TRUE}
TFBSs <- GenomicRanges::GRanges("chr6", 
                               IRanges(c(88106216, 88106253),
                                       c(88106226, 88106263)), 
                               strand = "-")
elementMetadata(TFBSs)$name <- c("NRF1", "NRF1")
names(TFBSs) <- c(paste0("TFBS_", c(4305215, 4305216)))
TFBSs

PlotAvgSMF(MethGR = Methylation[[1]], 
           range = Region_of_interest, 
           TFBSs = TFBSs)
```

Furthermore, the function `PlotSM` can be used to plot the single molecule SMF information at a given site.
```{r}
PlotSM(MethSM = Methylation[[2]], 
       range = Region_of_interest)
```

Optionally, hierarchical clustering can be performed by setting the parameter `SortedReads = "HC"`. This can be useful to aggregate reads visually in order to spot PCR artifacts. N.b. reads will be downsampled to 500. 
```{r, eval=TRUE}
PlotSM(MethSM = Methylation[[2]], 
       range = Region_of_interest, 
       SortedReads = "HC")
```

### Single Molecule Sorting
Ultimately though, we want to classify reads based on their patterns of molecular occupancy. To that end we provide the functions `SortReadsBySingleTF` and `SortReadsByTFCluster` to classify reads based either on the occupancy patterns of one or multiple transcription factors. 

Under the hood, the classification is based on the definition of $n+2$ bins (with $n$ being the number of TFs). The $n$ bins are each centered around one of the TFBSs of interest, while the 2 extra bins are located upstream and downstream of the two outmost TFBSs. 

For `SortReadsBySingleTF`, the coordinates of the bins relative to the center of the TFBS are [-35;-25], [-15;+15], [+25,+35]. Instead, the function `SortReadsByTFCluster` draws a bin with coordinates [-7;+7] around the center of each TFBS, a bin with coordinates [-35;-25] relative to center of the most upstream TFBS and a bin with coordinates [+25,+35] relative to the center of the most downstream TFBS. The user can also employ custom coordinates by specifying them explicitly using the function `SortReads`. 

For each read, the binary methylation status of the cytosines contained in each bin is averaged to either a 0 or a 1 such that each read is eventually represented as sequence of $n+2$ binary digits, for a total of $2^{(n+2)}$ possible sequences. 

Here, we show a usage case for the `SortReadsByTFCluster` function, as we have already identified the double binding of NRF1 at the genomic site under scrutiny. Usage and exploration of the output is identical for the other function, except for the the format of the *TFBSs* argument which should consist of a GRanges object of length 1 for `SortReadsBySingleTF` and of length $>$ 1 for `SortReadsByTFCluster`.
```{r, eval=TRUE}
SortedReads_TFCluster <- SortReadsByTFCluster(MethSM = Methylation[[2]], 
                                             TFBSs = TFBSs)
```

```{r, echo=FALSE}
message(paste0("Number of retrieved states: ", 
               as.character(length(SortedReads_TFCluster))))
message("States population:")
unlist(lapply(SortedReads_TFCluster, length))
```
N.b. non-populated states are not returned. 

The output of each of these sorting functions can be passed directly as the `SortedReads` argument of the `PlotSM` function.
```{r}
PlotSM(MethSM = Methylation[[2]], 
       range = Region_of_interest, 
       SortedReads = SortedReads_TFCluster)
```

N.b. despite sorting reads by a TF cluster is in principle possible with clusters of any size, as of now the `PlotSM` function can only deal with TF pairs.


In order to be quantitative about these observations, the user can employ the `StateQuantificationPlot`. The function outputs a bar plot annotated with the percentage of reads found in each state. The function takes, as argument, the output of either of the two sorting functions.
```{r}
StateQuantificationPlot(SortedReads = SortedReads_TFCluster)
```

Finally, we provide the wrapper function `PlotSingleSiteSMF` to plot at once the three kinds of information detailed above and to export results as a pdf.
```{r, eval=TRUE}
PlotSingleSiteSMF(ContextMethylation = Methylation, 
                  sample = MySample, 
                  range = Region_of_interest, 
                  SortedReads = SortedReads_TFCluster, 
                  TFBSs = TFBSs, 
                  saveAs = NULL)
```

## Session information
This vignette was produced under:
```{r sessionInfo, echo=FALSE}
sessionInfo()
```