---
title: "HiLDA: a package for testing the burdens of mutational signatures"
author: 
- name: Zhi Yang
  affiliation: Department of Preventive Medicine, 
    University of Southern California, Los Angeles, USA 
  email: zyang895@gmail.com
date: "`r Sys.Date()`"
vignette: |
  %\VignetteIndexEntry{An introduction to HiLDA}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
abstract: | 
  Instructions on using _HiLDA_ on testing the burdens of mutational 
  signatures. 

---

```{r style, echo = FALSE, results = 'asis'}
library(BiocStyle)
```

# Introduction

The R package **HiLDA** is developed under the Bayesian framework to allow 
statistically testing whether there is a change in the mutation burdens of 
mutation signatures between two groups. The mutation signature is defined based 
on the independent model proposed by Shiraishi's et al. 

## Paper

- Shiraishi et al. A simple model-based approach to inferring and visualizing 
cancer mutation signatures, bioRxiv, doi: 
[http://dx.doi.org/10.1101/019901](http://dx.doi.org/10.1101/019901).

- **Zhi Yang**, Priyatama Pandey, Darryl Shibata, David V. Conti, 
Paul Marjoram, Kimberly D. Siegmund. HiLDA: a statistical approach to 
investigate differences in mutational signatures, bioRxiv, 
doi: https://doi.org/10.1101/577452 

# Installing and loading the package {#installation}

## Installation

### Bioconductor

**HiLDA** requires several CRAN and Bioconductor R packages to be
installed. Dependencies are usually handled automatically, when installing the
package using the following commands:

```
install.packages("BiocManager")
BiocManager::install("HiLDA")
```
[NOTE: Ignore the first line if you already have installed the
`r CRANpkg("BiocManager")`.]

You can also download the newest version from the GitHub using *devtools*:
```
devtools::install_github("USCbiostats/HiLDA")
```

### Just Another Gibbs Sampler (JAGS)

In order to run HiLDA, one also needs to install an external program called Just
Another Gibbs Sampler, JAGS, downloaded from this website 
http://mcmc-jags.sourceforge.net/. For more details, please follow the INSTALL 
file to install the program. 


# Input data

`HiLDA` is a package built on some basic functions from `pmsignature` including 
how to read the input data. Here is an example from `pmsignature` on the input 
data, *mutation features* are elements used for categorizing mutations such as: 
  
* 6 substitutions (C>A, C>G, C>T, T>A, T>C and T>G)
* 2 flanking bases (A, C, G and T)
* transcription direction.

## Mutation Position Format

    sample1 chr1  100	A	C	
    sample1	chr1	200	A	T	
    sample1	chr2	100	G	T	
    sample2	chr1	300	T	C	
    sample3	chr3	400	T	C	
  
* The 1st column shows the name of samples 
* The 2nd column shows the name of chromosome 
* The 3rd column shows the coordinate in the chromosome
* The 4th column shows the reference base (A, C, G, or T).
* The 5th colum shows the alternate base (A, C, G, or T).


# Workflow 
## Get input data
Here, *inputFile* is the path for the input file. *numBases* is the number of 
flanking bases to consider including the central base (if you want to consider 
two 5' and 3' bases, then set 5). Also, you can add transcription direction 
information using *trDir*. *numSig* sets the number of mutation signatures 
estimated from the input data. You will see a warning message on some mutations
are being removed. 

```{r}
library(HiLDA)
inputFile <- system.file("extdata/esophageal.mp.txt.gz", package="HiLDA")
G <- hildaReadMPFile(inputFile, numBases=5, trDir=TRUE)
```

Also, we also provided a small simulated dataset which contains 10 mutational 
catalogs andused it for demonstrating the key functions in HiLDA. We start with 
loading the sample dataset G stored as extdata/sample.rdata.
```{r}
load(system.file("extdata/sample.rdata", package = "HiLDA"))
class(G)
```

If you'd like to use the USC data in the manuscript, please download the data 
from the OSF home page https://osf.io/a8dzx/

## Run tests from `HiLDA`
### Perform the global test and the local test 
After we read in the sample data G, we can run the local and the global tests 
from HiLDA. Here, we specify the *inputG* as *G*, the number of mutational 
signatures to be three, the indices for the reference group to be 1:4, the 
number of iterations to be 1000. *localTest* being *FALSE* means that a global 
test is called while it being *TRUE* means that a local test is called instead. 
```{r}
set.seed(123)
hildaGlobal <- hildaTest(inputG=G, numSig=3, localTest=FALSE, 
                         refGroup=1:4, nIter=1000)
hildaLocal <- hildaTest(inputG=G, numSig=3, localTest=TRUE, 
                        refGroup=1:4, nIter=1000)
```


## Get signatures from *pmsignature*
This object is used to provide an initial values for running MCMC sampling to 
reduce the running time by using the EM algorithm from *pmsignature* package 
developed by Shiraishi et al.

```{r}
Param <- pmgetSignature(G, K = 3)
```

### Perform the global test and the local test with initial values
In a very similar way as running the HiLDA test, one just needs to specify 
*useInits* to be *Param* returned by the previous function to allow the initial 
values to be used in the MCMC sampling. 
```{r}
set.seed(123)
hildaGlobal <- hildaTest(inputG=G, numSig=3, useInits = Param,
                         localTest=TRUE, refGroup=1:4, nIter=1000)
hildaLocal <- hildaTest(inputG=G, numSig=3, useInits = Param,
                        localTest=TRUE, refGroup=1:4, nIter=1000)
```

###  Assess Convergence of MCMC chains
After the MCMC sampling finishes, we can compute the potential scale reduction 
statistic to examine the convergence of two chains. Usually it is recommended 
to be less than 1.10. If not, it can be done by increasing the number of 
*nIter*. 
```{r}
hildaRhat(hildaGlobal)
hildaRhat(hildaLocal)
```

## Visualize the mutation signatures from pmsignature
To allow users to compare the mutational signatures from both pmsignature and 
HiLDA, this function is used to plot the results from pmsignature. 
```{r}
pmPlots <- pmBarplot(G, Param, refGroup=1:4, sigOrder=c(1,3,2))
cowplot::plot_grid(pmPlots$sigPlot, pmPlots$propPlot, rel_widths = c(1,3))
```

## Visualize the mutation signatures from HiLDA
In contrast, the following function is used to plot the results from HiLDA. 
```{r}
hildaPlots <- hildaBarplot(G, hildaLocal, refGroup=1:4, sigOrder=c(1,3,2))
cowplot::plot_grid(pmPlots$sigPlot, pmPlots$propPlot, rel_widths = c(1,3))
```

## Output the posterior distribution of the mean difference in exposures
To visualize the 95% credible interval of the mean differences in exposures, the
following function plots the differences along with the mutational signatures. 
```{r}
hildaDiffPlots <- hildaDiffPlot(G, hildaLocal, sigOrder=c(1,3,2))
cowplot::plot_grid(hildaDiffPlots$sigPlot, hildaDiffPlots$diffPlot, 
                   rel_widths = c(1,3))
```