---
title: "Differential cell-type-specific allelic imbalance with airpart"
author: 
  - name: "Wancen Mu, Hirak Sarkar, Avi Srivastava, Kwangbom Choi, Rob Patro, Michael I. Love"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
abstract: |
  Airpart identifies sets of genes displaying differential
  cell-type-specific allelic imbalance across cell types
  or states, utilizing single-cell allelic counts. It makes use of a
  generalized fused lasso with binomial observations of allelic
  counts to partition cell types by their allelic
  imbalance. Alternatively, a nonparametric method for partitioning
  cell types is offered. The package includes a number of
  visualizations and quality control functions for examining single
  cell allelic imbalance datasets.
output: 
  html_document:
    toc: true
    toc_float: true 
    theme: united
    highlight: tango
vignette: |
  %\VignetteIndexEntry{Differential allelic imbalance with airpart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Real data example

Vignette on Larsson 2019 data can be found 
[here](https://htmlpreview.github.io/?https://github.com/Wancen/airpartpaper/blob/main/Larsson2019/Larsson2019.html),
which has allelic single-cell RNA-seq with 4 cell states.

# Simulated data example I

The *airpart* package takes input data of counts from each of two
alleles across genes (rows) and cells (columns) from a single-cell
RNA-seq experiment.

For demonstration in the package vignette, we will simulate some data
using `makeSimulatedData` function provided within the *airpart*
package. We will examine the allelic counts and then perform QC steps
before analyzing the data for allelic imbalance across groups of cells.

## Simulation set-up

The simulated example dataset has 3 gene clusters with differential
allelic imbalance (DAI): 

* the first cluster has pairs of cell types with same allelic ratio
  with 0.2 and 0.8 (larger DAI) 
* the second cluster has balanced allelic ratio
* the third cluster has pairs of cell types with same allelic ratio
  with 0.7 and 0.9 (smaller DAI)

Below we specify a number of simulation settings as arguments to the
simulation function:

* the "noisy" cell count is 2
* the normal cell count is 10
* 4 cell types
* 20 cells within each cell type
* 25 genes within each gene cluster
* overdispersion parameter `theta` in `rbetabinom` is 20 (higher is
  less dispersion)

```{r}
library(airpart)
suppressPackageStartupMessages(library(SingleCellExperiment))
p.vec <- rep(c(0.2, 0.8, 0.5, 0.5, 0.7, 0.9), each = 2)
set.seed(2021)
sce <- makeSimulatedData(
  mu1 = 2, mu2 = 10, nct = 4, n = 20,
  ngenecl = 25, theta = 20, ncl = 3,
  p.vec = p.vec
)
```

```{r}
unique(rowData(sce)) # the true underlying allelic ratios
table(sce$x) # counts of each cell type
assays(sce)[["a1"]][1:5, 1:5] # allelic counts for the effect allele
```

## Required input data

In summary, *airpart* expects a *SingleCellExperiment* object with:

* discrete cell types recorded as a variable `x` in the `colData(sce)`
* effect and non-effect allelic counts as assays `a1` and `a2`

The allelic ratio is calculated as `a1 / (a1 + a2)`.

Note: We assume that the cell types have been either provided by the
experiment, or identified based on total count. We assume the allelic
ratio was not used in determining the cell groupings in `x`.

```{r}
assayNames(sce)
sce$x
```

# Create allelic ratio matrix

In the `preprocess` step, we add a pseudo-count for gene clustering
and visualization (not used for inference later on allelic imbalance
though, which uses original allelic counts). From the heatmap, we can
clearly identify the three gene clusters (across rows), and we also
see cell type differences (across columns). Within each cell type,
there are some cells with noisier estimates (lower total count) than
others. Again, the allelic ratio tells us how much more of the `a1`
allele is expressed, with 1 indicating all of the expression coming
from the `a1` allele and 0 indicating all of the expression coming
from the `a2` allele.

```{r}
sce <- preprocess(sce)
makeHeatmap(sce)
```

# Quality control steps

## QC on cells

We recommend both QC on cells and on genes. We begin with cell allelic
ratio quality control. For details on these metrics, see `?cellQC`.

```{r}
cellQCmetrics <- cellQC(sce, mad_detected = 4)
cellQCmetrics
```

Now define cell filtering automatically or users can manually filter
out based on `sum`,`detected` and `spikePercent`.

```{r}
keep_cell <- (
  cellQCmetrics$filter_sum | # sufficient features (genes)
    cellQCmetrics$filter_detected | # sufficient molecules counted
    # sufficient features expressed compared to spike genes,
    # high quality cells
    cellQCmetrics$filter_spike
)
sce <- sce[, keep_cell]
```

## QC on genes

We also recommend QC on genes for allelic ratio analysis. Note that we
require genes to be expressed in at least 25% of cells within each
cell type and the genes to have high allelic imbalance variation
across cell types. The following code chunk is recommended (not
evaluated here though). If users want to estimate homogeneous cell
type allelic imbalance, they can set `sd = 0` and examine the below
summary step to find interesting gene clusters with weighted mean
deviating from 0.5.

```{r, eval=FALSE}
featureQCmetric <- featureQC(sce)
keep_feature <- (featureQCmetric$filter_celltype &
  featureQCmetric$filter_sd &
  featureQCmetric$filter_spike)
sce <- sce[keep_feature, ]
```

# Gene clustering

*airpart* provides a function to cluster genes by their allelic
imbalance profile across cells (not using cell grouping information,
e.g. `sce$x`). We then recommend providing genes
within a cluster to the partition function. Clustering genes increases
power for detecting cell type partitions, and improves speed as it
reduces the number of times the partition must be estimated.

We provide two methods for gene clustering. 

1. Gaussian Mixture modeling

Gaussian mixture modeling is the default method for gene
clustering. The scatter plot is shown based on top 2 PCs of the
smoothed allelic ratio data. The argument `plot=FALSE` can be used to
avoid showing the plot.

```{r}
sce <- geneCluster(sce, G = 1:4)
metadata(sce)$geneCluster
```

2. Hierarchical clustering 

```{r}
sce.hc <- geneCluster(sce, method = "hierarchical")
metadata(sce.hc)$geneCluster
```

In this simulated dataset case, the clustering is very similar, but on
allelic scRNA-seq datasets, we have found improved clustering with the
Gaussian mixture model approach (more similar genes within cluster,
based on visual inspection of PCA plot and of allelic ratio heatmaps).

# Running airpart for allelic imbalance across groups of cells

## Simple summary table of allelic ratio

We first quickly look at the weighted mean of allelic ratio for each
gene cluster. From this step we will identify the interesting gene
clusters. The mean is calculated, weighting the information from each
gene x cell element of the matrices by the total count.

```{r}
summary <- summaryAllelicRatio(sce)
summary
```

The following step is a complement of the QC on genes step. We
recommend users only run *airpart* when the largest ordered allelic
ratio difference > 0.05 for speed concerns. We find that the allelic
ratio of most of the gene clusters in such cases (small absolute
allelic ratio differences) won't provide enough evidence to detect
differential allelic imbalance.

```{r}
sapply(1:length(summary), function(i) {
  inst <- summary[[i]]
  inst_order <- inst[order(inst$weighted.mean), ]
  max(diff(inst_order$weighted.mean)) > 0.05
})
```

## Experiment-wide beta-binomial over-dispersion

We recommend examining the experiment-wide beta-binomial
over-dispersion, which helps to inform whether to use a binomial
likelihood or a nonparametric approach to partitioning the cell types
by allelic imbalance.

We focus on the first gene cluster (if a gene cluster is not provided,
`estDisp` will choose the largest cluster).

The blue trend line gives the typical values of over-dispersion across
all the genes in the cluster, and across all the cell types
(accounting for differences across the cell types in the expected
ratio).

```{r}
estDisp(sce, genecluster = 1)
```

## Modeling using fused lasso with binomial likelihood

*airpart* offers a method for partitioning cell types using the
generalized fused lasso with binomial likelihood, as implemented in
the *smurf* package. Cell types are merged based on their similarity
of allelic ratios, accounting for excess variance on the ratio from
low counts. The penalization is determined using deviance on held-out
data, with a 1 SE cross-validation rule for favoring smaller models
(more fused cell types).

The fusion step can also taken into account both cell-level and
gene-level baseline effects, through the use of a `formula` 
(see `?fusedLasso` for example).

```{r}
sce_sub <- fusedLasso(sce,
  model = "binomial",
  genecluster = 1, ncores = 1,
  niter = 2
)
```

The partition groups and the penalty $\lambda$ from the fused lasso
are stored in the metadata: 

```{r, results="asis"}
knitr::kable(metadata(sce_sub)$partition, row.names = FALSE)
```

```{r}
metadata(sce_sub)$lambda
```

Above, `ncores` is the number of CPU used for parallelization. As a
guide, one can specify `niter=5` when the `cts` weighted allelic ratio
difference is smaller than 0.1, in order to provide additional
estimator robustness.

### Consensus partition

If you run `niter` > 1, you can use our consensus partition function
to derive the final partition. This function makes use of ensemble
consensus clustering via the *clue* package.

```{r, results='asis', collapse=TRUE}
sce_sub <- consensusPart(sce_sub)
knitr::kable(metadata(sce_sub)$partition, row.names = FALSE)
```

## Modeling using pairwise Mann-Whitney-Wilcoxon extension

An alternative to the fused lasso with binomial likelihood is an
extension we have implemented wherein all pairs cell types are
compared with Mann-Whitney-Wilcoxon rank sum tests. In practice, we
find that when the allelic counts deviates strongly from a binomial
(e.g. large over-dispersion, small values of `theta`), the `wilcoxExt`
function can offer improved performance, in terms of recovery of the
true partition of cell types by allelic imbalance. The partition is
decided based on a loss function motivated by the Bayesian Information
Criteria.

```{r}
thrs <- 10^seq(from = -2, to = -0.4, by = 0.2)
sce_sub_w <- wilcoxExt(sce, genecluster = 1, threshold = thrs)
knitr::kable(metadata(sce_sub_w)$partition, row.names = FALSE)
metadata(sce_sub_w)$threshold
```

## Calculating allelic ratio estimates via beta-binomial model

After *airpart* determines a partition of cell types either by the fused
lasso with binomial likelihood or the nonparametric approach described
above, it uses those fused lasso estimates or weighted means as the
center of a Cauchy prior for posterior estimation of allelic ratios
per cell type and per gene. Posterior mean and credible intervals are
provided. The posterior inference makes use of a beta-binomial
likelihood, and a moderated estimate of the over-dispersion. The prior
from the partition and the moderated estimate of over-dispersion are
provided to the `apeglm` function from the Bioconductor package of the
same name.

Note that the estimates and credible intervals are not equal for cell
types in the same partition and for genes, because in this step we
re-estimate the conditional cell type means per cell type (not per
partition) and account for each gene's moderated estimate of
over-dispersion.

```{r, warning=FALSE, fig.width=12}
sce_sub <- allelicRatio(sce_sub, DAItest = TRUE)
makeForest(sce_sub, showtext = TRUE)
```

Allelic ratio estimates (`ar`) as well as `svalue` and credible
interval (`lower` and `upper`) are stored in `rowData`. Can use `extractResult` function to derive them.

```{r results="asis"}
genepoi <- paste0("gene", seq_len(5))
ar <- extractResult(sce_sub)
knitr::kable(ar[genepoi,])
makeStep(sce_sub[genepoi,])
```

### Derive statistical inference

To derive statistical inference of allelic imbalance(AI), we suggest a low aggregate probability of false-sign-or-small (FSOS) events (s-value < .005) or examine credible intervals not overlapping an allelic ratio of 0.5. Here all selected 5 genes demonstrated AI on each cell type. 


```{r}
s <- extractResult(sce_sub, "svalue")
apply(s[genepoi,],2, function(s){s<0.005})
```

To derive statistical inference of dynamic AI(DAI), raw p values from likelihood ratio test(LRT) and Benjamini-Hochberg (BH) corrected p value are stored in `p.value` and `adj.p.value`, respectively. Here all 25 genes demonstrated DAI across cells. 

```{r}
adj.p <- mcols(sce_sub)$adj.p.value
adj.p < 0.05
```


# Allelic ratio partition and posterior inference, example II

To demonstrate showing partition results on a heatmap, let's make a
more complex simulation, with 8 cell types, in 3 true groups
by allelic ratio. In the code below, we construct the more complex
simulation, run preprocessing, and examine the allelic ratio heatmap.

```{r}
nct <- 8
p.vec <- (rep(c(
  -3, 0, -3, 3,
  rep(0, nct / 2),
  2, 3, 4, 2
), each = 2) + 5) / 10
sce <- makeSimulatedData(
  mu1 = 2, mu2 = 10, nct = nct, n = 30,
  ngenecl = 50, theta = 20, ncl = 3, p.vec = p.vec
)
sce <- preprocess(sce)

cellQCmetrics <- cellQC(sce, mad_detected = 4)
keep_cell <- (
  cellQCmetrics$filter_sum | # sufficient features (genes)
    cellQCmetrics$filter_detected | # sufficient molecules counted
    # sufficient features expressed compared to spike genes,
    # high quality cells
    cellQCmetrics$filter_spike
)
sce <- sce[, keep_cell]

featureQCmetric <- featureQC(sce)
keep_feature <- (featureQCmetric$filter_celltype &
  featureQCmetric$filter_sd &
  featureQCmetric$filter_spike)
sce <- sce[keep_feature, ]

makeHeatmap(sce)
```

We can then perform gene clustering:

```{r}
sce <- geneCluster(sce, G = 1:4)
table(mcols(sce)$cluster)
```

We check for experiment-wide beta-binomial over-dispersion.
Note that larger `theta` (y-axis) corresponds to *less*
over-dispersion. 

We focus on the first gene cluster (if a gene cluster is not provided,
`estDisp` will choose the largest cluster).

```{r}
estDisp(sce, genecluster = 1)
```

We identify an interesting gene cluster and run the fused lasso.

```{r}
sce_sub <- fusedLasso(sce,
  model = "binomial",
  genecluster = 1, ncores = 1
)
```

```{r, results="asis"}
knitr::kable(metadata(sce_sub)$partition, row.names = FALSE)
```

Next we estimate allelic ratios per cell type and per gene, with
credible intervals. For demonstration, we subset to the first 10
genes. 

```{r}
sce_sub2 <- sce_sub[1:10, ]
sce_sub2 <- allelicRatio(sce_sub2)
```

We plot all cell types together, but one can set `ctpoi=c(1,3,7)` to
limit the cell types to be plotted when there are too many cell
types. And one can set `genepoi=c(1,3,7)` or `genepoi=c("gene1","gene3","gene7")` 
to only plot selected genes.


```{r}
makeForest(sce_sub2)
ar <- extractResult(sce_sub2)
knitr::kable(ar)
```

A violin plot with posterior mean allelic ratios (one estimate per
gene) on the y-axis:

```{r}
makeViolin(sce_sub2)
```

Finally, a heatmap as before, but now with the cell types grouped
according to the partition:

```{r}
makeHeatmap(sce_sub2)
```

The heatmap can also be shown ordered by cell type.

```{r}
makeHeatmap(sce_sub2, order_by_group = FALSE)
```

# Session Info

```{r}
sessionInfo()
```