---
title: "An introduction to miQC"
author: "Ariel Hippen and Stephanie Hicks"
date: "Compiled: `r format(Sys.time(), '%B %d, %Y')`"
bibliography: biblio.bib
output:
    BiocStyle::html_document:
        toc: true
        number_sections: true
        toc_depth: 2
        toc_float:
            collapsed: false

vignette: >
    %\VignetteIndexEntry{miQC}
    %\VignetteEngine{knitr::rmarkdown}
    \usepackage[utf8]{inputenc}
---

# Installation

To install the package, please use the following.

```{r, eval=FALSE}
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("miQC")
```

```{r options, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(cache=FALSE, error=FALSE, message=FALSE, warning=FALSE)
```

# Introduction

This vignette provides a basic example of how to run miQC, which allows users to
perform cell-wise filtering of single-cell RNA-seq data for quality control. 
Single-cell RNA-seq data is very sensitive to tissue quality and choice of 
experimental workflow; it’s critical to ensure compromised cells and failed cell
libraries are removed. A high proportion of reads mapping to mitochondrial DNA
is one sign of a damaged cell, so most analyses will remove cells with mtRNA
over a certain threshold, but those thresholds can be arbitrary and/or
detrimentally stringent, especially for archived tumor tissues. 
miQC jointly models both the proportion of reads mapping to mtDNA genes and the
number of detected genes with mixture models in a probabilistic framework to
identify the low-quality cells in a given dataset.

For more information about the statistical background of miQC and for biological
use cases, consult the miQC paper [@hippen_miqc_2021].

You'll need the following packages installed to run this tutorial:
```{r}
suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(scRNAseq)
    library(scater)
    library(flexmix)
    library(splines)
    library(miQC)
})
```


## Example data

To demonstrate how to run miQC on a single-cell RNA-seq dataset, we'll use data
from mouse brain cells, originating from an experiment by Zeisel et al 
[@zeisel_brain_2015], and available through the Bioconductor package _scRNAseq_.

```{r}
sce <- ZeiselBrainData()
sce
```

## Scater preprocessing

In order to calculate the percent of reads in a cell that map to mitochondrial
genes, we first need to establish which genes are mitochondrial. For genes
listed as HGNC symbols, this is as simple as searching for genes starting with
_mt-_. For other IDs, we recommend using a _biomaRt_ query to map to chromosomal
location and identify all mitochondrial genes.

```{r}
mt_genes <- grepl("^mt-",  rownames(sce))
feature_ctrls <- list(mito = rownames(sce)[mt_genes])

feature_ctrls
```

_miQC_ is designed to be run with the Bioconductor package _scater_, which has a
built-in function _addPerCellQC_ to calculate basic QC metrics like number of 
unique genes detected per cell and total number of reads. When we pass in our
list of mitochondrial genes, it will also calculate percent mitochondrial reads.

``` {r}
sce <- addPerCellQC(sce, subsets = feature_ctrls)
head(colData(sce))
```

# miQC

## Basic usage

Using the QC metrics generated via _scater_, we can use the _plotMetrics_ 
function to visually inspect the quality of our dataset.

``` {r}
plotMetrics(sce)
```

We can see that most cells have a fairly low proportion of mitochondrial reads,
given that the graph is much denser at the bottom. We likely have many cells
that are intact and biologically meaningful. There are also a few cells that
have almost half of their reads mapping to mitochondrial genes, which are likely
broken or otherwise compromised and we will want to exclude from our downstream
analysis. However, it's not clear what boundaries to draw to separate the two
groups of cells. With that in mind, we'll generate a linear mixture model using
the _mixtureModel_ function.

```{r}
model <- mixtureModel(sce)
```

This function is a wrapper for _flexmix_, which fits a mixture model on our data
and returns the parameters of the two lines that best fit the data, as well as
the posterior probability of each cell being derived from each distribution.

We can look at the parameters and posterior values directly with the functions
``` {r}
parameters(model)
head(posterior(model))
```

Or we can visualize the model results using the _plotModel_ function:
```{r}
plotModel(sce, model)
```

As expected, the cells at the very top of the graph are almost certainly
compromised, most likely to have been derived from the distribution with fewer
unique genes and higher baseline mitochondrial expression. 

We can use these posterior probabilities to choose which cells to keep, and
visualize the consequences of this filtering with the _plotFiltering_ function.

```{r}
plotFiltering(sce, model)
```

To actually perform the filtering and remove the indicated cells from our 
SingleCellExperiment object, we can run the _filterCells_ parameter.

```{r}
sce <- filterCells(sce, model)
sce
```

## Other model types

In most cases, a linear mixture model will be satisfactory as well as simpler,
but _miQC_ also supports some non-linear mixture models: currently polynomials
and b-splines. A user should only need to change the _model_type_ parameter when
making the model, and all visualization and filtering functions will work the
same as with a linear model.

```{r}
sce <- ZeiselBrainData()
sce <- addPerCellQC(sce, subsets = feature_ctrls)

model2 <- mixtureModel(sce, model_type = "spline")
plotModel(sce, model2)
plotFiltering(sce, model2)

model3 <- mixtureModel(sce, model_type = "polynomial")
plotModel(sce, model3)
plotFiltering(sce, model3)
```

_miQC_ is designed to combine information about mitochondrial percentage and
library complexity (number of genes discovered) to make filtering decisions, but
if an even simpler model is preferred, _miQC_ can make a model based only on
mitochondrial information. This can be done by changing the _model_type_ 
parameter to "one_dimensional", which runs a 1D gaussian mixture model. When
library size is not added to the model, it is possible to calculate a single
mitochondrial threshold to apply, which can be directly calculated using 
the _get1DCutoff_ function.

```{r}
model4 <- mixtureModel(sce, model_type = "one_dimensional")
plotModel(sce, model4)
plotFiltering(sce, model4)
get1DCutoff(sce, model4)
```

## Extras

### Changing posterior cutoff

_miQC_ defaults to removing any cell with 75% or greater posterior 
probability of being compromised, but if we want to be more or less stringent, 
we can alter the _posterior_cutoff_ parameter, like so:

```{r}
plotFiltering(sce, model4, posterior_cutoff = 0.9)
```

Note that when performing miQC multiple times on different samples for the same
experiment, it's recommended to select the same _posterior_cutoff_ for all, to
give consistency in addition to the flexibility of sample-specific models.

### Preventing exclusion of low-mito cells

_miQC_ includes two parameters to accomodate unusual and undesired behavior in
the linear distributions. These issues are especially visible in some cancer
datasets with a stringent posterior cutoff. We've included a bare-bones version
of QC data for a high-grade serous ovarian tumor (full version of the data is
available at GEO, accession GSM4816047). 

```{r}
data("hgsoc_metrics")
sce <- SingleCellExperiment(colData = metrics)
model <- mixtureModel(sce)
plotFiltering(sce, model, posterior_cutoff = 0.6, enforce_left_cutoff = FALSE,
                keep_all_below_boundary = FALSE)
```

The first issue is the group of cells at the bottom of the distribution getting
marked for removal. These cells happen to be near the x-intercept of the 
compromised cell line, which increases their posterior probability of being
compromised. But since they have decent library complexity and a low 
mitochondrial percentage, so it doesn't make biological sense to exclude them. 
When the _keep_all_below_boundary_ parameter is set to True, as is the default,
any cells below the intact cell line are kept:

```{r}
plotFiltering(sce, model, posterior_cutoff = 0.6, enforce_left_cutoff = FALSE,
                keep_all_below_boundary = TRUE)
```

### Preventing U-shaped boundary

The second issue is the U-shape in the boundary between kept and discarded
cells. When this occurs, there will be cells at the bottom of the trough that 
are discarded, but some cells with less library complexity (farther left) and 
higher percentage of mitochondrial reads (higher) -- meaning they are worse in
both of our QC metrics -- will be kept. To avoid this happening, the parameter
_enforce_left_cutoff_ will identify the cell marked for removal with the lowest
mitochondrial percentage, determine its library complexity, and discard all
cells with both lower complexity and higher mitochondrial percentage:

```{r}
plotFiltering(sce, model, posterior_cutoff = 0.6, enforce_left_cutoff = TRUE,
                keep_all_below_boundary = TRUE)
```

This will make a de facto mitochondrial percentage cutoff for all cells with low
library complexity, but will be more permissive for cells with high library
complexity and high mitochondrial percentage, which are more likely to be intact
cells with a biological reason for high mitochondrial expression than their low
library complexity counterparts.

## When not to use miQC

The miQC model is based on the assumption that there are a non-trivial number of
compromised cells in the dataset, which is not true in all datasets. We
recommend running _plotMetrics_ on a dataset before running miQC to see if the
two-distribution model is appropriate. Look for the distinctive triangular shape
where cells have a wide variety of mitochondrial percentages at lower gene 
counts and taper off to lower mitochondrial percentage at higher gene counts, as
can be seen in the Zeisel data.

For example of a dataset where there's not a significant number of compromised
cells, so the two-distribution assumption is not met, look at another dataset
from the _scRNAseq_ package, mouse data from Buettner et al
[@buettner_computational_2015].

```{r}
sce <- BuettnerESCData()
mt_genes <- grepl("^mt-", rowData(sce)$AssociatedGeneName)
feature_ctrls <- list(mito = rownames(sce)[mt_genes])
sce <- addPerCellQC(sce, subsets = feature_ctrls)

plotMetrics(sce)
```

The _mixtureModel_ function will throw a warning if only one distribution is
found, in which case no cells would be filtered. In these cases, we recommend
using other filtering methods, such as a cutoff on mitochondrial percentage or
identifying outliers using median absolute deviation (MAD). 

# Session Information

```{r, echo=FALSE}
## Session info
options(width = 120)
sessionInfo()
```

# References