---
title: "Demultiplexing oligonucleotide-labeled scRNA-seq data with demuxmix"
shorttitle: "Demultiplexing droplets with demuxmix"
author:
- name: Hans-Ulrich Klein
  affiliation: Center for Translational and Computational Neuroimmunology,
    Department of Neurology,
    Columbia University Irving Medical Center, New York, NY
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
package: demuxmix
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Demultiplexing cells with demuxmix}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

# Introduction

Droplet-based single-cell RNA sequencing (scRNA-seq) facilitates measuring the
transcriptomes of thousands of cells in a single run. Pooling cells from
different samples or conditions before cell partitioning and library
preparation can significantly lower costs and reduce batch effects. The task of
assigning each cell of a pooled sample to its sample of origin is called
demultiplexing. If genetically diverse samples are pooled, single nucleotide
polymorphisms in coding regions can be used for demultiplexing. When working
with genetically similar or identical samples, an additional experimental step
is required to label the cells with a sample-specific barcode oligonucleotide
before pooling. Several techniques have been developed to label cells or nuclei
with oligonucleotides based on antibodies [@stoeckius; @gaublomme] or lipids
[@mcginnis]. These oligonucleotides are termed hashtag oligonucleotides (HTOs)
and are sequenced together with the RNA molecules of the cells resulting in an
(HTOs x droplets) count matrix in addition to the (genes x droplets) matrix
with RNA read counts.

The *demuxmix* package implements a method to demultiplex droplets based on HTO
counts using negative binomial regression mixture models. *demuxmix* can be
applied to the HTO counts only, but better results are often achieved if the
total number of genes detected per droplet (not the complete transcription
profile) is passed to the method along with the HTO counts to leverage the
positive association between genes detected and HTO counts. Further, *demuxmix*
provides estimated error rates based on its probabilistic mixture model
framework, plots for data quality assessment, and multiplet identification, as
outlined in the example workflows in this vignette. Technical details of the
methods are described in the man pages.


# Installation

The *demuxmix* package is available at [Bioconductor](https://bioconductor.org)
and can be installed via *BiocManager::install*:

```{r installation.install, eval = FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("demuxmix")
```

The package only needs to be installed once. Load the package into an R
session with

```{r installation.loadlib, eval = FALSE}
library(demuxmix)
```


# Quick start

A matrix of raw HTO counts (HTO x cells) and a vector with the number of
detected genes per droplet are needed to run *demuxmix* with default settings.
Empty and low-quality droplets should be removed before running *demuxmix*.
A gene with at least one read is usually considered as detected.
Here, we simulate a small example dataset.

```{r quickstartSimulate}
library(demuxmix)

set.seed(2642)
class <- rbind(
    c(rep(TRUE,  220), rep(FALSE, 200)),
    c(rep(FALSE, 200), rep(TRUE,  220))
)
simdata <- dmmSimulateHto(class)
hto <- simdata$hto
dim(hto)
rna <- simdata$rna
length(rna) == ncol(hto)
```

The dataset consists of 420 droplets with cells labeled with two different HTOs.
The first 200 droplets are singlets labeled with the first HTO, followed by
another 200 singlets labeled with the second HTO. The remaining 20 droplets are
doublets, which are positive for both HTOs. Next, we run *demuxmix* to assign
droplets to HTOs.

```{r quickstartDdemuxmix}
dmm <- demuxmix(hto, rna = rna)
summary(dmm)
classes <- dmmClassify(dmm)
table(classes$HTO)
```

The object *dmm* contains the mixture models used to classify the droplets. The
data frame returned  by *summary* shows that `r sum(classes$HTO == "HTO_1")`
droplets were assigned to *HTO_1* and `r sum(classes$HTO == "HTO_2")` to
*HTO_2*, respectively. Since these results meet our expectations and
the estimated error rates are reasonably low, we ran *dmmClassify* to obtain
the classifications for each droplet as a data frame with one row per droplet.
The first column *HTO* of this data frame contains the final classification
results.

A histogram of the HTO values overlayed with the components from the mixture
model can be plotted for quality control. The following command plots a panel
with one histogram per HTO in the dataset.

```{r quickstartHistogram, fig.cap = "Density histograms overlayed with mixture probability mass function. The density histograms show the distribution of the HTO counts for the first HTO (upper figure) and the 2nd HTO (lower figure). The negative component of the mixture model representing non-tagged cells is shown in blue, and the positive component is in red."}
plotDmmHistogram(dmm)
```


# Demultiplexing droplets with demuxmix

## Example datasets

Two example datasets are introduced in this vignette to illustrate a typical
*demuxmix* workflow. The first dataset is a small simulated dataset used to
generate the plots when building this vignette. The alternative second dataset
is a real dataset and can be downloaded from the `r Biocpkg("ExperimentHub")`
via the `r Biocpkg("scRNAseq")` package. Both datasets can be used to go
through this vignette by running either the first (simulated data) or the
second code block (real data) below. Since the real dataset is much larger,
some commands may take up to one minute to complete, which is why this vignette
was built with the simulated data.

### Simulated dataset

Simulated HTO count data are generated for 650 droplets by the method
*dmmSimulateHto*. The logical matrix *class* defines for each droplet (column)
and HTO (row) whether the droplet is positive or negative for that
HTO. Thus, the 3 x 650 matrix *class* below describes a dataset with 3
hashtags and 650 droplets, of which 50 are doublets (with cells tagged
by *HTO_1* and *HTO_2*). The remaining 600 droplets consist of 3 blocks of 200
singlets tagged by one of the three HTOs each.

```{r simulate, fig.height = 6, fig.cap = "Characteristics of the simulated dataset. A) The histogram of the HTO counts from the first HTO (HTO_1) shows a clear separation between positive and negative droplets. B) The histogram of the second HTO (HTO_2) looks similar, although the positive droplets have a smaller mean count and a larger dispersion. C) The histogram of the third HTO reveals a more extensive overlap between the distributions of the positive and negative droplets. D) The scatter plot shows a positive correlation between the number of detected genes and HTO counts in the simulated data."}
library(demuxmix)
library(ggplot2)
library(cowplot)

set.seed(5636)
class <- rbind(
    c(rep( TRUE, 200), rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 50)),
    c(rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 200), rep( TRUE, 50)),
    c(rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 50))
)
simdata <- dmmSimulateHto(
    class = class,
    mu = c(600, 400, 200),
    theta = c(25, 15, 25),
    muAmbient = c(30, 30, 60),
    thetaAmbient = c(20, 10, 5),
    muRna = 3000,
    thetaRna = 30
)
hto <- simdata$hto
rna <- simdata$rna

htoDf <- data.frame(t(hto), HTO = colSums(hto), NumGenes = rna)
pa <- ggplot(htoDf, aes(x = HTO_1)) +
    geom_histogram(bins = 25)
pb <- ggplot(htoDf, aes(x = HTO_2)) +
    geom_histogram(bins = 25)
pc <- ggplot(htoDf, aes(x = HTO_3)) +
    geom_histogram(bins = 25)
pd <- ggplot(htoDf, aes(x = NumGenes, y = HTO)) +
    geom_point()
plot_grid(pa, pb, pc, pd, labels = c("A", "B", "C", "D"))
```

*dmmSimulateHto* dmmSimulateHto returns a matrix with the simulated HTO counts
and a vector with the simulated number of detected genes for each droplet.
Figure \@ref(fig:simulate) shows that we simulated two HTOs of excellent
quality (HTO_1 and HTO_2) and a third HTO (HTO_3) with more background reads,
complicating the demultiplexing. Further, the simulated data shows a positive
association between the number of detected genes and the number of HTO counts
per droplet, which is often observed in real data.


### Cell line mixture dataset

The cell line mixture dataset from @stoeckius consists of cells from 4 different
cell lines. Three samples were taken from each cell line and tagged with a
different HTO resulting in a total of 12 different HTOs. The downloaded dataset
still contains many potentially empty droplets, which are removed using
*emptyDrops*. Subsequently, the numbers of detected genes are calculated and
the HTO matrix is extracted from the *SingleCellExperiment* object. More
information about the preprocessing and data structures for single-cell data in
Bioconductor can be found in this excellent
[online book](http://bioconductor.org/books/release/OSCA/index.html).

```{r stoeckius, eval = FALSE}
library(demuxmix)
library(ggplot2)
library(cowplot)
library(scRNAseq)
library(DropletUtils)

set.seed(8514)
htoExp <- StoeckiusHashingData(type = "mixed")
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]
rna <- colSums(counts(htoExp) > 0)
hto <- counts(altExp(htoExp))
dim(hto)

htoDf <- data.frame(t(hto[c("HEK_A", "KG1_B", "KG1_C"), ]),
    HTO = colSums(hto), NumGenes = rna
)
pa <- ggplot(htoDf, aes(x = HEK_A)) +
    geom_histogram(binwidth = 10) +
    coord_cartesian(xlim = c(0, 600), ylim = c(0, 500))
pb <- ggplot(htoDf, aes(x = KG1_B)) +
    geom_histogram(binwidth = 10) +
    coord_cartesian(xlim = c(0, 600), ylim = c(0, 500))
pc <- ggplot(htoDf, aes(x = KG1_C)) +
    geom_histogram(binwidth = 10) +
    coord_cartesian(xlim = c(0, 600), ylim = c(0, 500))
pd <- ggplot(htoDf, aes(x = NumGenes, y = HTO)) +
    geom_point(size = 0.1) +
    coord_cartesian(ylim = c(0, 750))
plot_grid(pa, pb, pc, pd, labels = c("A", "B", "C", "D"))
```

The plots generated by the code above reveal that the quality of the different
HTOs varies in the cell line mixture dataset. Most HTOs demonstrate a nicely
separated bimodal distribution, as exemplarily shown for *HEK_A*, but *HG1_C*
and, to a lesser extent, *HG1_B*, demonstrate a larger overlap between the
distributions of the positive and negative droplets. As with the simulated
data, there is a positive association between HTO counts and the detected
number of genes. However, the association appears noisier because four
different cell lines with highly distinct RNA profiles and different cell
surface characteristics (likely influencing HTO labeling) were pooled.
Consequently, the HTO counts and the number of detected genes are very
different between cells from different samples but very similar between cells
from the same sample. This uncommon experimental design makes it difficult to
leverage the association between the number of genes detected and HTO counts
during demultiplexing. With default parameters, *demuxmix* automatically selects
the most appropriate model, and, for this dataset, naive instead of regression
mixture models are used for most HTOs. In addition, the large number of 12
pooled samples further complicates the demultiplexing.


## Running demuxmix

*demuxmix* takes a matrix of HTO counts and a vector with the numbers of
detected genes per droplet as input and returns an object of class *Demuxmix*
containing a mixture model for each HTO. Several additional parameters can be
passed to *demuxmix*, but all these parameters have default values that work
well across many datasets. With the default settings, *demuxmix* automatically
selects either naive mixture models or regression mixture models for each HTO,
depending on which model provides the best separation between positive and
negative droplets.

```{r demuxmix}
dmm <- demuxmix(hto, rna = rna)
dmm
classLabels <- dmmClassify(dmm)
head(classLabels)

summary(dmm)

# Compare demultiplexing results to ground truth from simulation
table(classLabels$HTO, simdata$groundTruth)
```

For the simulated data, the object *dmm* contains three regression mixture
models. We then apply *dmmClassify* to obtain a data frame with one row for
each droplet. The first column contains the classification result. The second
column contains the posterior probability that the assigned HTO is correct.
The last column contains the type of the assignment, which is either
"singlet", "multiplet", "negative" (not tagged by any HTO), or "uncertain"
(posterior probability too small to classify the droplet with confidence). Only
droplets of type singlet should be kept in the dataset. Multiplets of two or
more cells from the same sample cannot be detected at the demultiplexing step.
The comparison with the true labels from the simulation shows that most
droplets were classified correctly.

The parameter *model* can be used to select a specific mixture model. The naive
mixture model selected in the code below does not use any information from the
RNA data. As shown in the following output, the naive mixture model performs
slightly worse than the regression mixture model mainly because more
droplets are assigned to the class "uncertain".

```{r demuxmixNaive}
dmmNaive <- demuxmix(hto, model = "naive")
dmmNaive
classLabelsNaive <- dmmClassify(dmmNaive)
summary(dmmNaive)

# Compare results of the naive model to ground truth from simulation
table(classLabelsNaive$HTO, simdata$groundTruth)
```

Another useful parameter is the acceptance probability *pAcpt*, which can be
passed to the *demuxmix* method to overwrite the default value, or
directly set in the object *dmm* as shown in the code block below. The parameter
is used at the classification step and specifies the minimum posterior
probability required to classify a droplet. If the posterior probability of the
most likely class is smaller than *pAcpt*, the droplet is classified as
"uncertain". Setting *pAcpt* to 0 forces the classification of all droplets.
For HTO datasets of moderate quality, the default value can be lowered to
recover more droplet in the dataset, if a larger error rate is acceptable.
The *summary* method estimates the FDR depending on the current setting
of *pAcpt*.

```{r reclassify}
pAcpt(dmm)
pAcpt(dmm) <- 0.95
summary(dmm)

pAcpt(dmm) <- 0
summary(dmm)
```


## Quality control

The *demuxmix* package implements methods for assessing data quality and
model fit. All plotting methods plot a panel with one graph for each HTO in the
dataset as default. Specific HTOs can be selected via the parameter *hto*. The
most informative plot is probably the histogram of the HTO data overlaid with
the mixture probability mass function, as this plot shows both the raw data
and the model fit.

```{r qualityHistogram, fig.height = 8, fig.cap = "Density histograms overlayed with mixture probability mass functions. The density histogram is shown for each HTO in the simulated dataset. The negative component of the respective mixture model representing non-tagged cells (blue) and the positive component (red) are plotted on top of the histogram. The black curve is the mixture pmf."}
plotDmmHistogram(dmm)
dmmOverlap(dmm)
```

First, the histograms should be used to verify the model fit. The model
fit is good if the mixture pmf closely follows the shape of the histogram.
The model fit in the simulated data is adequate for all three HTOs. Second,
the histogram should be bimodal, and the blue and red components should show
little overlap, which is the case for the first two HTOs (HTO_1 and HTO_2).
The third HTO (HTO_3) was simulated to harbor more background reads (mean
of 60 reads), and, consequently, the blue component is shifted towards the
right. The method *dmmOverlap* calculates the area intersected by the two
components. The area is close to zero for the first two HTOs but
`r round(dmmOverlap(dmm)["HTO_3"], digits = 3)` for HTO_3. As a rule of
thumb, an overlap less than 0.03 can be considered excellent. As seen in
this example dataset, a value around 0.05 will lead to some droplets being
classified as "uncertain" but is still sufficient to accurately
demultiplex the majority of droplets.

The histogram plots can look very different across different real datasets
depending on (i) the HTO sequencing depth, (ii) the number of pooled
samples, (iii) the number of droplets, and (iv) the quality of the HTO
experiment (cross-staining, background HTOs, cell debris). Only the last
factor (iv) relates to the actual quality of the data. Specifically,
(i) and (ii) can both lead to a histogram that looks like a vertical blue
line with an x-offset of 0 and a flat horizontal red line with a y-offset
of 0. The reason is that a large sequencing depth results in large
HTO counts in positive droplets so that the probability mass of the red
component spreads over an extensive range of values on the x-axis and
appears flat compared to the sharp location of the blue background component.
Similarly, a large number of pooled samples, e.g., the 12 samples in the
cell line mixture dataset, causes the red component to appear flat as it
only covers an area of about 1/12 compared to 11/12 covered by the blue
component. In such cases, it is helpful to zoom into the critical part of
the histogram where the red and blue components overlap. For the cell line
mixture dataset, the following command generates a suitable histogram for
the first HTO (panel B).

```{r qualityZoomHistogram, eval = FALSE}
pa <- plotDmmHistogram(dmm, hto=1)
pb <- plotDmmHistogram(dmm, hto=1) +
    coord_cartesian(xlim = c(0, 200), ylim = c(0, 0.01))
plot_grid(pa, pb, labels = c("A", "B"))
```

Another useful quality plot is the histogram of the posterior probability that a
droplet is positive for the respective HTO.

```{r qualityPosteriorP, fig.height = 8, fig.cap = "Histograms of posterior probabilities. Each histogram shows the distribution of the posterior probabilities that a droplet contains a tagged cell. Posterior probabilities were obtained from the mixture model fitted to the respective HTO data."}
plotDmmPosteriorP(dmm)
```

As seen in the first two histograms for HTO_1 and HTO_2, most droplets should
have a posterior probability close to 0 (negative droplet) or close to 1
(positive droplet). Consistent with the previous quality plots, the third HTO
(HTO_3) demonstrates some droplets with posterior probabilities between
0 and 1, reflecting droplets with cells of uncertain origin.

Finally, if regression mixture models were used, the decision boundary in
relation to the number of detected genes and the HTO counts can be plotted.

```{r qualityScatterplot, fig.height = 8, fig.cap = "Decision boundary. The scatter plots show the relation between the number of detected genes and HTO counts for each of the three HTOs. The color indicates the posterior probability. The black dashed line depicts the decision boundary where the posterior probability is 0.5."}
plotDmmScatter(dmm)
```

These plots are only available if regression mixture models were used. A
droplet with many detected genes is required to have more HTO reads in order
to be classified as positive. If naive mixture models were used, the dashed
decision boundaries between blue (negative) and red (positive) droplets would
be vertical lines.

## Comparison to hashedDrops

The method *hashedDrops* in the package `r Biocpkg("DropletUtils")` provides
an alternative approach for demultiplexing HTO-labeled single cell data. The
following code block runs *hashedDrops* and compares the results to *demuxmix*.

```{r hashedDrops, fig.height = 6, fig.cap = "Comparison between demuxmix and hashedDrops. The heatmap depicts the classification results for the simulated dataset obtained from demuxmix on the x-axis and hashedDrops on the y-axis."}
suppressPackageStartupMessages(library(DropletUtils))
suppressPackageStartupMessages(library(reshape2))
hd <- hashedDrops(hto)
hdrops <- rownames(hto)[hd$Best]
hdrops[!hd$Confident] <- "uncertain"
hdrops[hd$Doublet] <- "multiplet"

dmux <- classLabels$HTO
dmux[classLabels$Type == "multiplet"] <- "multiplet"

comp <- melt(as.matrix(table(dmux, hdrops)))
colnames(comp) <- c("demuxmix", "hashedDrops", "Count")
comp$color <- ifelse(comp$Count > 100, "black", "white")
ggplot(comp, aes(x = demuxmix, y = hashedDrops, fill = Count)) +
    geom_tile() +
    scale_fill_viridis_c() +
    geom_text(aes(label = Count), col = comp$color, size = 5)
```

The classifications of both methods are highly concordant. No droplet has been
assigned to different singlet classes by the two approaches. However,
*hashedDrops* assigned more droplets tagged by HTO_3 to the category
"uncertain", which is expected when looking at the lower panel of Figure
\@ref(fig:qualityScatterplot). In contrast to *demuxmix*, *hashedDrops* does
not utilize the positive association between the number of genes and HTO counts
explicitly simulated in this dataset.


# Special usecase: pooling non-labeled with labeled cells

If precious rare cells are pooled with highly abundant cells, labeling the
highly abundant cells only but not the rare cells avoids additional losses of
the rare cells during the labeling process. However, such a design results in
a more challenging demultiplexing task. The real dataset used as an example
in this section consists of rare cerebrospinal fluid cells (non-labeled) and
peripheral blood mononuclear cells (PBMCs) stained with
oligonucleotide-conjugated antibodies. In this design, the â€œnegativeâ€ cells
identified by demuxmix correspond to the CSF cells. The dataset is included
in the *demuxmix* package as a data frame.

```{r csfLoad}
data(csf)
head(csf)

csf <- csf[csf$NumGenes >= 200, ]
nrow(csf)
hto <- t(matrix(csf$HTO, dimnames = list(rownames(csf), "HTO")))
```

The data frame contains the number of HTO reads and the number of detected
genes per droplet in the first two columns. We remove all droplets with less
than 200 detected genes since these droplets are unlikely to contain intact
cells. The HTO counts are then converted into a matrix as required by
*demuxmix*. The matrix has only one row since only the PBMCs were stained.
For this example dataset, CSF cells and PBMCs from two genetically
unrelated donors were pooled so that genetic demultiplexing could be
used to benchmark the HTO-based demultiplexing. The third column contains
the result from the genetic demultiplexing using *freemuxlet* [@kang]. The
fourth column contains *freemuxletâ€™s* logarithmized posterior probability.

```{r csfDemuxmix}
dmm <- demuxmix(hto, rna = csf$NumGenes)
dmm

summary(dmm)

dmmOverlap(dmm)
```

*demuxmix* selected a regression mixture model with a significant
regression coefficient in the positive component, indicating that the number of
detected genes in the stained PBMCs is predictive of the HTO counts observed
in these cells. Although *demuxmix* selected a regression model for the
negative component as well, the smaller coefficient and the larger p-value
suggest that the association is much weaker in the CSF cells. This is common
since a larger amount of background HTOs is required in order to detect the
association in the negative droplets.

Next, we look at some QC plots.

```{r csfPlots, fig.height = 7, fig.cap = "Demultiplexing a pool of labeled PBMCs and non-labeled CSF cells. A) The density histogram overlaid with the mixture pmf shows a good separation between the positive red component (PMBCs) and the negative blue component (CSF cells). B) The scatter plot shows the number of HTO reads (x-axis) versus the number of detected genes (y-axis) on the logarithmic scale. The color indicates the posterior probability of the droplet containing a tagged cell."}
histo <- plotDmmHistogram(dmm)
scatter <- plotDmmScatter(dmm) + coord_cartesian(xlim = c(2, 4))
plot_grid(histo, scatter, labels = c("A", "B"), nrow = 2)
```

Overall, the histogram reveals a good model fit but also shows some background
staining of the CSF cells. The mean of the negative component is 
`r round(demuxmix:::getMu1(dmm@models[["HTO"]], standardize=TRUE), digits=1)`
reads. Still, the overlap with the positive component (mean of
`r round(demuxmix:::getMu2(dmm@models[["HTO"]], standardize=TRUE), digits=1)`)
is reasonably small, and the larger mean of the negative component is probably
driven partly by the large sequencing depth of the HTO library. Moreover,
the scatter plot shows that a smaller set of cells has a lower RNA content
(y-axis) and that those cells require less HTO counts (x-axis) in order
to be classified as positive (red color).

Finally, we use genetic demultiplexing to assess *demuxmix's* performance.
Multi-sample multiplets detected by freemuxlet are removed since multiplets
cannot be detected when just one of two samples is stained. We also
remove cells that were not classified with high confidence by freemuxlet.

```{r csfBenchmark}
class <- dmmClassify(dmm)
highConf <- csf$freemuxlet %in% c("0,0", "1,1") &
    exp(csf$freemuxlet.prob) >= 0.99
table(class$HTO[highConf], csf$freemuxlet[highConf])

# Sensitivity "P(class=PBMC | PBMC)"
sum(csf$freemuxlet[highConf] == "0,0" & class$HTO[highConf] == "HTO") /
    sum(csf$freemuxlet[highConf] == "0,0" & class$HTO[highConf] != "uncertain")

# Specificity "P(class=CSF | CSF)"
sum(csf$freemuxlet[highConf] == "1,1" & class$HTO[highConf] == "negative") /
    sum(csf$freemuxlet[highConf] == "1,1" & class$HTO[highConf] != "uncertain")
```

With the default acceptance probability of 0.9, *demuxmix* achieved a
sensitivity and specificity above 95%. Only
`r sum(class$HTO[highConf] == "uncertain")` cells were classified as
"uncertain" and have to be discarded.

For comparison, we run *demuxmix* again and, this time, manually select the
naive mixture model.

```{r csfBenchmarkNaive}
dmmNaive <- demuxmix(hto, model = "naive")
class <- dmmClassify(dmmNaive)
table(class$HTO[highConf], csf$freemuxlet[highConf])

# Sensitivity "P(class=PBMC | PBMC)"
sum(csf$freemuxlet[highConf] == "0,0" & class$HTO[highConf] == "HTO") /
    sum(csf$freemuxlet[highConf] == "0,0" & class$HTO[highConf] != "uncertain")

# Specificity "P(class=CSF | CSF)"
sum(csf$freemuxlet[highConf] == "1,1" & class$HTO[highConf] == "negative") /
    sum(csf$freemuxlet[highConf] == "1,1" & class$HTO[highConf] != "uncertain")
```

The naive model achieved a slightly lower sensitivity and specificity than the
regression mixture model. In addition, more cells were classified as
"uncertain", demonstrating the benefit of modeling the relationship between
the number of detected genes and HTO counts.


# Session Info
```{r sessionInfo}
sessionInfo()
```


# References