\documentclass{article}
\usepackage{fullpage,authblk,placeins}
\usepackage[hidelinks]{hyperref}
%\VignetteIndexEntry{Using genomicInstability}

\title{Using genomicInstability, a package for inferring the genomic instability of cells from scRNA-Seq data}
\author[1,2]{Pasquale Laise}
\author[1,2]{Mariano J. Alvarez}
\affil[1]{DarwinHealth Inc, 333 West 39\textsuperscript{th} Street, New York, NY 10018}
\affil[2]{Department of Systems Biology, Columbia University, 1130 St. Nicholas Ave., New York, NY 10032}
\date{\today}

\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle

%-----------
\section{Overview of genomicInstability}\label{sec:overview}

Cellular heterogeneity is a hallmark of cancer\cite{Hanahan2011}.
Transcriptional analysis of individual cells (single-cell RNA sequencing, scRNA-Seq) has significantly improved our understanding of intratumor heterogeneity, showing that many tumors are composed by normal and neoplastic cells in different phenotypic states---cellular subtypes\cite{Patel2014,Puram2017}.
Consequently, the analysis of tumor-related transcriptional programs requires an accurate classification of each individual cell as putative neoplastic or normal.

Computational approaches for the identification of putative malignant and normal cells from single cell gene expression profiles typically rely on unsupervised cluster analysis, followed by cell type identification based on the expression of known marker genes, and methods based on copy number variation (CNV) inference from single-cell gene expression data\cite{Patel2014,Puram2017,Muller2017,Fan2020,Fan2018,Serin2020}.
Cluster based analyses are affected by multiple factors, including the choice of similarity metric and cluster algorithm, and inclusion/exclusion criteria of individual marker genes used for cluster annotation.
CNV-based methods, instead, are typically computationally intensive, often requiring low-level sequencing data (BAM files) as input\cite{Serin2020}, as well as prior biological knowledge, such as a tumor-matched normal reference samples\cite{Patel2014} or a single-nucleotide variant (SNV) call set\cite{Fan2018}, which are not always available.

Here, we present the implementation of a new genomic instability analysis (GIA) method for identifying malignant cells from single-cell gene expression data, which does not require any prior biological knowledge or matched normal samples.
GIA uses the aREA algorithm\cite{Alvarez2016} to quantitatively estimate the association between gene expression and chromosomal location by performing enrichment analysis of contiguously coded genes (loci-blocks) on the single cell gene expression profiles.
Such associations, which are purely the product of transcriptional co-regulation in normal cells, can also be generated by genomic alterations typical of neoplastic cells, like amplification and deletion of chromosomal regions, collectively referred to as CNVs.
In a mixture of normal and tumor cells, as it is usually the case when tumor samples are profiled at the single-cell level, it is to expect that the genomic aberrations in the tumor cells would account for a globally higher level of loci-block enrichment.
In particular, a higher genomic instability---more CNVs--would be reflected on a larger dynamic range for the enrichment of the loci-blocks on the expression profile.
GIA uses the variance of such enrichment, across all loci-bloks, to quantitatively estimate the level of genomic instability for each cell (Genomic Instability Score, GIS).

The package \texttt{genomicInstability} provides an implementation of the GIA algorithm for the R-system environment, including functions to quantify the association between expression and loci-blocks (\texttt{inferCNV()}); functions to estimate the genomic instability score of each cell (\texttt{genomicInstabilityScore()}); functions to estimate the likelihood of each of the analyzed cells to be genomically unstable (\texttt{giLikelihood()}); as well as functions for the graphical representation of the results, including heatmaps for the loci-blocks and density distribution plots for the genomic instability estimates.

%--------
%\section{Citation}

%--------
\section{Installation of \emph{genomicInstability} package}
The \emph{genomicInstability} package requires the \href{http://www.r-project.org}{R-system}, the \href{https://cran.r-project.org/web/packages/checkmate/index.html}{checkmate package}, the \href{http://www.bioconductor.org}{mixtools package}, and the \href{https://www.bioconductor.org/packages/release/bioc/html/ExperimentHub.html}{ExperimentHub}, \href{https://www.bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html}{SingleCellExperiment} and \href{https://cran.r-project.org/web/packages/pROC/}{pROC} packages to run the examples in this vignette.
After installing R, all required components can be obtained with:

<<echo=TRUE, eval=FALSE, results=hide>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
install.packages("checkmate")
BiocManager::install("mixtools")
BiocManager::install("SingleCellExperiment")
BiocManager::install("ExperimentHub")
BiocManager::install("genomicInstability")
install.packages("pROC")
@

%--------
\section{Use-case example for the analysis of HNSC and normal cells}
For the example analysis in this vignette, we will use the Head and Neck Squamous Carcinoma (HNSC) scRNA-Seq dataset by Puram et. al \cite{Puram2017}.
This dataset is available from Gene Expression Omnibus (GEO) as \href{https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103322}{GSE102233}.

%----
\subsection{Getting started}
As first step, we have to load the genomicInstability environment with:
<<echo=TRUE, results=hide>>=
library(genomicInstability)
@

%------
\subsection{Estimating the association between gene expression and loci-blocks}
For convenience, we will obtain the data and metadata for this example case as a \emph{SingleCellExperiment}-class data object using the package \texttt{ExpressionHub}.

The access number for GSE102233 in ExperimentHub is ``EH5419'' and we can access the data and metadata as a SingleCellExperiment-class (SCE) object with:

<<echo=TRUE, results=hide>>=
library(SingleCellExperiment)
library(ExperimentHub)
eh <- ExperimentHub()
dset <- eh[["EH5419"]]
@

The data in this SCE-class object is a Transcripts-per-million (TPM) matrix for 21,341 known genes (EntrezID) across 5,902 cells.
The metadata includes information regarding the enzyme used for reverse transcription, whether the sample was isolated from lymph nodes or is a primary tumor, and whether the cells are classified as tumor or normal, as well as the type of normal cell according to Puram et. al\cite{Puram2017}.
To reduce execution time, we will limit the analysis in this vignette to a subset of 500 cells, selected uniformly at random---we fix the random seed for reproducibility purposes---from the 5,902 represented in the dataset.

<<echo=TRUE, results=verbatim>>==
tpm_matrix <- assays(dset)$TPM
set.seed(1)
pos <- sample(ncol(tpm_matrix), 500)
tpm_matrix <- tpm_matrix[, pos]
metadata <- colData(dset)
@

The association between gene expression and loci-blocks is estimated by the function \texttt{inferCNV()}.
The minimum input for this function is a matrix of gene expression.
If available, a matrix of gene expression for a normal (non-neoplastic) samples can also be provided.
The matrix of normal samples is not essential nor strictly required by the algorithm, but it provides a better prior for the estimation of CNVs and the tumor vs. normal likelihood (see below).

By default, \texttt{inferCNV()} generates loci-blocks composed by 100 contiguous genes and with 75\% overlap between blocks.
This parameters can be modified by the \texttt{species}, \texttt{k} and \texttt{skip} arguments of the \texttt{inferCNV()} function, which define the species to be analyzed---currently ``human'' and ``mouse'' are implemented---, the number of genes for each loci-block, and the displacement, in number of genes, between loci-blocks, respectively.

For this example, we are going to use only the gene expression matrix and default parameters.
The output of the \texttt{inferCNV{}} function is an inferCNV-class object with slots \texttt{nes}, \texttt{null} and \texttt{param}.
The enrichment of each loci-block on the gene expression profile of each cell/sample is expressed as Normalized Enrichment Scores (NES), and is stored in the \texttt{nes} slot.

<<echo=TRUE, results=verbatim>>=
cnv <- inferCNV(tpm_matrix)
class(cnv)
names(cnv)
dim(cnv$nes)
@

%-------
\subsection{Estimating the Genomic Instability Score (GIS)}
The Genomic Instability Score (GIS) is estimated as the log\textsubscript{2} of the NES variance for each cell, and the computation is performed by the function \texttt{genomicInstabilityScore()}.
This function generates an updated inferCNV-class object including a vector of computed GIS, with length equal to the number of cells---500 for this example---, as the slot \texttt{gis}.

<<echo=TRUE, results=verbatim>>=
cnv <- genomicInstabilityScore(cnv)
names(cnv)
length(cnv$gis)
@

For a dataset consisting on a mixture of normal and tumor cells, we would expect GIS to be distributed with at least two modes (fig. \ref{fig:density-plot}).

<<label=density, fig=TRUE, include=FALSE, echo=TRUE, width=6, height=4>>=
par(mai=c(.8, .8, .2, .2))
plot(density(cnv$gis), lwd=2, xlab="GIS", main="")
@

\begin{figure}
  \centering
  \includegraphics[width=.5\textwidth]{genomicInstability-density}
  \caption{Density distribution for the GIS.}\label{fig:density-plot}
\end{figure}

%----------
\subsection{Estimating the relative likelihood of a cell to be neoplastic}
The relative likelihood for each of the cells of being genomically unstable (neoplastic) can be estimated from a mixture of Gaussian distributions fitted to the GIS estimates.
Parameter estimates for the mixture of Gaussian distributions are computed by the \texttt{giLikelihood()} function.
The output is an updated inferCNV-class object with two new slots: (1) \texttt{cnv\_fit}, which is a list of fitted parameters and fitting metrics, and (2) \texttt{gi\_likelihood}, which contains a vector of estimated likelihood of genomic instability for the analyzed cells.

<<echo=TRUE, results=hide>>=
cnv <- giLikelihood(cnv)
@

<<echo=TRUE, results=verbatim>>=
names(cnv)
names(cnv$gi_fit)
length(cnv$gi_likelihood)
@

The fitted Gaussian models can be visualized with the function \texttt{giDensityPlot()} (Fig. \ref{fig:giDensity}).

<<label=fit, fig=TRUE, include=FALSE, echo=TRUE, width=6, height=4>>=
giDensityPlot(cnv)
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{genomicInstability-fit}
\caption{Distribution density for the GIS (grey filled density curve) and 3 gaussian distributions fitted to it.}\label{fig:giDensity}
\end{figure}

Inspection of the Gaussian distributions fitted to the GIS (see Fig. \ref{fig:giDensity}) suggests that the Gaussian model with smallest mean might represent normal, genomically stable cells, while the 2 Gaussian models with largest mean might represent the genomically unstable, tumor cells.
Because the \texttt{giLikelihood()} function default is to consider the gaussian model with lowest mean as representing genomically stable (normal) cells, while all other fitted models as representing genomically unstable (tumor) cells, we could directly interpret the estimates in the \texttt{gi\_likelihood} as the probability of each of the cells to be genomically unstable.
This behavior can also be forced, in particular for cases not being properly accounted by the \texttt{giLikelihood()} function defaults, indicating the fitted models most likely representing genomically stable (normal) and unstable (tumor) cells:

<<echo=TRUE, results=hide>>=
cnv <- giLikelihood(cnv, recompute=FALSE, normal=1, tumor=2:3)
@

We can see the values for the estimated likelihood displayed over the GIS distribution (Fig. \ref{fig:likelihood}) with the following code:

<<label=likelihood, fig=TRUE, include=FALSE, echo=TRUE, width=6, height=4>>=
# Plotting the density distribution for GIS and fitted models
par(mai=c(.8, .8, .2, .8))
giDensityPlot(cnv, ylim=c(0, .8))
# Adding the likelihood data and second-axis
pos <- order(cnv$gis)
lines(cnv$gis[pos], cnv$gi_likelihood[pos]*.8, lwd=2)
axis(4, seq(0, .8, length=6), seq(0, 1, .2))
axis(4, .4, "Relative likelihood", tick=FALSE, line=1.5)
pos5 <- which.min((.5-cnv$gi_likelihood)^2)
lines(c(rep(cnv$gis[pos5], 2), max(cnv$gis*1.05)), c(0,
  rep(cnv$gi_likelihood[pos5]*.8, 2)), lty=3)
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{genomicInstability-likelihood}
\caption{Distribution density for the GIS (grey filled density curve) and 3 Gaussian distributions fitted to it (shown in red, green and purple; corresponding to the left \emph{y-axis}), and estimated relative likelihood for each cell to be genomically unstable (shown by the black line; corresponding to the right \emph{y-axis}). The dotted line indicates the GIS corresponding to equal probability of a cell of being genomically stable or unstable.}\label{fig:likelihood}
\end{figure}


%----------
\subsection{Using the genomically stable cells as reference for the analysis}

We can use the cells showing the highest likelihood of being genomically stable (Genomic Instability (GI) likelihood $<$ 0.25) as reference when estimating the enrichment of the loci-blocks and computing the GIS.

<<echo=TRUE, results=verbatim>>=
cnv_norm <- inferCNV(tpm_matrix, nullmat=tpm_matrix[, cnv$gi_likelihood<.25,
                                                    drop=FALSE])
names(cnv_norm)
@

Using a set of normal cells as reference allows for a better estimation of the loci-block enrichment (NES), GIS and GI likelihood, which can be estimated by the function \texttt{genomicInstabilityScore()} by setting its argument \texttt{likelihood=TRUE}:

<<echo=TRUE, results=hide>>=
cnv_norm <- genomicInstabilityScore(cnv_norm, likelihood=TRUE)
@

The fitted Gaussian models, as well as the NULL model based on the cells previously considered as ``genomically stable'', can be displayed over the GIS density plot with the function \texttt{giDensityPlot()}.
The following code shows how to add the genomic instability likelihood estimations to the plot (Fig. \ref{fig:likelihoodNorm}):

<<label=fitNull, fig=TRUE, include=FALSE, echo=TRUE, width=6, height=4>>=
# Plotting the density distribution for GIS and fitted models
par(mai=c(.8, .8, .2, .8))
giDensityPlot(cnv_norm, ylim=c(0, 1.1))
# Adding the likelihood data and second-axis
pos <- order(cnv_norm$gis)
lines(cnv_norm$gis[pos], cnv_norm$gi_likelihood[pos], lwd=2, col="blue")
axis(4, seq(0, 1, length=6), seq(0, 1, .2), col="blue", col.axis="blue")
axis(4, .5, "Relative likelihood", tick=FALSE, line=1.5, col.axis="blue")
pos5 <- which.min((.5-cnv_norm$gi_likelihood)^2)
lines(c(rep(cnv_norm$gis[pos5], 2), max(cnv_norm$gis*1.05)),
      c(0, rep(cnv_norm$gi_likelihood[pos5], 2)), lty=3, col="blue")
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{genomicInstability-fitNull}
\caption{Distribution density for the GIS (grey filled density curve) and 2 Gaussian distributions fitted to it (shown in red and purple). The GIS distribution for the NULL model (cells considered genomically stable) is shown by the black line. The estimated relative likelihood for each cell to be genomically unstable is shown by the blue line. The dotted line indicates the GIS corresponding to equal probability of a cell of being genomically stable or unstable.}\label{fig:likelihoodNorm}
\end{figure}

The distribution for the GIS based on the original classification of the cells by Puram et.al \cite{Puram2017} can be displayed over this last plot with the following code (Fig. \ref{fig:liNormClass}):

<<label=fitClass, fig=TRUE, include=FALSE, echo=TRUE, width=6, height=4>>=
metadata_tumor <- as.vector(metadata$classified..as.cancer.cell)[match(colnames(tpm_matrix),
    rownames(metadata))]
metadata_tumor <- as.logical(as.numeric(metadata_tumor))
# Plotting the density distribution for GIS and fitted models
par(mai=c(.8, .8, .2, .8))
giDensityPlot(cnv_norm, ylim=c(0, 1.1))
# Estimating GIS density distributions for normal and tumor cells
gis_normal <- cnv_norm$gis[!metadata_tumor]
gis_tumor <- cnv_norm$gis[metadata_tumor]
den_normal <- density(gis_normal, from=min(gis_normal), to=max(gis_normal))
den_tumor <- density(gis_tumor, from=min(gis_tumor), to=max(gis_tumor))
# Scaling the densities based on the inferred proportions for each group
den_normal$y <- den_normal$y * cnv_norm$gi_fit$lambda[1]
den_tumor$y <- den_tumor$y * sum(cnv_norm$gi_fit$lambda[2])
# Function to add the density plot
addDensity <- function(x, col="grey") {
  polygon(c(x$x[1], x$x, x$x[length(x$x)], x$x[1]), c(0, x$y, 0, 0), col=col)
}
# Adding the density plots
addDensity(den_normal, col=hsv(.6, .8, .8, .4))
addDensity(den_tumor, col=hsv(.05, .8, .8, .4))
# Adding the likelihood data and second-axis
pos <- order(cnv_norm$gis)
lines(cnv_norm$gis[pos], cnv_norm$gi_likelihood[pos], lwd=2, col="darkgreen")
axis(4, seq(0, 1, length=6), seq(0, 1, .2), col="darkgreen", col.axis="darkgreen")
axis(4, .5, "Relative likelihood", tick=FALSE, line=1.5, col.axis="darkgreen")
pos5 <- which.min((.5-cnv_norm$gi_likelihood)^2)
lines(c(rep(cnv_norm$gis[pos5], 2), max(cnv_norm$gis*1.05)),
      c(0, rep(cnv_norm$gi_likelihood[pos5], 2)), lty=3, col="blue")
# Adding the legend
legend(c(.9, 1), c("Normal", "Tumor"), fill=hsv(c(.6, .05), .8, .8, .4), bty="n")
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{genomicInstability-fitClass}
\caption{Distribution density for the GIS (grey filled density curve) and 2 Gaussian distributions fitted to it (shown in red and purple). The GIS distribution for the NULL model (cells considered genomically stable) is shown by the black line. The estimated relative likelihood for each cell to be genomically unstable is shown by the green line. The dotted line indicates the GIS corresponding to equal probability of a cell of being genomically stable or unstable. The GIS for the normal and tumor cells according to Puram et. al \cite{Puram2017} is shown by the solid blue and red distributions, respectively.}\label{fig:liNormClass}
\end{figure}


The results in figure \ref{fig:liNormClass} show a good agreement between GIA-based inferences of genomic instability and the classification performed by Puram et.al.
In fact, we can estimate the statistical significance of the overlap between Puram et.al classification and the GIA-based inferences---i.e. GI likelihood > 0.5---using the Fisher's Exact Test (FET):

<<echo=TRUE, results=verbatim>>=
fisher.test(metadata_tumor, cnv_norm$gi_likelihood>.5, alternative="greater")
@

To explore further this agreement, we can do a Received Operating Characteristic (ROC) curve analysis (Fig. \ref{fig:roc}).
This analysis requires the \texttt{pROC} package, which is available from CRAN (\url{https://cran.r-project.org/web/packages/pROC/}).

<<label=roc, fig=TRUE, include=FALSE, echo=TRUE, width=4, height=4>>=
# Loading the pROC package
library(pROC)
roc_res <- roc(response=as.numeric(metadata_tumor), predictor=cnv_norm$gi_likelihood, auc=TRUE, ci=TRUE)
par(mai=c(.8, .8, .2, .8))
plot(roc_res)
legend("bottomright", c(paste0("AUC: ", round(roc_res$auc, 3)), paste0("CI 95%: ",
  round(roc_res$ci[1], 3), "-", round(roc_res$ci[3], 3))), bty="n")
@

\begin{figure}
\centering
\includegraphics[width=.45\textwidth]{genomicInstability-roc}
\caption{ROC curve, Area Under the Cure (AUC) and its 95\% confidence interval (CI) for the GIS, as predictor of genomically unstable (tumor) cells.}\label{fig:roc}
\end{figure}

%-----
\subsection{Plotting the loci-block enrichment results}

The enrichment for each of the loci-blocks in each of the analyzed cells can be plotted using a heatmap.
This can be done with the \texttt{plot()} function (Fig. \ref{fig:heatmap}).

<<label=heatmap, echo=TRUE>>=
plot(cnv_norm, output="heatmap.png", resolution=120, gamma=2)
@

\begin{figure}
\centering
\includegraphics[width=\textwidth]{heatmap.png}
\caption{Heatmap for the enrichment of loci-blocks on the expression profile of each cell. Loci-blocks (columns) have been organized by chromosome. Cells (rows) have been sorted by GIS and divided in 3 groups: (1) Unstable (GI likelihood $>$ 0.8), (2) Intermediate (0.2 $\geq$ GI likelihood $\geq$ 0.8), and (3) Stable (GI likelihood $<$ 0.2). Red color in the heatmap indicates potential genomic amplifications, while blue color shows potential deletions.}\label{fig:heatmap}
\end{figure}

%---------
\FloatBarrier
\section{Session Information}

<<echo=FALSE, results=verbatim>>=
sessionInfo()
@

%-----------
\begin{thebibliography}{9}
\bibitem{Hanahan2011} Hanahan, D. \& Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646--674 (2011).

\bibitem{Patel2014} Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396--1401 (2014).

\bibitem{Puram2017} Puram, S. V. et al. Single-Cell Transcriptomic Analysis of Primary and Metastatic Tumor Ecosystems in Head and Neck Cancer. Cell 171, 1611--1624 (2017).

\bibitem{Muller2017} Muller, S. \& Diaz, A. Single-Cell mRNA Sequencing in Cancer Research: Integrating the Genomic Fingerprint. Front. Genet. 8, 73 (2017).

\bibitem{Fan2020} Fan, J., Slowikowski, K. \& Zhang, F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp. Mol. Med. 52, 1452--1465 (2020).

\bibitem{Fan2018} Fan, J. et al. Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. Genome Res. 28, 1217--1227 (2018).

\bibitem{Serin2020} Serin Harmanci, A., Harmanci, A. O. \& Zhou, X. CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data. Nat. Commun. 11, 89, doi:10.1038/s41467-019-13779-x (2020)

\bibitem{Alvarez2016} Alvarez, M. J. et al. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat. Genet. 48, 838--47 (2016).

\end{thebibliography}
%----------
\end{document}