---
title: "`scPCA`: Sparse contrastive principal component analysis"
author: "Philippe Boileau and [Nima Hejazi](https://nimahejazi.org)"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document
bibliography: ../inst/REFERENCES.bib
vignette: >
  %\VignetteIndexEntry{Sparse contrastive principal component analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

```{r setup, echo=FALSE}
library(dplyr)
library(ggplot2)
library(ggpubr)
library(elasticnet)
library(scPCA)

knitr::opts_chunk$set(echo = TRUE)
```

Data pre-processing and exploratory data analysis and are two important steps in
the data science life-cycle. As data sets become larger and the signal weaker,
their importance increases. Thus, methods that are capable of extracting the
signal from such data sets is badly needed. Often, these steps rely on
dimensionality reduction techniques to isolate pertinent information in data.
However, many of the most commonly-used methods fail to reduce the dimensions of
these large and noisy data sets successfully.

Principal component analysis (PCA) is one such method. Although popular for its
interpretable results and ease of implementation, PCA's performance on
high-dimensional often leaves much to be desired. Its results on these large
datasets have been found to be unstable, and it is often unable to identify
variation that is contextually meaningful.

Fortunately, modifications of PCA have been developed to remedy these issues.
Namely, sparse PCA (sPCA) was created to increase the stability of the principal
component loadings and variable scores in high dimensions, and constrastive PCA
(cPCA) was proposed as a method for capturing relevant information in the
high-dimensional data [@abid2018exploring].

Although sPCA and cPCA have proven useful in resolving individual shortcomings
of PCA, neither is capable of tackling the issues of stability and relevance
simultaneously. The goal of this research project is to determine whether a
combination of these methods, dubbed sparse constrastive PCA (scPCA), can
accomplish this task.

---

# Installation

To install the latest stable release of the `scPCA` package from Bioconductor,
use [`BiocManager`](https://CRAN.R-project.org/package=BiocManager):

```{r install_bioc, eval=FALSE}
BiocManager::install("scPCA")
```

Note that development of the `scPCA` package is done via its GitHub repository.
If you wish to contribute to the development of the package or use features that
have not yet been introduced into a stable release, `scPCA` may be installed
from GitHub using [`remotes`](https://CRAN.R-project.org/package=remotes):

```{r install_github, eval=FALSE}
remotes::install_github("PhilBoileau/scPCA")
```

---

# Comparing PCA, SPCA, cPCA and scPCA

A brief comparion of PCA, SPCA, cPCA and scPCA is provided below. All four
methods are applied to a simulated target dataset consisting of `r nrow(toy_df)`
observations and `r ncol(toy_df) - 1` continuous variables. Additionally, each
observation is classified as belonging to one of four classes. This label is
known is known a priori. A background data set comprised of the same number of
variables as the target data set.

The target data was simulated as follows:

* Each of the first 10 variables was drawn from $N(0, 10)$
* For group 1 and 2, variables 11 through 20 were drawn from $N(0, 1)$
* For group 3 and 4, variables 11 through 20 were drawn from $N(3, 1)$
* For group 1 and 3, variables 21 though 30 were drawn from $N(-3, 1)$
* For group 2 and 4, variables 21 though 30 were drawn from $N(0, 1)$

The background data was simulated as follows:

* The first 10 variables were drawn from $N(0, 10)$
* Variables 11 through 20 were drawn from $N(0, 3)$
* Variables 21 through 30 were drawn from $N(0, 1)$

A similar simulation scheme is provided in @abid2018exploring.

## PCA

First, PCA is applied to the target data. As we can see from the figure, PCA is
incapable of creating a lower dimensional representation of the target data
that captures the variation of interest (i.e. the four groups). In fact, no
pair of principal components among the first twelve were able to.

```{r PCA_sim}
# set seed for reproducibility
set.seed(1742)

# load data
data(toy_df)

# perform PCA
pca_sim <- prcomp(toy_df[, 1:30])

# plot the 2D rep using first 2 components
df <- as_tibble(list("PC1" = pca_sim$x[, 1],
                     "PC2" = pca_sim$x[, 2],
                     "label" = as.character(toy_df[, 31])))
p_pca <- ggplot(df, aes(x = PC1, y = PC2, colour = label)) +
  ggtitle("PCA on Simulated Data") +
  geom_point(alpha = 0.5) +
  theme_minimal()
p_pca
```

## Sparse PCA

Much like PCA, the leading components of SPCA -- for varying amounts of
sparsity -- are incapable of splitting the observations into four distinct
groups.

```{r sPCA_sim}
# perform sPCA on toy_df for a range of L1 penalty terms
penalties <- exp(seq(log(10), log(1000), length.out = 6))
df_ls <- lapply(penalties, function(penalty) {
  spca_sim_p <- spca(toy_df[, 1:30], K = 2, para = rep(penalty, 2),
                     type = "predictor", sparse = "penalty")$loadings
  spca_sim_p <- as.matrix(toy_df[, 1:30]) %*% spca_sim_p
  spca_out <- list("SPC1" = spca_sim_p[, 1],
                   "SPC2" = spca_sim_p[, 2],
                   "penalty" = round(rep(penalty, nrow(toy_df))),
                   "label"  = as.character(toy_df[, 31])) %>%
    as_tibble()
  return(spca_out)
})
df <- bind_rows(df_ls)

# plot the results of sPCA
p_spca <- ggplot(df, aes(x = SPC1, y = SPC2, colour = label)) +
  geom_point(alpha = 0.5) +
  ggtitle("SPCA on Simulated Data for Varying L1 Penalty Terms") +
  facet_wrap(~ penalty, nrow = 2) +
  theme_minimal()
p_spca
```

## Contrastive PCA (cPCA)

The first two contrastive principal components of cPCA successfully captured
the variation of interest in the data with the help of the background data set.
To fit contrastive PCA with the `scPCA` function of this package, simply select
no penalization (by setting argument `penalties = 0`):

```{r cPCA_sim}
# run cPCA across with a penalty of zero to recover Abid et al.'s cPCA
cpca_sim <- scPCA(target = toy_df[, 1:30],
                  background = background_df,
                  penalties = 0,
                  n_centers = 4)

# create a dataframe to be plotted
cpca_df <- cpca_sim$x %>%
  as_tibble() %>%
  mutate(label = toy_df[, 31] %>% as.character)
colnames(cpca_df) <- c("cPC1", "cPC2", "label")

# plot the results
p_cpca <- ggplot(cpca_df, aes(x = cPC1, y = cPC2, colour = label)) +
  geom_point(alpha = 0.5) +
  ggtitle("cPCA of Simulated Data") +
  theme_minimal()
p_cpca
```

## scPCA: Sparse Contrastive PCA

The leading sparse contrastive components were also able to capture the
variation of interest, though the clusters corresponding to the class labels
are more loose than those of cPCA. Importantly, the first and second scPC
contain only two and three non-zero loadings, respectively -- a significant
improvement over cPCA, whose first and second cPCs each possess 30 non-zero
loadings, in terms of interpretability.

```{r scPCA_sim}
# run scPCA for using 40 logarithmically seperated contrastive parameter values
# and possible 20 L1 penalty terms
scpca_sim <- scPCA(target = toy_df[, 1:30],
                   background = background_df,
                   n_centers = 4)

# create a dataframe to be plotted
scpca_df <- scpca_sim$x %>%
  as_tibble() %>%
  mutate(label = toy_df[, 31] %>% as.character)
colnames(scpca_df) <- c("scPC1", "scPC2", "label")

# plot the results
p_scpca <- ggplot(scpca_df, aes(x = scPC1, y = scPC2, colour = label)) +
  geom_point(alpha = 0.5) +
  theme_minimal()
p_scpca


# create the loadings comparison plot
load_diff_df <- bind_rows(
  cpca_sim$rotation %>% as.data.frame,
  scpca_sim$rotation %>% as.data.frame) %>%
  mutate(
    sparse = c(rep("0", 30), rep("1", 30)),
    coef = rep(1:30, 2)
  )
colnames(load_diff_df) <- c("comp1", "comp2", "sparse", "coef")

p1 <- load_diff_df %>%
  ggplot(aes(y = comp1, x = coef, fill = sparse)) +
  geom_bar(stat = "identity") +
  xlab("") +
  ylab("") +
  ylim(-1, 1) +
  ggtitle("First Component") +
  scale_fill_discrete(name = "Method", labels = c("cPCA", "scPCA")) +
  theme_minimal()

p2 <- load_diff_df %>%
  ggplot(aes(y = comp2, x = coef, fill = sparse)) +
  geom_bar(stat = "identity") +
  xlab("") +
  ylab("") +
  ylim(-1, 1) +
  ggtitle("Second Component") +
  scale_fill_discrete(name = "Method", labels = c("cPCA", "scPCA")) +
  theme_minimal()

# build full plot via ggpubr
annotate_figure(
  ggarrange(p1, p2, nrow = 1, ncol = 2,
            common.legend = TRUE, legend = "right"),
  top = "Leading Loadings Vectors Comparison",
  left = "Loading Weights",
  bottom = "Variable Index"
)
```

---

# Bioconductor Integration via `SingleCellExperiment`

We now turn to discussing how the tools in the `scPCA` package can be used more
readily with data structures common in computational biology by examining their
integration with the `SingleCellExperiment` container class. For our example, we
will use `splatter` to simulate a scRNA-seq data set using the Splatter
framework [@zappia2017splatter]. This method simulates a scRNA-seq count matrix
by way of a gamma-Poisson hierarchical model, where the mean expression level of
gene $g_i,\; i = 1, \ldots, p$ is sampled from a gamma distribution, and the
count $x_{i, j}, \; j = 1, \ldots, n$ of cell $c_j$ is sampled from a Poisson
distribution with mean equal to the mean expression level of $g_i$.

To start, let's load the required packages and create a simple dataset of 300
cells and 500 genes. The cells are evenly split among three biological groups.
The samples in two of these groups possess genes that are highly differentially
expressed when compared to those in other groups; they comprise the target data.
The genes of the third group of cells are less differentially expressed to the
genes in the target data, and so it is considered the background data set. A
large batch effect is simulated to confound the biological signal.

```{r sce_setup, message=FALSE}
library(splatter)
library(SingleCellExperiment)

# Simulate the three groups of cells. Mask cell heterogeneity with batch effect
params <- newSplatParams(
  seed = 6757293,
  nGenes = 500,
  batchCells = c(150, 150),
  batch.facLoc = c(0.05, 0.05),
  batch.facScale = c(0.05, 0.05),
  group.prob = rep(1/3, 3),
  de.prob = c(0.1, 0.05, 0.1),
  de.downProb = c(0.1, 0.05, 0.1),
  de.facLoc = rep(0.2, 3),
  de.facScale = rep(0.2, 3)
)
sim_sce <- splatSimulate(params, method = "groups")
```

To proceed, we log-transform the raw counts and retain only the 250 most
variable genes. We then split the simulated data into target and background data
sets. Our goal here is to demonstrate a typical assessment of scRNA-seq data
(and data from similar assays) using the tools made available in the `scPCA`
package. A standard analysis would follow a workflow largely similar to the one
below, though without such a computationally convenient data set.

```{r make_sce_subs, message=FALSE}
# rank genes by variance
n_genes <- 250
vars <- assay(sim_sce) %>%
  log1p %>%
  rowVars
names(vars) <- rownames(sim_sce)
vars <- sort(vars, decreasing = TRUE)

# subset SCE to n_genes with highest variance
sce_sub <- sim_sce[names(vars[seq_len(n_genes)]),]
sce_sub

# split the subsetted SCE into target and background SCEs
tg_sce <- sce_sub[, sce_sub$Group %in% c("Group1", "Group3")]
bg_sce <- sce_sub[, sce_sub$Group %in% c("Group2")]
```

Note that we limit our analysis to just `r n_genes` genes in the interest of
time, a typical analysis would generally include a much larger proportion (if
not all) of the genes assayed.

Owing to the flexibility of the `SingleCellExperiment` class, we are able to
generate PCA, cPCA, and scPCA representations of the target data, storing these
in `SingleCellExperiment` object using the `reducedDims` method.

Below, we perform standard PCA on the log-transformed target data, which has
been centered and scaled, and perform both cPCA and cPCA using the `scPCA`
function, storing each in a separate object. After applying each of these
dimension reduction techniques, we store the resultant objects in a `SimpleList`
that is then appended to the `SingleCellExperiment` object using the
`reducedDims` accessor. The results are presented in the following figure. cPCA
and scPCA successfully remove the batch effect, though PCA is incapable of doing
so in two dimensions.

```{r perform_dimred, message=FALSE}
# for both cPCA and scPCA, we'll set the penalties and contrasts arguments
contrasts <- exp(seq(log(0.1), log(100), length.out = 5))
penalties <- seq(0.2, 1, length.out = 3)

# first, PCA
pca_out <- prcomp(t(log1p(counts(tg_sce))), center = TRUE, scale. = TRUE)

# next, cPCA
cpca_out <- scPCA(t(log1p(counts(tg_sce))),
                  t(log1p(counts(bg_sce))),
                  n_centers = 2,
                  n_eigen = 2,
                  contrasts = contrasts,
                  penalties = 0,
                  center = TRUE,
                  scale = TRUE)

# finally, scPCA
scpca_out <- scPCA(t(log1p(counts(tg_sce))),
                   t(log1p(counts(bg_sce))),
                   n_centers = 2,
                   n_eigen = 2,
                   contrasts = contrasts,
                   penalties = penalties,
                   center = TRUE,
                   scale = TRUE)

# store new representations in the SingleCellExperiment object
reducedDims(tg_sce) <- SimpleList(PCA = pca_out$x[, 1:2],
                                  cPCA = cpca_out$x,
                                  scPCA = scpca_out$x)
tg_sce
```

```{r plot_red_dims, echo=FALSE, fig.asp=.25, out.width="1600px"}
# plot the 2D representations

# prepare PCA
pca_df <- data.frame(
  PC1 = pca_out$x[, 1],
  PC2 = pca_out$x[, 2],
  group = tg_sce$Group,
  batch = tg_sce$Batch
)
pca_p <- pca_df %>%
  ggplot(aes(x = PC1, y = PC2, colour = group, shape = batch)) +
  geom_point(alpha = 0.75) +
  ggtitle("PCA") +
  scale_colour_viridis_d(name = "Target Group", labels = c("1", "2"),
                         begin = 0.1, end = 0.9) +
  scale_shape_discrete(name = "Batch") +
  theme_minimal()

# prepare cPCA
cpca_df <- data.frame(
  cPC1 = cpca_out$x[, 1],
  cPC2 = cpca_out$x[, 2],
  group = tg_sce$Group,
  batch = tg_sce$Batch
)
cpca_p <- cpca_df %>%
  ggplot(aes(x = cPC1, y = cPC2, colour = group, shape = batch)) +
  geom_point(alpha = 0.75) +
  ggtitle("cPCA") +
  scale_colour_viridis_d(name = "Target Group", labels = c("1", "2"),
                         begin = 0.1, end = 0.9) +
  scale_shape_discrete(name = "Batch") +
  theme_minimal()

# prepare scPCA
scpca_df <- data.frame(
  scPC1 = scpca_out$x[, 1],
  scPC2 = scpca_out$x[, 2],
  group = tg_sce$Group,
  batch = tg_sce$Batch
)
scpca_p <- scpca_df %>%
  ggplot(aes(x = scPC1, y = scPC2, colour = group, shape = batch)) +
  geom_point(alpha = 0.75) +
  ggtitle("scPCA") +
  scale_colour_viridis_d(name = "Target Group", labels = c("1", "2"),
                         begin = 0.1, end = 0.9) +
  scale_shape_discrete(name = "Batch") +
  theme_minimal()

# combine the plots
ggarrange(
  pca_p, cpca_p, scpca_p,
  nrow = 1,
  common.legend = TRUE,
  legend = "right"
)
```

In the above, we set `n_eigen = 2` in the calls to the `scPCA` function that
generate the cPCA and scPCA output, recovering the rotated gene-level data for
just the first two components of the dimension reduction. While this is done to
be explicit (as `n_eigen = 2` by default), we wish to emphasize that it will
often be appropriate to set this to a higher value in order to recover further
dimensions generated by these techniques, as such additional information may be
useful in exploring further signal in the data at hand. For congruence with cPCA
and scPCA, we retain only the first two dimensions generated by PCA in the
information stored in the `SingleCellExperiment` object.

---

# Parallelization

Finally, note that the `scPCA` function has an argument `parallel` (set to
`FALSE` by default), which facilitates parallelized computation of the various
subroutines required in constructing the output of the `scPCA` function. In a
standard analysis of genomic data, use of this parallelization will be crucial,
thus, each of the core subroutines of `scPCA` has an equivalent parallelized
variant that makes use of the infrastructure provided by the [`BiocParallel`
package](https://bioconductor.org/packages/BiocParallel). In order to make
effective use of this paralleization, one need only set `parallel = TRUE` in a
call to `scPCA` after having registered a particular parallelization back-end
for parallel evaluation as described in the `BiocParallel` documentation. An
example of this form of parallelization follows:

```{r bpscpca, eval=FALSE}
# perform the same operations in parallel using BiocParallel
library(BiocParallel)
param <- MulticoreParam()
register(param)

# perfom cPCA
cpca_bp <- scPCA(
  target = toy_df[, 1:30],
  background = background_df,
  contrasts = exp(seq(log(0.1), log(100), length.out = 10)),
  penalties = 0,
  n_centers = 4,
  parallel = TRUE
)

# perform scPCA
scpca_bp <- scPCA(
  target = toy_df[, 1:30],
  background = background_df,
  contrasts = exp(seq(log(0.1), log(100), length.out = 10)),
  penalties = seq(0.1, 1, length.out = 9),
  n_centers = 4,
  parallel = TRUE
)
```

---

# Session Information

```{r session_info, echo=FALSE}
sessionInfo()
```

---

# References