---
title: "Plotting single cell data with schex"
author: "Saskia Freytag"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{using_schex}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(ggplot2)
theme_set(theme_classic())
```

Reduced dimension plotting is one of the essential tools for the
analysis of single cell data. However, as the number of cells/nuclei in
these these plots increases, the usefulness of these plots decreases.
Many cells are plotted on top of each other obscuring information, even
when taking advantage of transparency settings. This package provides
binning strategies of cells/nuclei into hexagon cells. Plotting
summarized information of all cells/nuclei in their respective hexagon
cells presents information without obstructions. The package seemlessly
works with the two most common object classes for the storage of single
cell data; `SingleCellExperiment` from the
[SingleCellExperiment](https://bioconductor.org/packages/3.9/bioc/html/SingleCellExperiment.html)
package and `Seurat` from the [Seurat](https://satijalab.org/seurat/)
package.

## Load libraries

```{r setup, message=FALSE}
library(igraph)
library(schex)
library(TENxPBMCData)
library(scater)
library(scran)
library(ggrepel)
```

## Setup single cell data

In order to demonstrate the capabilities of the schex package, I will
use the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely
available from 10x Genomics. There are 2,700 single cells that were
sequenced on the Illumina NextSeq 500. This data is handly availabe in
the [`TENxPBMCData`
package](http://bioconductor.org/packages/release/data/experiment/html/TENxPBMCData.html).

```{r load}
tenx_pbmc3k <- TENxPBMCData(dataset = "pbmc3k")

rownames(tenx_pbmc3k) <- uniquifyFeatureNames(
    rowData(tenx_pbmc3k)$ENSEMBL_ID,
    rowData(tenx_pbmc3k)$Symbol_TENx
)
```

In the next few sections, I will perform some simple quality control
steps including filtering and normalization. I will then calculate
various dimension reductions and cluster the data. These steps do by no
means constitute comprehensive handling of the data. For a more detailed
guide the reader is referred to the following guides:

-   [Luecken MD and Theis FJ; Current best practices in singleâ€cell
    RNAâ€seq analysis: a tutorial; Molecular Systems Biology,
    2019](https://www.embopress.org/doi/10.15252/msb.20188746)
-   [Lun ATL, McCarthy DJ and Marioni JC; A step-by-step workflow for
    low-level analysis of single-cell RNA-seq data with Bioconductor;
    F1000 Research, 2016](https://f1000research.com/articles/5-2122)

### Filtering

I filter cells with high mitochondrial content as well as cells with low
library size or feature count.

```{r filter-cells}
rowData(tenx_pbmc3k)$Mito <- grepl("^MT-", rownames(tenx_pbmc3k))
colData(tenx_pbmc3k) <- cbind(
    colData(tenx_pbmc3k),
    perCellQCMetrics(tenx_pbmc3k,
        subsets = list(Mt = rowData(tenx_pbmc3k)$Mito)
    )
)
rowData(tenx_pbmc3k) <- cbind(
    rowData(tenx_pbmc3k),
    perFeatureQCMetrics(tenx_pbmc3k)
)

tenx_pbmc3k <- tenx_pbmc3k[, !colData(tenx_pbmc3k)$subsets_Mt_percent > 50]

libsize_drop <- isOutlier(tenx_pbmc3k$total,
    nmads = 3, type = "lower", log = TRUE
)
feature_drop <- isOutlier(tenx_pbmc3k$detected,
    nmads = 3, type = "lower", log = TRUE
)

tenx_pbmc3k <- tenx_pbmc3k[, !(libsize_drop | feature_drop)]
```

I filter any genes that have 0 count for all cells.

```{r filter-genes}
rm_ind <- calculateAverage(tenx_pbmc3k) < 0
tenx_pbmc3k <- tenx_pbmc3k[!rm_ind, ]
```

### Normalization

I normalize the data by using a simple library size normalization.

```{r norm, message=FALSE, warning=FALSE}
tenx_pbmc3k <- scater::logNormCounts(tenx_pbmc3k)
```

### Dimension reduction

I use both Principal Components Analysis (PCA) and Uniform Manifold
Approximation and Projection (UMAP) in order to obtain reduced dimension
representations of the data. Since there is a random component in the
UMAP, I will also set a seed.

```{r dim-red, message=FALSE, warning=FALSE}
tenx_pbmc3k <- runPCA(tenx_pbmc3k)
set.seed(10)
tenx_pbmc3k <- runUMAP(tenx_pbmc3k,
    dimred = "PCA", spread = 1,
    min_dist = 0.4
)
```

### Clustering

I will cluster the data on the PCA representation using Louvain
clustering.

```{r cluster}
snn_gr <- buildSNNGraph(tenx_pbmc3k, use.dimred = "PCA", k = 50)
clusters <- cluster_louvain(snn_gr)
tenx_pbmc3k$cluster <- factor(clusters$membership)
```

## Plotting single cell data

At this stage in the workflow we usually would like to plot aspects of
our data in one of the reduced dimension representations. Instead of
plotting this in an ordinary fashion, I will demonstrate how schex can
provide a better way of plotting this.

#### Calculate hexagon cell representation

First, I will calculate the hexagon cell representation for each cell
for a specified dimension reduction representation. I decide to use
`nbins=40` which specifies that I divide my x range into 40 bins. Note
that this might be a parameter that you want to play around with
depending on the number of cells/ nuclei in your dataset. Generally, for
more cells/nuclei, `nbins` should be increased.

```{r calc-hexbin}
tenx_pbmc3k <- make_hexbin(tenx_pbmc3k,
    nbins = 40,
    dimension_reduction = "UMAP", use_dims = c(1, 2)
)
```

#### Plot number of cells/nuclei in each hexagon cell

First I plot how many cells are in each hexagon cell. This should be
relatively even, otherwise change the `nbins` parameter in the previous
calculation.

```{r plot-density, fig.height=7, fig.width=7}
plot_hexbin_density(tenx_pbmc3k)
```

#### Plot meta data in hexagon cell representation

Next I colour the hexagon cells by some meta information, such as the
majority of cells cluster membership and the median total count in each
hexagon cell.

```{r plot-meta, fig.height=7, fig.width=7}
plot_hexbin_meta(tenx_pbmc3k, col = "cluster", action = "majority")
plot_hexbin_meta(tenx_pbmc3k, col = "total", action = "median")
```

While for plotting the cluster membership the outcome is not too
different from the classic plot, it is much easier to observe
differences in the total count.

```{r plot-meta-trad, fig.height=7, fig.width=7}
plotUMAP(tenx_pbmc3k, colour_by = "cluster")
plotUMAP(tenx_pbmc3k, colour_by = "total")
```

For convenience there is also a function that allows the calculation of
label positions for factor variables. These can be overlayed with the
package `ggrepel`.

```{r plot-meta-label, message=FALSE, fig.height=7, fig.width=7}
label_df <- make_hexbin_label(tenx_pbmc3k, col = "cluster")
pp <- plot_hexbin_meta(tenx_pbmc3k, col = "cluster", action = "majority")
pp + ggrepel::geom_label_repel(
    data = label_df, aes(x = x, y = y, label = label),
    colour = "black", label.size = NA, fill = NA
)
```

#### Plot gene expression in hexagon cell representation

Finally, I will visualize the gene expression of the POMGNT1 gene in the
hexagon cell representation.

```{r plot-gene, fig.height=7, fig.width=7}
gene_id <- "POMGNT1"
plot_hexbin_feature(tenx_pbmc3k,
    type = "logcounts", feature = gene_id,
    action = "mean", xlab = "UMAP1", ylab = "UMAP2",
    title = paste0("Mean of ", gene_id)
)
```

Again it is much easier to observe differences in gene expression using
the hexagon cell representation than the classic representation.

```{r plot-gene-trad, fig.height=7, fig.width=7}
plotUMAP(tenx_pbmc3k, by_exprs_values = "logcounts", colour_by = gene_id)
```

We can overlay the gene expression data with the clusters for
convenience.

```{r, message=FALSE, fig.height=7, fig.width=7}
plot_hexbin_feature_plus(tenx_pbmc3k,
    col = "cluster", type = "logcounts",
    feature = "POMGNT1", action = "mean"
)
```

### Understanding `schex` output as `ggplot` objects

The `schex` packages renders ordinary `ggplot` objects and thus these
can be treated and manipulated using the [`ggplot`
grammar](https://ggplot2.tidyverse.org/). For example the non-data
components of the plots can be changed using the function `theme`.

```{r}
gene_id <- "CD19"
gg <- schex::plot_hexbin_feature(tenx_pbmc3k,
    type = "logcounts", feature = gene_id,
    action = "mean", xlab = "UMAP1", ylab = "UMAP2",
    title = paste0("Mean of ", gene_id)
)
gg + theme_void()
```

The fact that `schex` renders `ggplot` objects can also be used to save
these plots. Simply use `ggsave` in order to save any created plot.

```{r, eval=FALSE}
ggsave(gg, file = "schex_plot.pdf")
```

To find the details of the session for reproducibility, use this:

```{r}
sessionInfo()
```