---
title: "An introduction to the SingleCellAlleleExperiment class"
package: SingleCellAlleleExperiment
author:
- name: Jonas Schuck
  affiliation: 
  - Goethe University, University Hospital Frankfurt, Neurological Institute (Edinger Institute), Frankfurt/Main
  - University Cancer Center, Frankfurt/Main
  - Frankfurt Cancer Institute, Frankfurt/Main
  email: schuck@med.uni-frankfurt.de
- name: Ahmad Al Ajami
  affiliation: 
  - Goethe University, University Hospital Frankfurt, Neurological Institute (Edinger Institute), Frankfurt/Main
  - University Cancer Center, Frankfurt/Main
  - Frankfurt Cancer Institute, Frankfurt/Main
  email: alajami@med.uni-frankfurt.de
- name: Federico Marini
  affiliation: 
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  - Research Center for Immunotherapy (FZI), Mainz
  email: marinif@uni-mainz.de
- name: Katharina Imkeller
  affiliation: 
  - Goethe University, University Hospital Frankfurt, Neurological Institute (Edinger Institute), Frankfurt/Main
  - University Cancer Center, Frankfurt/Main
  - Frankfurt Cancer Institute, Frankfurt/Main
  email: imkeller@med.uni-frankfurt.de
output: 
   BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{An introduction to the SingleCellAlleleExperiment class}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r options, include=FALSE, echo=FALSE}
library(BiocStyle)
knitr::opts_chunk$set(crop=NULL)
```

# Installation

From Bioconductor:

```{r, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("SingleCellAlleleExperiment")
```

# Introduction to the workflow

## Biological background and motivation

Immune molecules such as B and T cell receptors, human leukocyte antigens (HLAs) or killer Ig-like receptors (KIRs) are encoded in the genetically most diverse loci of the human genome. Many of these immune genes are hyperpolymorphic, showing high allelic diversity across human populations. In addition, typical immune molecules are polygenic, which means that multiple functionally similar genes encode the same protein subunit. 

However, interactive single-cell methods commonly used to analyze immune cells in large patient cohorts do not consider this. This leads to erroneous quantification of important immune mediators and impaired inter-donor comparability. 

## Workflow for unraveling the immunogenetic diversity in scData

We have developed a workflow that allows quantification of expression and interactive exploration of donor-specific alleles of different immune genes. The workflow is divided into two software packages and one additional data package: 

1. The `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` software package consist of a *[Snakemake](https://snakemake.readthedocs.io/en/stable/)* workflow designed to automate and streamline the genotyping process for immune genes, focusing on key targets such as HLAs and KIRs, and enabling allele-specific quantification from single-cell RNA-sequencing (scRNA-seq) data using donor-specific references. For detailed information of the performed steps and how to utilize this workflow, please refer to its documentation here: `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")`.

2. To harness the full analytical potential of the results, we've developed a dedicated `R` package, `SingleCellAlleleExperiment` presented in this repository. This package provides a comprehensive multi-layer data structure, enabling the representation of immune genes at specific levels, including alleles, genes, and groups of functionally similar genes and thus, allows data analysis across these immunologically relevant, different layers of annotation.

3. The `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")` is an `R/ExperimentHub` data package providing datasets generated and processed by the `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` software package which can be used to explore the data and potential downstream analysis workflows using the here presented novel `SingleCellAlleleExperiment` data structure. Refer to `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")` for more information regarding the available datasets and source of raw data. 

This workflow is designed to support both **10x** and **BD Rhapsody** data, encompassing amplicon/targeted sequencing as well as whole-transcriptome-based data, providing flexibility to users working with different experimental setups.

<div style="float: left;">
![Workflow for scIGD](scIGD_SCAE_workflow_final.png)
**Figure 1:** Overview of the scIGD workflow for unraveling immunogenomic diversity in single-cell data, highlighting the integration of the SingleCellAlleleExperiment package for comprehensive data analysis.
<div style="float: left;">

# Introduction to the `SingleCellAlleleExperiment (SCAE)` class

The `SingleCellAlleleExperiment (SCAE)` class serves as a comprehensive multi-layer data structure, enabling the representation of immune genes at specific levels, including alleles, genes, and groups of functionally similar genes and thus, allows data analysis across these immunologically relevant, different layers of annotation. The implemented data object is derived from the `r BiocStyle::Biocpkg(pkg="SingleCellExperiment")` class and follows similar conventions, where rows should represent features (genes, transcripts) and columns should represent cells.

<br>

<div style="float: left;">
![A schematics of the SingleCellAlleleExperiment class](scae_advanced.png)
**Figure 2:** Scheme of SingleCellAlleleExperiment object structure with lookup table.
</div>


<br>

For the integration of the relevant additional data layers (see **Figure 2**), the quantification data for alleles, generated by the novel `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` software package, is aggregated into two additional data layers via an ontology-based design principle using a lookup table during object generation.

For example, the counts of the alleles `A*01:01:01:01` and `A*02:01:01:01` that are present in the raw input data will be combined into the `HLA-A` immune gene layer (see **Table 1** below). Next, all counts of immune genes corresponding to `HLA-class I` are combined into the `HLA-class I` functional class layer. See the structure of the used lookup table below.

<br>

**Table 1:** Scheme of the lookup table used to aggregate allele information into multiple data layers.

<div style="float: center;">

| Allele       | Gene       | Function    |
| :----------- | :--------- | :---------- |
| A*01:01:01   | HLA-A      | HLA class I |
| A*01:01:01   | HLA-A      | HLA class I |
| ...          | ...        | ...         |
| DRB1*01:01:01| HLA-DRB1   | HLA class II|
</div>

<br>

The resulting `SCAE` data object can be used in combination with established single cell analysis packages like `r BiocStyle::Biocpkg(pkg="scater")` and `r BiocStyle::Biocpkg(pkg="scran")` to perform downstream analysis on immune gene expression, allowing data exploration on functional and allele level. See the vignette for further information and insights on how to perform downstream analysis using exemplary data from the accompanying `R/Experimenthub` package `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")`.

## Expected input and dataset description

The read in function of the SCAE package `read_allele_counts()` expects specific files that are generated by the previous steps of the workflow, performed in the `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` package. One input parameter needs to state the path to a directory containing all the input files. The file identifiers can be
specifically stated as parameters. The default file identifiers, as they are outputted from **scIGD** are listed below:
The stated input directory should contain the following files:

- **cells_x_genes.barcodes.txt**  (list of barcodes/cell identifiers)
- **cells_x_genes.features.txt** (list of feature identifiers)
- **cells_x_genes_mtx.mtx**      (Contains the quantification matrix)
- **lookup_table.csv**            (Info for generating multiple data layers)
    
The dataset used for the here shown downstream analysis is taken from the accompanying data package `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")`. Specifically we are using the `pbmc_5k` dataset. Find more information about the data, including the source of the raw data here:  `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")`.

<hr>

# Exemplary downstream analysis 

## Loading suggested packages 

The following packages are abundant for performing the here stated downstream analysis and visualization. Make sure they are installed to use this vignette.

```{r, message = FALSE}
library(SingleCellAlleleExperiment)
library(scaeData)
library(scater)
library(scran)
library(scuttle)
library(DropletUtils)
library(ggplot2)
library(patchwork)
library(org.Hs.eg.db)
library(AnnotationDbi)
```

## Generation of a SingleCellAlleleExperiment object

Download and save the **"pbmc_10k"** dataset from the accompanying `ExperimentHub/scaeData` package.

```{r}
example_data_10k <- scaeData::scaeDataGet(dataset="pbmc_10k")
```


```{r}
example_data_10k
```

The return value of the `scaeDataGet()` function is a list containing four elements. The `$dir` slot contains the directory where the files downloaded from `ExperimentHub` are saved on your device. The remaining three slots `$barcodes`, `$features`, `$matrix` contain the corresponding file names (named by ExperimentHub). The list is then used to pass the saved information to the related parameters of the `read_allele_counts()` function of the here presented package. See next section.

For the usage of the corresponding lookup table, specify a directory containing the lookup table as seen in the next chunk. For the available example datasets in `scaeData`, you can choose one of the following names. This provides the lookup table to the corresponding dataset:

  - "pbmc_5k_lookup_table.csv" is used for the `pbmc_5k" dataset.
  - "pbmc_10k_lookup_table.csv" is used for the `pbmc_10k" dataset.
  - "pbmc_20k_lookup_table.csv" is used for the `pbmc_20k" dataset.

These are part of the `scaeData` package, but are not fetched from `ExperimentHub` but rather read in as internal files.

```{r}
lookup_name <- "pbmc_10k_lookup_table.csv"
lookup <- utils::read.csv(system.file("extdata", lookup_name, package="scaeData"))
```

```{r}
head(lookup)
```

After the directories of the data files and the lookup table are known and saved, proceed to generate a `SingleCellAlleleExperiment` object.

This is an essential step. The read in function `read_allele_counts()` is used to read in data and generate a `SCAE` object. The `SingleCellAlleleExperiment` Constructor is exported as well. However, as the raw data is processed during the read in, we recommend to use `read_allele_counts()`. Find information about the read in using the Constructor in its function documentation `?SingleCellAlleleExperiment`.

### Description of Read in parameters

If you used the `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` package to generate your input files **(RECOMMENDED AND EXPECTED)**, then state the path containing all expected files to the `sample_dir` parameter in the `read_allele_counts()` function and the corresponding lookup table. In case you renamed the files, specify the new file identifiers in the corresponding parameters `lookup_file`, `barcode_file`, `gene_file`, `matrix_file`, otherwise leave it to the stated default values.

The `read_allele_counts()` function also provides multiple parameters to perform filtering steps on the cells contained in your data. For this, **multiple parameter combinations are possible and showcased below.** Advanced filtering can be performed based on a [**knee plot**](https://liorpachter.wordpress.com/tag/knee-plot/).

The first parameter, where you state which **filter mode** you want to use, gives the following valid options: `filter_mode=c("yes", "no", "custom")`.

  - `filter_mode = "no"`: Default filter mode. Filtering based on **threshold=0**, filtering out columns (cells) with a count sum of 0 over all features.
  - `filter_mode = "yes"` : Advanced filtering is performed on the computed **inflection point of a knee plot** based on barcode ranks. This is an **optional** functionality.
  - `filter_mode = "custom"` : Custom filter mode. Filtering performed on a threshold stated in the `filter_threshold` parameter. (Useful for example after you looked at the knee plot after using filter_mode="yes" and decide to use a different filter threshold)
  
  
Additionally, the `verbose` parameter gives you an option to toggle runtime-information for the different steps during object generation. Messages regarding the filtering process (used thresholds) will not be toggled off if `verbose = FALSE`.

**Exemplary read in using all described filter modes generating a SCAE object are shown in the following code-chunks:**

<hr>

#### `filter_mode="no"`

Default filter mode. Filtering based on **threshold=0**, filtering out columns (cells) with a count sum of 0 over all features.

```{r eval=FALSE, warning=FALSE}
scae <- read_allele_counts(example_data_10k$dir,
                           sample_names="example_data",
                           filter_mode="no",
                           lookup_file=lookup,
                           barcode_file=example_data_10k$barcodes,
                           gene_file=example_data_10k$features,
                           matrix_file=example_data_10k$matrix,
                           filter_threshold=NULL,
                           verbose=FALSE)

```

<hr>

#### `filter_mode="yes"`

Advanced filtering is performed on the computed **inflection point of a knee plot** based on barcode ranks. This is an **optional** functionality. To be able to use this filter mode, the `r BiocStyle::Biocpkg(pkg="DropletUtils") ` package needs to be installed on your machine. 

```{r warning=FALSE}
scae <- read_allele_counts(example_data_10k$dir,
                           sample_names="example_data",
                           filter_mode="yes",
                           lookup_file=lookup,
                           barcode_file=example_data_10k$barcodes,
                           gene_file=example_data_10k$features,
                           matrix_file=example_data_10k$matrix,
                           log=TRUE)
scae
```

If this filter mode is chosen, the information to the corresponding plot including the inflection point used for filtering is stored in `metadata(scae)[["knee_info"]]`:

```{r}
metadata(scae)[["knee_info"]]
```

This can be used to plot the corresponding knee plot and potentially infer a new threshold for filtering (then used with **filter_mode="custom"**). Note that only using `filter_mode="yes"` computes the information for the knee plot. See corresponding code chunks for further information about the other filter modes.

```{r, warning=FALSE, fig.height=5, fig.width=5}

knee_df <- metadata(scae)[["knee_info"]]$knee_df
knee_point <- metadata(scae)[["knee_info"]]$knee_point
inflection_point <- metadata(scae)[["knee_info"]]$inflection_point

ggplot(knee_df, aes(x=rank, y=total)) +
       geom_point() +
       scale_x_log10() +
       scale_y_log10() +
       annotation_logticks() +
       labs(x="Barcode rank", y="Total UMI count") +
       geom_hline(yintercept=S4Vectors::metadata(knee_df)$knee,
                  color="dodgerblue", linetype="dashed") +
       geom_hline(yintercept=S4Vectors::metadata(knee_df)$inflection,
                  color="forestgreen", linetype="dashed") +
       annotate("text", x=2, y=S4Vectors::metadata(knee_df)$knee * 1.2,
                label="knee", color="dodgerblue") +
       annotate("text", x=2.25, y=S4Vectors::metadata(knee_df)$inflection * 1.2,
                label="inflection", color="forestgreen") +  theme_bw()

```

<hr>

#### `filter_mode="custom"`

Custom filter mode. Filtering performed on a threshold stated in the `filter_threshold` parameter.

```{r warning=FALSE, eval=FALSE}
#this is the object used in the further workflow
scae <- read_allele_counts(example_data_10k$dir,
                           sample_names="example_data",
                           filter_mode="custom",
                           lookup_file=lookup,
                           barcode_file=example_data_10k$barcodes,
                           gene_file=example_data_10k$features,
                           matrix_file=example_data_10k$matrix,
                           filter_threshold=282)
scae
```

<hr>

### Description of optional functionalities

Three optional functionalities can be toggled on or off during read in:

 - Optional filtering performed on the inflection point of a computed knee plot based on barcode ranks. Use `filter_mode="yes"` for this optional filter mode. Find more information about the different filter modes in the prior section.
 - Computation of a `logcounts` assay using Library Factors. Use `log=TRUE/FALSE` to toggle on or off accordingly. **Turned on by default**. If you want to compute the logcounts using a different method, be aware that the count matrix is extended (adding additional rows) during the object generation, **its abundant to compute scaling factors only on data layers present in the raw data (non-immune and alleles)**. Otherwise the scaling factors would not be correct. Find information about subsetting the object in sections further below.
 - Computation of additional gene symbols in case the raw data only contains Ensembl gene identifiers. Use `gene_symbols=TRUE/FALSE` to toggle on or off accordingly. **Turned off by default**. Uses the org.HS package.

<hr>

## Showcasing content of object slots

### RowData

Two new classification columns are introduced in the `rowData` slot. Namely the `NI_I` column (classification of each row as `NI = non_immune` or `I = immune`) and `Quant_type` column (classification of each row to which data layer it is corresponding to). Both columns are used jointly to identify each row of the object to its corresponding data layer (see **figure 1**).

```{r}
rowData(scae)
```

### ColData

Contains sample and barcode information. If the `logcounts` assay is computed, find another column containing the `sizeFactors` here.

```{r}
head(colData(scae))
```

### Metadata

Only contains information if `filter_mode="yes"` is used.

```{r}
metadata(scae)[["knee_info"]]
```

<hr>

### Utilize layer specific `getter-functions()`

On top of the established `getters` from the SCE package, new getters are implemented to retrieve the different data layers integrated in the SCAE object.


#### Non-immune genes

```{r}
scae_nonimmune_subset <- scae_subset(scae, "nonimmune")

head(rownames(scae_nonimmune_subset))
```

<br>

#### Alleles

```{r}
scae_alleles_subset <- scae_subset(scae, "alleles")

head(rownames(scae_alleles_subset))
```

<br>

#### Immune genes

```{r}
scae_immune_genes_subset <- scae_subset(scae, "immune_genes")

head(rownames(scae_immune_genes_subset))
```

<br>

#### Functional gene groups

```{r}
scae_functional_groups_subset <- scae_subset(scae, "functional_groups")

head(rownames(scae_functional_groups_subset))
```

<br>

<hr>

## Expression evaluation

Checking the expression for the allele-layer, immune gene layer and functional class layer. Allele identifiers are in the form of `A*02:01:01:01`. The immune genes are in the form of `HLA-A` and the functional classes `HLA_class_I`. Here we see that `HLA_class_I` and `HLA-C` are the most abundant functional class and immune gene respectively, given the underlying dataset.

```{r}
scae_immune_layers_subset <- c(rownames(scae_subset(scae, "alleles")),
                               rownames(scae_subset(scae, "immune_genes")),
                               rownames(scae_subset(scae, "functional_groups")))

scater::plotExpression(scae, scae_immune_layers_subset)
```


<br>

# Downstream analysis on immune genes

In the following sections, main steps for dimensional reduction are performed, offering insights into the different data layers of the SCAE object as well as giving an idea on how to perform immune gene expression analysis. 

## Subsetting the different layers

The non-immune genes are combined with each of the integrated immune gene allele-aware layers to determine three different subsets.

### Non-immune genes + alleles

```{r}
rn_scae_ni <- rownames(scae_subset(scae, "nonimmune"))
rn_scae_alleles <- rownames(scae_subset(scae, "alleles"))

scae_nonimmune__allels_subset <- scae[c(rn_scae_ni, rn_scae_alleles), ]

scae_nonimmune__allels_subset
```

<br>

### Non-immune genes + immune genes

```{r}
rn_scae_i <- rownames(scae_subset(scae, "immune_genes"))

scae_nonimmune__immune <- scae[c(rn_scae_ni, rn_scae_i), ]

scae_nonimmune__immune
```

<br>

### Non-immune genes + functional class

```{r}
rn_scae_functional <- rownames(scae_subset(scae, "functional_groups"))

scae_nonimmune__functional <- scae[c(rn_scae_ni, rn_scae_functional), ]

scae_nonimmune__functional
```


## Dimensional Reduction


### Model variance and HVGs for all data layers

Using the `modelGeneVar()` function prior to `getTopHVGs`. Both functions are part from the `Biocpkg("scran")` package. Compute a list of HVGs for each data layer. Return the top 0.1 % HVGs per layer using `getTopHVGs`.

```{r}
df_ni_a <- modelGeneVar(scae_nonimmune__allels_subset)
top_ni_a <- getTopHVGs(df_ni_a, prop=0.1)
```

```{r}
df_ni_g <- modelGeneVar(scae_nonimmune__immune)
top_ni_g <- getTopHVGs(df_ni_g, prop=0.1)
```

```{r}
df_ni_f <- modelGeneVar(scae_nonimmune__functional)
top_ni_f <- getTopHVGs(df_ni_f, prop=0.1)
```

### PCA

Compute PCA for each layer and store the results in the object. Its Important to make unique identifiers for each layer/run or the results will be overwritten and just saved as `PCA`. Here, the `runPCA` functions from the `Biocpkg("scater")` package is used.

```{r}
assay <- "logcounts"

scae <- runPCA(scae, ncomponents=10, subset_row=top_ni_a,
               exprs_values=assay, name="PCA_a")

scae <- runPCA(scae, ncomponents=10, subset_row=top_ni_g,
               exprs_values=assay, name="PCA_g")

scae <- runPCA(scae, ncomponents=10, subset_row=top_ni_f,
               exprs_values=assay, name="PCA_f")
```

```{r}
reducedDimNames(scae)
```

### t-SNE

The same goes for running t-SNE with the `runTSNE` function from the `Biocpkg("scater")` package. Unique identifiers are stated here for each layer as well. For simplicity, we only compute the t-SNE on the `gene layer`. Information regarding how this process is conducted on the other additional layers, can be seen in the non evaluated code chunks below:

```{r}
set.seed(18)
scae <- runTSNE(scae, dimred="PCA_g",  name="TSNE_g")
```

The following two chunks show how the t-SNE could be computed on the `allele` and `functional_group` layer. These chunks are not run by default:

```{r, eval=FALSE}
set.seed(18)
scae <- runTSNE(scae, dimred="PCA_a",  name="TSNE_a")
```

```{r, eval=FALSE}
set.seed(18)
scae <- runTSNE(scae, dimred="PCA_f",  name="TSNE_f")
```

List of results from the performed reduced dimension analysis. 

```{r}
reducedDimNames(scae)
```


## Visualization of results

Exemplary visualization for the t-SNE results on gene level for immune genes that relate to HLA-class I. In the given dataset, it is the immune genes `HLA-DRB1`, plotted alongside their alleles. This allows for insights into potential genetic differences shown on allele-level.


<hr>

### HLA-A immune gene and alleles

```{r fig3, fig.height = 4, fig.width = 12, fig.align = "center", warning = FALSE, message=FALSE}
tsne <- "TSNE_g"

tsne_g_a  <- plotReducedDim(scae, dimred=tsne, colour_by="HLA-DRB1") +
             ggtitle("HLA-DRB1 gene")

tsne_g_a1 <- plotReducedDim(scae, dimred=tsne, colour_by="DRB1*07:01:01:01") + 
             ggtitle("Allele DRB1*07:01:01:01")

tsne_g_a2 <- plotReducedDim(scae, dimred=tsne, colour_by="DRB1*13:01:01:01") + 
             ggtitle("Allele DRB1*13:01:01:01")

p2 <- tsne_g_a + tsne_g_a1 + tsne_g_a2
p2
```

<hr>

# Additional

As the SCAE object is extending the SCE object, it is also compatible with the `Biocpkg("iSEE")` package for interactive data exploration.

# Session Information

```{r}
sessionInfo()
```