---
title: "Roadmap to prepare the input matrix for `scTensor`"
author:
- name: Koki Tsuyuzaki
  affiliation: Laboratory for Bioinformatics Research,
    RIKEN Center for Biosystems Dynamics Reseach
- name: Manabu Ishii
  affiliation: Laboratory for Bioinformatics Research,
    RIKEN Center for Biosystems Dynamics Reseach
- name: Itoshi Nikaido
  affiliation: Laboratory for Bioinformatics Research,
    RIKEN Center for Biosystems Dynamics Reseach
  email: k.t.the-answer@hotmail.co.jp
package: scTensor
output:
  BiocStyle::html_document
vignette: |
  %\VignetteIndexEntry{scTensor: 1. Data format and ID conversion}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction
We explain the way to create a matrix, in which the row names are **NCBI Gene ID (ENTREZID)**, for specifying an input of `r Biocpkg("scTensor")`.
Typical `r Biocpkg("scTensor")` workflow can be described like below.

```{r scTensor with NCBI Gene ID, eval=FALSE}
library("scTensor")
library("LRBase.XXX.eg.db") # e.g. LRBase.Hsa.eg.db
# Input matrix
input <- ...
sce <- SingleCellExperiment(assays=list(counts = input))
# Color vector
color <- ...
# Celltype vector
label <- ...
cellCellSetting(sce, LRBase.XXX.eg.db, color, label)
```

In `r Biocpkg("scTensor")`, the row names of the input matrix is limited only NCBI Gene ID to cooperate with the other R packages (cf. data("Germline")). Since the user has many different types of the data matrix, here we introduce some situations and the way to convert the row names of the user's matrix as NCBI Gene ID.

# Step.1 : Create gene-level expression matrix

First of all, we have to prepare the expression matrix (gene $\times$ cell).
There are many types of single-cell RNA-Seq (scRNA-Seq) technologies and the situation will be changed by the used experimental methods and quantification tools described below.

## Case I: Gene-level quantification
In Plate-based scRNA-Seq experiment (i.e. Smart-Seq2, Quart-Seq2, CEL-Seq2, MARS-Seq,...etc), FASTQ file is generated in each cell. After the mapping of reads in each FASTQ file to reference genome, the same number of BAM files will be generated.
By using some quantification tools such as [featureCounts](http://bioinf.wehi.edu.au/featureCounts/), or [HTSeq-count](https://htseq.readthedocs.io/en/release_0.11.1/count.html), user can get the expression matrix and used as the input of `r Biocpkg("scTensor")`. These tools simply count the number of reads in union-exon in each gene.
However, these tools do not take "multimap" of not unique reads into consideration and the quantification is not accurate.
Therefore, we recommend the transcript-level quantification and gene-level summarization explained below.

## Case II : Transcript-level quantification
Some quantification tools such as [RSEM](https://deweylab.github.io/RSEM/), [Sailfish](https://www.cs.cmu.edu/~ckingsf/software/sailfish/), [Salmon](https://combine-lab.github.io/salmon/), [Kallisto](https://pachterlab.github.io/kallisto/), and [StringTie](http://ccb.jhu.edu/software/stringtie/index.shtml) use the reference transcriptome instead of genome, and quantify the expression level in each transcript. After the quantification, the transcript-level expression can be summarized to gene-level by using `summarizeToGene` function of `r Biocpkg("tximport")`. [The paper of tximport](https://f1000research.com/articles/4-1521) showed that the gene-level summalization is more accurate than featureCounts.

Note that if you use the reference transcriptome of [GENCODE](https://www.gencodegenes.org/human/stats.html), this step becomes slightly complicated. Although the number of transcripts of GENCODE and that of [Ensembl](https://ensembl.org/Homo_sapiens/Info/Annotation) is almost the same,
and actually, most of the transcript is duplicated in these two databases,
the gene identifier used in GENCODE looks complicated like "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|".
In such case, firstly only Ensembl Transcript ID should be extracted (e.g. ENST00000456328.2), removed the version (e.g. ENST00000456328), summarized to Ensembl Gene ID by tximport (e.g. ENSG00000223972), and then converted to NCBI Gene ID (e.g. 100287102).

## Case III: UMI-count
In the droplet-based scRNA-Seq experiment (i.e. Drop-Seq, inDrop RNA-Seq, 10X Chromium), unique molecular identifier (UMI) is introduced for avoiding the bias of PCR amplification, and after multiplexing by cellular barcode, digital gene expression (DGE) matrix is generated by counting the number of types of UMI mapped in each gene.

When user perform Drop-seq, [Drop-Seq tool](https://github.com/broadinstitute/Drop-seq) can generate the DGE matrix.

Another tool [Alevin](https://salmon.readthedocs.io/en/latest/alevin.html), which is a subcommand of Salmon is also applicable to Drop-seq data. In such case [tximport] with option "type = 'alevin'" can import the result of Alevin into R and summarize the DGE matrix.

When the user performs 10X Chromium, using [Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) developed by 10X Genomics is straightforward.

Although Cell Ranger is implemented by Python, starting from the files generated by Cell Ranger (e.g. filtered_gene_bc_matrices/{hg19,mm10}/{barcodes.tsv,genes.tsv,matrix.mtx}), `r CRANpkg("Seurat")` can import these files to an R object.

For example, according to the tutorial of Seurat ([Seurat - Guided Clustering Tutorial](https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html)), PBMC data of Cell Ranger can be imported by `Read10X` function and DGE matrix of UMI-count is available by the output of `CreateSeuratObject` function.

```{r Seurat, eval=FALSE}
library("Seurat")

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data,
  project = "pbmc3k", min.cells = 3, min.features = 200)
```

**Note that the matrix is formatted as a sparse matrix of `r CRANpkg("Matrix")` package (MM: Matrix market), but the `r Biocpkg("scTensor")` assumes dense matrix.**
By using `as.matrix` function,
the sparse matrix is easily converted to dense matrix as follows.

```
# Sparse matrix to dense matrix
for_sc <- as.matrix(pbmc.data)
```

# Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)

Even after creating the gene-level expression matrix in Step.1,
many kinds of gene-level gene identifiers can be assigned as row names of the matrix such as Ensembl Gene ID, RefSeq, or Gene Symbol.
Again, only NCBI Gene ID can be used as row names of input matrix of `r Biocpkg("scTensor")`.
To do such a task, we originally implemented a function `convertToNCBIGeneID`.
The only user has to prepare for using this function is the 1. input matrix (or data.frame) filled with only numbers, 2. current gene-level gene identifier in each row of the input matrix, and 3. corresponding table containing current gene-level gene identifier (left) and corresponding NCBI Gene ID (right).
The usage of this function is explained below.

## Case I: Ensembl Gene ID to NCBI Gene ID
In addition to 1. and 2., the user has to prepare the 3. corresponding table.
Here we introduce two approaches to assign user's Ensembl Gene ID to NCBI Gene ID.
First approarch is using [Organism DB](https://bioconductor.org/packages/release/BiocViews.html#___OrganismDb) packages such as `r Biocpkg("Homo.sapiens")`, `r Biocpkg("Mus.musculus")`, and `r Biocpkg("Rattus.norvegicus")`.

Using the `select` function of Organism DB, the corresponding table can be retrieved like below.

```{r Ensembl with Organism DB, echo=TRUE}
suppressPackageStartupMessages(library("scTensor"))
suppressPackageStartupMessages(library("Homo.sapiens"))

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("ENSG00000204531", "ENSG00000181449",
  "ENSG00000136997", "ENSG00000136826")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)
# ID conversion
(input <- convertToNCBIGeneID(input, rowID, LefttoRight))
```

Second approarch is using `r Biocpkg("AnnotationHub")` package.

Although only three Organism DB packages are explicitly developed,
even if the data is generated from other species (e.g. Zebrafish, Arabidopsis thaliana),
similar database is also available from `r Biocpkg("AnnotationHub")`,
and `select` function can be performed like below.

```{r Ensembl with AnnotationHub, echo=TRUE}
suppressPackageStartupMessages(library("AnnotationHub"))

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
LefttoRight <- select(hs,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)
(input <- convertToNCBIGeneID(input, rowID, LefttoRight))
```

## Case II: Gene Symbol to NCBI Gene ID
When using cellranger or `r CRANpkg("Seurat")` to quantify UMI-count (cf. Step1, Case III),
the row names of input matrix might be Gene Symbol,
and have to be converted to NCBI Gene ID.
As well as the Case I described above,
[Organism DB](https://bioconductor.org/packages/release/BiocViews.html#___OrganismDb)
and `r Biocpkg("AnnotationHub")` will support such a task like below.

```{r Gene Symbol with Organism DB, echo=TRUE}
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("POU5F1", "SOX2", "MYC", "KLF4")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)
# ID conversion
(input <- convertToNCBIGeneID(input, rowID, LefttoRight))
```

```{r Gene Symbol with AnnotationHub, echo=TRUE}
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
LefttoRight <- select(hs,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)
(input <- convertToNCBIGeneID(input, rowID, LefttoRight))
```

# Step.3: Normalize the count matrix

Finally, we introduce some situations to perform some normalization methods of gene expression matrix.

If a user convert a Seurat object to a SingleCellExperient object by using `as.SingleCellExperiment`,
the result of `NormalizeData` function (logcounts) is inherited to the SingleCellExperient object as follows;

```{r Seurat normalization, eval=FALSE}
pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize",
    scale.factor = 10000)
sce <- as.SingleCellExperiment(pbmc2)
assayNames(sce) # counts, logcounts
```

If the user want to use `r Biocpkg("scater")` package,
`calculateCPM` or `normalize` function can calculate the normalized expression values as follows; (see also [the vignette of scater](https://bioconductor.org/packages/release/bioc/vignettes/scater/inst/doc/vignette-intro.html)).

```{r Scater normalization, eval=FALSE}
library("scater")
sce <- SingleCellExperiment(assays=list(counts = input))
cpm(sce) <- calculateCPM(sce)
sce <- normalize(sce)
assayNames(sce) # counts, normcounts, logcounts, cpm
```

Actually, any original normalization can be stored in the sce.
For example, we can calculate the value of count per median (CPMED) as follows;

```{r Original normalization, eval=FALSE}
# User's Original Normalization Function
CPMED <- function(input){
    libsize <- colSums(input)
    median(libsize) * t(t(input) / libsize)
}
# Normalization
normcounts(sce) <- log10(CPMED(counts(sce)) + 1)
```

We recommend using the normcounts slot to save such original normalization values.
After the normalization, such values can be specified by `assayNames` option in `cellCellRanks` `cellCellDecomp` and `cellCellReport` functions.

# Session information {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```