---
title: "Introducing mRNA bias into simulations with scaling factors"
author: "Alexander Dietrich"
date: "11/29/2021"
bibliography: references.bib
biblio-style: apalike
link-citations: yes
colorlinks: yes
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introducing mRNA bias into simulations with scaling factors}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Using scaling factors 

This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell-types contain different amounts of mRNA (Dendritic cells for examples contain much more than Neutrophils); this bias can be added into simulations artificially in different ways. \
The scaling factors will always be applied on the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells. 

```{r}
# Example data
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_", rep(1:300))
colnames(tpm) <- paste0("cell_", rep(1:300))
rownames(counts) <- paste0("gene_", rep(1:1000))
rownames(tpm) <- paste0("gene_", rep(1:1000))
annotation <- data.frame(
  "ID" = paste0("cell_", rep(1:300)),
  "cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)),
  "spikes" = stats::runif(300),
  "add_1" = stats::runif(300),
  "add_2" = stats::runif(300)
)
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset"
)
```

## Pre-defined scaling factors

Some studies have proposed scaling factors for immune cells, such as EPIC [@Racle2017] or quanTIseq [@Finotello2019],
deconvolution tools which correct for the mRNA bias internally using these values:

```{r, format="markdown"}
epic <- data.frame(
  type = c(
    "B cells",
    "Macrophages",
    "Monocytes",
    "Neutrophils",
    "NK cells",
    "T cells",
    "T cells CD4",
    "T cells CD8",
    "T helper cells",
    "T regulatory cells",
    "otherCells",
    "default"
  ),
  mRNA = c(
    0.4016,
    1.4196,
    1.4196,
    0.1300,
    0.4396,
    0.3952,
    0.3952,
    0.3952,
    0.3952,
    0.3952,
    0.4000,
    0.4000
  )
)
epic
```

```{r, format="markdown"}
quantiseq <- data.frame(
  type = c(
    "B cells",
    "Macrophages",
    "MacrophagesM2",
    "Monocytes",
    "Neutrophils",
    "NK cells",
    "T cells CD4",
    "T cells CD8",
    "T regulatory cells",
    "Dendritic cells",
    "T cells"
  ),
  mRNA = c(
    65.66148,
    138.11520,
    119.35447,
    130.65455,
    27.73634,
    117.71584,
    63.87200,
    70.25659,
    72.55110,
    140.76091,
    68.89323
  )
)
quantiseq
```


When you want to apply one of these scaling factors into your simulation (therefore in-/decreasing the expression signals for the cell-types), we can use the `scaling_factor` parameter. Note, that these pre-defined scaling factors only offer values for a certain number of cell types, and your annotation in the provided dataset has to match these names 1:1. All cell types from your dataset which are not present in this scaling factor remain unscaled and a warning message will appear.

```{r}
sim_epic <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "epic",
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
```


We can also try out some custom scaling factors, for example increasing the expression levels for a single cell-type (T cells CD8) by 10-fold compared to the rest. All cell-types which are not mentioned in the named list given to `custom_scaling_vector` will be transformed with a scaling factor of 1, meaning nothing changes for them.

```{r}
sim_extreme_b <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "custom",
  custom_scaling_vector = c("T cells CD8" = 10),
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
```

**Important:** Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type.

## Dataset specific scaling factors

You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive. 

### Reads and genes

Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation. 

```{r}
sim_reads <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "read_number", # use number of reads as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
sim_genes <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
```

These options would also allow you to use other numerical measurements you have for the single cells as scaling factors, such as weight or size for example. Lets pretend, `add_1` and `add_2` are such measurements. With the `additional_cols` parameter, they can be added to the SimBu dataset and we can use them as scaling factor as well:

```{r, format="markdown"}
utils::head(annotation)
```

```{r}
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset",
  additional_cols = c("add_1", "add_2")
) # add columns to dataset
sim_genes <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "add_1", # use add_1 column as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
```



### Spike-ins

One other numerical measurement can be spike-ins. Usually the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, they can be stored in the dataset annotation using the `spike_in_col` parameter, where you indicate the name of the column from the `annotation` dataframe in which the spike-in information is stored. To calculate a scaling factor from this, the number of reads are also necessary, so we will add this information as well (as above using the `read_number_col` parameter).

The scaling factor with spike-ins is calculated as the "% of reads NOT mapped to spike-in reads", or: `(n_reads - n_spike_in)/n_reads` for each cell. We apply it like this:

```{r, eval=FALSE}
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset",
  spike_in_col = "spikes"
) # give the name in the annotation file, that contains spike-in information
sim_spike <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "spike_in", # use spike-in scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
```


## Census - estimate mRNA counts per cell

[Census](http://cole-trapnell-lab.github.io/monocle-release/docs/#census) is an 
approach which tries to convert TPM counts into relative transcript counts. This 
basically means, you get the mRNA counts per cell, which can differ between cell-types. \
[@Qiu2017] state in their paper, that it should only be applied to TPM/FPKM normalized
data, but I tried it out with raw expression counts as well, which worked as well. \
Census calculates a vector with a scaling value for each cell in a sample. 
You can switch this feature on, by setting the `scaling_factor` parameter to `census`.

```{r}
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  tpm_matrix = tpm,
  name = "test_dataset"
)
sim_census <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "census", # use census scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
```


In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor.

```{r}
utils::sessionInfo()
```


# References