# 1. Introduction

Alternative splicing represents an additional and under-appreciated layer of complexity underlying gene expression profiles. More recently, technological advances in library preparation methodologies enabled capturing and amplification of full-length cDNAs from single cells. Thus, paving the way for splicing analysis at single-cell resolution.

Nevertheless, single-cell splicing analysis comes with its own set of challenges including, but not limited to, low coverage of lowly-expressed genes, low capture rate of cDNA molecules, and amplification bias (Wen et al., 2020). To date, there remains a paucity of peer-reviewed softwares available for single-cell splicing analysis. Notable examples are BRIE (Huang & Sanguinetti, 2017) and Expedition (Song et al., 2017).

Here, we introduce MARVEL (Modality Assessment to ReVEaL alternative splicing dynamics at single-cell resolution). MARVEL includes features that complement existing softwares in order to more comprehensively describe and reveal splicing dynamics at single-cell resolution.

The following is a quick comparison against two other leading published softwares for single-cell alternative splicing analysis:

* BRIE incorporates sequence features to model PSI values. These features were identified and trained on human and mouse data only.

## Objectives

The main objectives of MARVEL are

(1) Compute PSI values for all five main exon-level splicing events, i.e. skipped-exon (SE), mutually-exclusive exons (MXE), retained-intron (RI), alternative 5' splice site (A5SS), and alternative 3' splice site (A3SS).

(2) Stratify PSI distribution for each splicing event into the five main modalities, i.e. included, excluded, bimodal, middle, and multimodal. Further stratify included and excluded into primary and dispersed sub-modalities.

(3) Perform differential splicing analysis and identify network of genes which are coordinately spliced.

(4) Integrate both splicing and gene expression data to compare and contrast splicing and gene expression profiles.

The main aims of this vignette is to highlight the principles and technicalities behind MARVEL and to show case the main functionalities of MARVEL. We also provide results from our benchmarking exercise under the Appendix section.

We refer our prospective users to the comprehensive tutorial on using MARVEL to extract biological insights from single-cell alternative splicing analysis here: https://wenweixiong.github.io/MARVEL.html

## Example dataset

In the examples that follow, we will use published single induced pluripotent stem cells (iPSC) and endoderm cells. Single cells were prepared using Smart-seq2 and sequenced on HiSeq2000 on 125-bp paired-end (PE) mode (Linker et al., 2018).

## Installing the package

Install package from Github

library(devtools)
install_github("wenweixiong/MARVEL")


Or install package from CRAN

install.packages("MARVEL")


# Load package
library(MARVEL)


# 2. Creating a Marvel object

MARVEL uses S3 object-oriented system (OOS) to allow convenient data manipulation by users. Therefore, the first step is to create the S3 object. We will use CreateMarvelObject to create our S3 object. Arguments required are:
(1) SplicePheno Sample metadata. Mandatory columns are sample.id and cell.type.
(2) SpliceJunction Splice junction counts.
(3) SpliceFeature Splicing event metadata. Each element in the list represents a data frame corresponding to a specific splicing event type. Mandatory columns are tran_id and gene_id. Names of each element in the list should reflect the splicing event type, i.e. SE, MXE, RI, A5SS, and A3SS.

It is noteworthy that MARVEL is agnostic with regards to splice junction and splicing event detection. Hence, users have the freedom to use their preferred softwares to detect splice junctions and splicing events. Here, splice junction counts and splicing events were generated using STAR (Dobin et al., 2013) and Stringtie2-rMATS (Kovaka et al., 2019; Shen et al., 2014), respectively.

Additionally, it is highly recommended to include gene expression data (normalised and log-transformed) as MARVEL has useful functionalities to compare and contrast gene expression profiles.

# Read splicing files
path_to_file <- system.file("extdata", "SE_phenoData.txt", package="MARVEL")

# Subset samples that passed sequencing QC
df.pheno <- df.pheno[which(df.pheno$qc.seq=="pass"), ] df.pheno[1:5,]  ## sample.id cell.type sample.type qc.seq ## 1 ERR1562083 Unknown Single Cell pass ## 2 ERR1562084 iPSC Single Cell pass ## 3 ERR1562085 iPSC Single Cell pass ## 4 ERR1562086 iPSC Single Cell pass ## 5 ERR1562087 iPSC Single Cell pass   # Splice junction file path_to_file <- system.file("extdata", "SJ.txt", package="MARVEL") sj <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA") sj[1:5,1:5]  ## coord.intron ERR1562083 ERR1562084 ERR1562085 ERR1562086 ## 1 chr7:39696860:39697407 1 0 0 0 ## 2 chr19:52190175:52200273 0 0 0 0 ## 3 chr15:24962210:24967028 30 14 26 30 ## 4 chr1:36224506:36224970 2 0 0 0 ## 5 chr10:78037305:78037438 14 29 28 14   # Splicing event metadata df.feature.list <- list() path_to_file <- system.file("extdata", "SE_featureData.txt", package="MARVEL") df.feature.list[[1]] <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA") names(df.feature.list) <- "SE" df.feature.list[["SE"]][1:5,]  ## tran_id ## 1 chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120 ## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034 ## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082 ## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082 ## 5 chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484 ## gene_id gene_short_name gene_type ## 1 ENSG00000006451.8 RALA protein_coding ## 2 ENSG00000105568.18 PPP2R1A protein_coding ## 3 ENSG00000128739.22 SNRPN protein_coding ## 4 ENSG00000128739.22 SNRPN protein_coding ## 5 ENSG00000054118.15 THRAP3 protein_coding  # Read gene files # featureData path_to_file <- system.file("extdata", "TPM_featureData.txt", package="MARVEL") df.tpm.feature <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA") df.tpm.feature[1:5,]  ## gene_id gene_short_name gene_type ## 1 ENSG00000004864.13 SLC25A13 protein_coding ## 2 ENSG00000006451.8 RALA protein_coding ## 3 ENSG00000052802.13 MSMO1 protein_coding ## 4 ENSG00000054118.15 THRAP3 protein_coding ## 5 ENSG00000078140.14 UBE2K protein_coding   # phenoData path_to_file <- system.file("extdata", "SE_phenoData.txt", package="MARVEL") df.tpm.pheno <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA") # Normalised expression matrix path_to_file <- system.file("extdata", "TPM.txt", package="MARVEL") df.tpm <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA") # Log-transform values df.tpm[,-1] <- log2(df.tpm[,-1]) df.tpm[,-1][df.tpm[,-1] < 1] <- 0 df.tpm[1:5,1:5]  ## gene_id ERR1562083 ERR1562084 ERR1562085 ERR1562086 ## 1 ENSG00000004864.13 5.066520 3.795975 4.614710 5.044831 ## 2 ENSG00000006451.8 7.526147 5.734710 5.250962 7.148121 ## 3 ENSG00000052802.13 8.577542 6.409730 6.644145 10.211183 ## 4 ENSG00000054118.15 5.418527 5.681449 6.637494 6.673698 ## 5 ENSG00000078140.14 7.924040 7.545042 7.702796 6.804905   # Subset samples that passed sequencing QC df.tpm.pheno <- df.tpm.pheno[which(df.pheno$qc.seq=="pass"), ]
df.tpm.pheno[1:5,]

##    sample.id cell.type sample.type qc.seq
## 1 ERR1562083   Unknown Single Cell   pass
## 2 ERR1562084      iPSC Single Cell   pass
## 3 ERR1562085      iPSC Single Cell   pass
## 4 ERR1562086      iPSC Single Cell   pass
## 5 ERR1562087      iPSC Single Cell   pass

(2) PSI$SE PSI matrix where rows represent splicing events and columns represent samples. # 4. Pre-computed PSI Users may also use their own splicing events and PSI values. For example, using pre-validated splicing events and pre-computed PSI values from external softwares or that of previously generated using MARVEL. # Read sample metadata path_to_file <- system.file("extdata", "SE_phenoData.txt", package="MARVEL") df.pheno <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA") # Subset samples that passed sequencing QC df.pheno <- df.pheno[which(df.pheno$qc.seq=="pass"), ]
df.pheno[1:5,]

##    sample.id cell.type sample.type qc.seq
## 1 ERR1562083   Unknown Single Cell   pass
## 2 ERR1562084      iPSC Single Cell   pass
## 3 ERR1562085      iPSC Single Cell   pass
## 4 ERR1562086      iPSC Single Cell   pass
## 5 ERR1562087      iPSC Single Cell   pass

# Read pre-validated splicing event metadata
df.feature.list <- list()
path_to_file <- system.file("extdata", "SE_featureDataValidated.txt", package="MARVEL")
names(df.feature.list) <- "SE"
df.feature.list[["SE"]][1:5, ]

##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##              gene_id gene_short_name      gene_type
## 1  ENSG00000006451.8            RALA protein_coding
## 2 ENSG00000105568.18         PPP2R1A protein_coding
## 3 ENSG00000128739.22           SNRPN protein_coding
## 4 ENSG00000128739.22           SNRPN protein_coding
## 5 ENSG00000054118.15          THRAP3 protein_coding

# Read PSI file (pre-computed)
df.list <- list()
path_to_file <- system.file("extdata", "SE.txt", package="MARVEL")
names(df.list) <- "SE"
df.list[["SE"]][1:5,1:5]

##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##    ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 0.002544529 0.00000000 0.00000000 0.00000000
## 2 0.000000000 0.00000000 0.00000000 0.00000000
## 3 0.115254237 0.03184713 0.17868339 0.06486486
## 4 0.059459459 0.02250804 0.09027778 0.06989247
## 5 0.045454545 0.00000000         NA 0.00000000

# Create Marvel object
marvel.temp <- CreateMarvelObject(
SpliceFeatureValidated=df.feature.list,  # Validated splicing event metadata
PSI=df.list,                             # Pre-computed PSI matrices
Exp=df.tpm                               # Gene expression matrix
)


# 5. Assign modalities

## Principles

Percent spliced-in (PSI) for a splicing event take values between 0 and 1. An average PSI of 1 means that majority of the cells express the included isoform. Conversely, an average PSI of 0 means that majority of the cells express the excluded isoform. Finally, an average PSI of 0.5 means that cells overall express both included and excluded isoforms in roughly equal proportions (Figure 2).

Based on the PSI distribution, each splicing event can be categorized into 5 modalities: Included, excluded, bimodal, middle, multimodal. To this end, MARVEL uses fitdistr function from fitdistrplus R package to determine the modality of each splicing event. Specifically, MARVEL models each splicing event as a beta distribution and estimates the alpha and beta parameters using the maximum likelihood approach. Based on the parameters' values, each splicing event can be categorized sequentially into their respective modality as follows:
(1) Bimodal (PSI $$\approx$$ 0, 1): $$\alpha$$ < 0.5 | $$\beta$$ < 0.5
(2) Included (PSI $$\approx$$ 1): $$\alpha$$ > $$\beta$$
(3) Excluded (PSI $$\approx$$ 0): $$\alpha$$ < $$\beta$$
(4) Middle (PSI $$\approx$$ 0.5): $$\alpha$$ > 1 & $$\beta$$ > 1 & $$\alpha$$ = $$\beta$$
(5) Multimodal (uniform): $$\alpha$$ = $$\beta$$ = 1

Figure 2: (Left) Representative PSI distribution of each modality. Each red diamond respresents the average PSI value across all single cells and the expected value for the corresponding bulk-level sample (Right) Illustration of each modality definition

Herein lies the advantage of studying splicing at single-cell resolution compared to that of bulk-level. Because bulk represents the average PSI across all cells, it would be virtually impossible to distinguish splicing events with bimodal, middle, and multimodal patterns.

## Sub-modalities

To provide finer resolution to the included and excluded modalities, we further stratify these two modalities into primary and dispersed, depending on their on how spread out the PSI values are (variance; see Figure 2 above). The default variance value to distinguish between these two sub-modalities is 0.001. This threshold may be customized by the user.

Figure 3: To demonstrate the various variance values of included modality

Figure 4: To demonstrate the various variance values of excluded

## Identifying false bimodals

False bimodal splicing patterns are characterized by an imbalance in the proportion of cells at the ends of the PSI range, i.e. at 0 and 1. This may be due, but not limited, to amplification bias, rare sub-population of cells within an expected homogenous cell population, and artifacts arising from library preparation and next-generation sequencing (Buen Abad Najar et al, 2020).

To formally distinguish true from false bimodal distributions, we compared PSI distribution of splicing events obtained from RNA-sequencing and compared them to that of from qPCR and smFISH (small molecule fluorescent in situ hybridization) experimental validations. We further included true bimodal distributions identified from genes with at least 10 cDNA molecules captured as recommended previously (Buen Abad Najar et al, 2020). We included 3 independent studies for this exercise (Song et al., 2017, Trapnell et al., 2014, Linker et al., 2019).

Figure 5: Representative examples of false and true bimodal patterns. First row indicates the gene name. Second row indicates the splicing event type. Third row represents the cell type. Notice that false bimodals tend to be funnel-shaped where there is a disproportionate amount of cells at one end compared to the other end. On the other hand, true bimodals tend to be hour-shaped with roughly equal proportion of cells at each end. iPSC: Induced pluripotent stem cells. MN: Motor neurons. NPC: Neural progenitor cells.

## Running the code

The function to assign modality to each splicing event across a group of cells is AssignModality.

# Assign modality
marvel <- AssignModality(
MarvelObject=marvel, # Marvel object
cell.type="iPSC",    # Cell type to include for analysis
n.cells=25,          # Min. no. of cells PSI != NA required
sigma.sq=0.001,      # Variance value below which sub-modality is primary,
# above which sub-modality is dispersed
bimodal.adjust=TRUE, # Detect and rectify false bimodals
seed=1               # Ensures MLE model returns reproducible parameter values
)

## [1] "Checking... Matrix column (sample) names match sample metadata"
## [1] "Checking... Matrix row (feature) names match feature metadata"

marvel$Modality$Results[1:5,]

##                                                                          tran_id
## SE.1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## SE.2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## SE.3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## SE.4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## SE.5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##      event_type            gene_id gene_short_name      gene_type n.cells
## SE.1         SE  ENSG00000006451.8            RALA protein_coding      80
## SE.2         SE ENSG00000105568.18         PPP2R1A protein_coding      80
## SE.3         SE ENSG00000128739.22           SNRPN protein_coding      84
## SE.4         SE ENSG00000128739.22           SNRPN protein_coding      84
## SE.5         SE ENSG00000054118.15          THRAP3 protein_coding      74
## SE.1 Excluded   Excluded.Primary     Excluded.Primary
## SE.2 Excluded   Excluded.Primary     Excluded.Primary
## SE.3 Excluded Excluded.Dispersed   Excluded.Dispersed
## SE.4 Excluded Excluded.Dispersed   Excluded.Dispersed
## SE.5 Excluded   Excluded.Primary     Excluded.Primary


## Understanding the output

The results are returned to the $Modality$Results slot. (1) n.cells column indicates the no. of cells expressing the splicing event.
(2) modality column indicates the 5 basic modalities.
(3) modality.var column indicates the extended modalities as proposed by us.
(4) modality.bimodal.adj column indicates the extended modalities whose false bimodals have been identified and reclassified into either included or excluded modalities.

# 6. Differential splicing analysis

## Principles

The aim of single-cell differential gene analysis is to determine if there is a statistically significant difference between the means of different group of cells. The non-parametric wilcoxon test and parametric t-test are suitable for comparing two groups of cells. The non-parametric Kruskal-Wallis test and parametric analysis of variance (ANOVA) are suitable for comparing two or more groups of cells. In the case of gene expression values, variance is a measure of uncertainty. The larger the variance, the lower the statistical power to detect a significant difference in means between groups of cells.

In contrast, comparing the means of the PSI distributions between groups of cells alone is insufficient. For example, bimodal, middle, and multimodal all have PSI $$\approx$$ 0.5 but have clearly different PSI distributions and therefore are categorized as different modalities. Moreover, the middle modality has different variance compared to both bimodal and multimodal modalities. Hence, in splicing analysis, variance is informative.

Therefore, the aim of single-cell differential splicing analysis is to determine if the distribution of PSI values of one group of cells is significantly different from another group of cells. A suitable statistical method for comparing the PSI distribution between groups of single cell is the Kolmogorov-Smirnov test. Other statistical test options available in MARVEL are wilcoxon and t-tests.

## Running the code

The function to perform differential splicing and gene expression analysis is CompareValues. Set the level argument to "splicing" or "gene" for differential splicing or gene expression analysis, respectively.

## Understanding the output

The results for differential splicing or gene expression analysis are returned to $DE$PSI or DE\$Gene splot, respectively.
(1) n.cells column indicates no. of cells expressing the splicing event (PSI != NA) or gene (value != 0).
(2) mean.g1 and mean.g2 columsn indicate mean values for first and second cell type, respectively, as specified in the cell.type argument.
(3) mean.diff / mean.fd = mean.g2 - mean.g1.
(4) p.val column indicates unadjusted p-values from statistical test as specified in method argument.
(5) p.val.adj column indicates adjusted p-values as specified in method.adj argument.

# 7. Final remarks

## How MARVEL was benchmarked

We selected three datasets to orthogonally validate MARVEL. The results of our benchmarking exercise are described under Appendix.
(1) The first dataset consists of single and matched-bulk induced pluripotent stem cells, motor neurons, and neural progenitor cells. Single cells were prepared using SMARTer Ultra Low RNA cDNA Synthesis Kit and sequenced on Illumina HiSeq2000 in 100bp PE mode (Song et al., 2017).
(2) The second dataset consists of single and matched-bulk cultured myoblast harvested at 0-, 24-, 48, and 72-hrs. Single cells were prepared using SMARTer Ultra Low RNA cDNA Synthesis Kit and sequenced on Illumina HiSeq2500 in 100bp PE mode (Trapnell et al., 2014).
(3) The third dataset consists of single induced pluripotent stem cells and endoderm cells. Single cells were prepared using Smart-seq2 and sequenced on Illumina HiSeq2000 in 125bp PE mode (Linker et al., 2019).

We compared MARVEL against two other peer-reviewed single-cell splicing softwares, namely BRIE and Expedition. Notably, MARVEL includes features that were not available or limited in either BRIE or Expedition (Please refer to comparison table under Introduction). Therefore, MARVEL is a more comprehensive package compared to either BRIE or Expedition alone for single-cell splicing analysis.

It is noteworthy that although MARVEL was benchmarked using single-cell data prepared using Smart-seq2 and SMARTer protocols, MARVEL can in principle be used to quantify and perform differential splicing analysis from any Illumina short-read RNA-sequencing experiments. These include 5'- and 3'-biased datasets such as those generated using 10x Genomics.

## Companion of MARVEL

Previously, we developed VALERIE for in silico validation of splicing events at single-cell resolution. VALERIE is an acronym for Visualizing ALternative splicing Events from single-cell RIbonucleic-acid (RNA)-sequencing Experiments. Users may use MARVEL to quantify splicing events and identify splicing events that are differentially expressed between groups of single cells. These significant splicing events can be visually validated using VALERIE to further narrow down true positives for downstream analysis and functional validation. Our paper on VALERIE has been published in PLoS Computational Biology (Wen et al., 2020b).

# 8. Appendix

## Compute percent spliced-in (PSI)

Figure 6: Correlation of single-cell PSI values for skipped-exon (SE) between BRIE, Expedition, and MARVEL. Overall correlation between the softwares are high (median > 0.80 for all comparisons across all three datasets). BRIE combines sequence features predictive of PSI together with sequencing reads information to infer PSI values. This is known as the Bayesian approach. On the other hand, Expedition and MARVEL use sequencing reads to directly compute PSI values. This is known as the frequentist approach. Therefore, correlaton between frequentist approaches (e.g. MARVEL vs Expedition) is higher compared to that of frequentist vs Bayesian approach (e.g. Expedition vs BRIE, MARVEL vs BRIE). PSI: Percent spliced-in.

Figure 7: Correlation of single-cell average PSI values vs matched-bulk PSI values for skipped-exon (SE) for BRIE, Expedition, and MARVEL. Overall correlation for all softwares are high (median > 0.75 all two datasets). Notably, MARVEL has the highest single-cell vs matched-bulk correlation, followed by BRIE and Expedition. PSI: Percent spliced-in.

## Correcting false bimodals

Figure 8: A previous study recommended including only genes with at least 10 mRNA molecule counts for modality analysis in order to eliminate false bimodal patterns (Buen Abad Najar et al, 2020). However, using this stringent threshold removes a significant proportion of genes for modality analysis. More genes become ineligible when a higher no. of cells are specified for modality analysis. iPSC: Induced pluripotent stem cells. MN: Motor neurons. NPC: Neural progenitor cells.

Figure 9: To distinguish between true and false bimodal patterns, we first identified a set of true and false bimodal patterns as controls (please see Assign modalities: Identifying false bimodals). We found the (Left) ratio of 3 and (Right) differences of 45% between the proportion of cells at one end over the other end are able to make this distinction.