1. Introduction

Alternative splicing represents an additional and under-appreciated layer of complexity underlying gene expression profiles. More recently, technological advances in library preparation methodologies enabled capturing and amplification of full-length cDNAs from single cells. Thus, paving the way for splicing analysis at single-cell resolution.

Nevertheless, single-cell splicing analysis comes with its own set of challenges including, but not limited to, low coverage of lowly-expressed genes, low capture rate of cDNA molecules, and amplification bias (Wen et al., 2020). To date, there remains a paucity of peer-reviewed softwares available for single-cell splicing analysis. Notable examples are BRIE (Huang & Sanguinetti, 2017) and Expedition (Song et al., 2017).

Here, we introduce MARVEL (Modality Assessment to ReVEaL alternative splicing dynamics at single-cell resolution). MARVEL includes features that complement existing softwares in order to more comprehensively describe and reveal splicing dynamics at single-cell resolution.

The following is a quick comparison against two other leading published softwares for single-cell alternative splicing analysis:

plot of chunk unnamed-chunk-1

* BRIE incorporates sequence features to model PSI values. These features were identified and trained on human and mouse data only.

Objectives

The main objectives of MARVEL are

(1) Compute PSI values for all five main exon-level splicing events, i.e. skipped-exon (SE), mutually-exclusive exons (MXE), retained-intron (RI), alternative 5' splice site (A5SS), and alternative 3' splice site (A3SS).

(2) Stratify PSI distribution for each splicing event into the five main modalities, i.e. included, excluded, bimodal, middle, and multimodal. Further stratify included and excluded into primary and dispersed sub-modalities.

(3) Perform differential splicing analysis and identify network of genes which are coordinately spliced.

(4) Integrate both splicing and gene expression data to compare and contrast splicing and gene expression profiles.

Additional resource

The main aims of this vignette is to highlight the principles and technicalities behind MARVEL and to show case the main functionalities of MARVEL. We also provide results from our benchmarking exercise under the Appendix section.

We refer our prospective users to the comprehensive tutorial on using MARVEL to extract biological insights from single-cell alternative splicing analysis here: https://wenweixiong.github.io/MARVEL.html

Example dataset

In the examples that follow, we will use published single induced pluripotent stem cells (iPSC) and endoderm cells. Single cells were prepared using Smart-seq2 and sequenced on HiSeq2000 on 125-bp paired-end (PE) mode (Linker et al., 2018).

Installing the package

Install package from Github

library(devtools)
install_github("wenweixiong/MARVEL")

Or install package from CRAN

install.packages("MARVEL")

Loading the package

# Load package
library(MARVEL)

2. Creating a Marvel object

MARVEL uses S3 object-oriented system (OOS) to allow convenient data manipulation by users. Therefore, the first step is to create the S3 object. We will use CreateMarvelObject to create our S3 object. Arguments required are:
(1) SplicePheno Sample metadata. Mandatory columns are sample.id and cell.type.
(2) SpliceJunction Splice junction counts.
(3) SpliceFeature Splicing event metadata. Each element in the list represents a data frame corresponding to a specific splicing event type. Mandatory columns are tran_id and gene_id. Names of each element in the list should reflect the splicing event type, i.e. SE, MXE, RI, A5SS, and A3SS.

It is noteworthy that MARVEL is agnostic with regards to splice junction and splicing event detection. Hence, users have the freedom to use their preferred softwares to detect splice junctions and splicing events. Here, splice junction counts and splicing events were generated using STAR (Dobin et al., 2013) and Stringtie2-rMATS (Kovaka et al., 2019; Shen et al., 2014), respectively.

Additionally, it is highly recommended to include gene expression data (normalised and log-transformed) as MARVEL has useful functionalities to compare and contrast gene expression profiles.

# Read splicing files
  # Sample metadata
  path_to_file <- system.file("extdata", "SE_phenoData.txt", package="MARVEL")
  df.pheno <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")

  # Subset samples that passed sequencing QC
  df.pheno <- df.pheno[which(df.pheno$qc.seq=="pass"), ]
  df.pheno[1:5,]
##    sample.id cell.type sample.type qc.seq
## 1 ERR1562083   Unknown Single Cell   pass
## 2 ERR1562084      iPSC Single Cell   pass
## 3 ERR1562085      iPSC Single Cell   pass
## 4 ERR1562086      iPSC Single Cell   pass
## 5 ERR1562087      iPSC Single Cell   pass
  # Splice junction file
  path_to_file <- system.file("extdata", "SJ.txt", package="MARVEL")
  sj <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
  sj[1:5,1:5]
##              coord.intron ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1  chr7:39696860:39697407          1          0          0          0
## 2 chr19:52190175:52200273          0          0          0          0
## 3 chr15:24962210:24967028         30         14         26         30
## 4  chr1:36224506:36224970          2          0          0          0
## 5 chr10:78037305:78037438         14         29         28         14
  # Splicing event metadata
  df.feature.list <- list()
  path_to_file <- system.file("extdata", "SE_featureData.txt", package="MARVEL")
  df.feature.list[[1]] <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
  names(df.feature.list) <- "SE"
  df.feature.list[["SE"]][1:5,]
##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##              gene_id gene_short_name      gene_type
## 1  ENSG00000006451.8            RALA protein_coding
## 2 ENSG00000105568.18         PPP2R1A protein_coding
## 3 ENSG00000128739.22           SNRPN protein_coding
## 4 ENSG00000128739.22           SNRPN protein_coding
## 5 ENSG00000054118.15          THRAP3 protein_coding
# Read gene files
  # featureData
  path_to_file <- system.file("extdata", "TPM_featureData.txt", package="MARVEL")
  df.tpm.feature <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
  df.tpm.feature[1:5,]
##              gene_id gene_short_name      gene_type
## 1 ENSG00000004864.13        SLC25A13 protein_coding
## 2  ENSG00000006451.8            RALA protein_coding
## 3 ENSG00000052802.13           MSMO1 protein_coding
## 4 ENSG00000054118.15          THRAP3 protein_coding
## 5 ENSG00000078140.14           UBE2K protein_coding
  # phenoData
  path_to_file <- system.file("extdata", "SE_phenoData.txt", package="MARVEL")
  df.tpm.pheno <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")

  # Normalised expression matrix
  path_to_file <- system.file("extdata", "TPM.txt", package="MARVEL")
  df.tpm <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")

  # Log-transform values
  df.tpm[,-1] <- log2(df.tpm[,-1])
  df.tpm[,-1][df.tpm[,-1] < 1] <- 0

  df.tpm[1:5,1:5]
##              gene_id ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 ENSG00000004864.13   5.066520   3.795975   4.614710   5.044831
## 2  ENSG00000006451.8   7.526147   5.734710   5.250962   7.148121
## 3 ENSG00000052802.13   8.577542   6.409730   6.644145  10.211183
## 4 ENSG00000054118.15   5.418527   5.681449   6.637494   6.673698
## 5 ENSG00000078140.14   7.924040   7.545042   7.702796   6.804905
  # Subset samples that passed sequencing QC
  df.tpm.pheno <- df.tpm.pheno[which(df.pheno$qc.seq=="pass"), ]
  df.tpm.pheno[1:5,]
##    sample.id cell.type sample.type qc.seq
## 1 ERR1562083   Unknown Single Cell   pass
## 2 ERR1562084      iPSC Single Cell   pass
## 3 ERR1562085      iPSC Single Cell   pass
## 4 ERR1562086      iPSC Single Cell   pass
## 5 ERR1562087      iPSC Single Cell   pass
  df.tpm <- df.tpm[, which(names(df.tpm) %in% c("gene_id", df.tpm.pheno$sample.id))]

# Create Marvel object
marvel <- CreateMarvelObject(
            SplicePheno=df.pheno,          # Sample metadata
            SpliceJunction=sj,             # Splice junction counts 
            SpliceFeature=df.feature.list, # Splicing event metadata
            GenePheno=df.tpm.pheno,        # Sample metadata
            GeneFeature=df.tpm.feature,    # Gene metadata
            Exp=df.tpm                     # Gene expression matrix
            )

3. Calculate percent spliced-in (PSI)

Principles

Percent spliced-in (PSI) is the fraction of reads supporting the included isoform over the number of reads supporting both included and excluded isoforms. Included isoforms are isoforms that include the alternative exon. Hence, PSI is a measure of alternative exon inclusion (“spliced-in”).

To date, peer-reviewed softwares focused only on SE and MXE events. MARVEL expands exon-level splicing analysis to include all five main exon-level splicing events, namely SE, MXE, RI, A5SS, and A3SS. Other than SE, the other 4 splicing types have been reported to play important roles in both health and disease too. For example, RI is known to regulate gene expression through nonsense-mediated decay (Smart et al., 2018).

First, MARVEL will cross-check the splicing events with the splice junctions provided. Only splicing events whose splice junctions are found in the splice junction file are retained. This ensures only high-quality splicing events are included for PSI calculation and downstream analysis.

Furthermore, MARVEL allows users to set a coverage threshold using the CoverageThreshold argument, above which the splicing event is included for PSI calculation. For example, if the coverage threshold is set at 10, then MARVEL will only include the splicing event if all included or excluded junctions involved in calculating the splicing events are supported by at least 10 or more reads. NA in the PSI matrix returned are splicing events whose sample that did not have sufficient reads, i.e. lower number of reads than that specified by the user. The coverage threshold of 10 has been used multiple times in previous single-cell studies for selecting splicing events for downstream analysis (Song et al., 2017; Buen Abad Najar et al, 2020).

plot of chunk unnamed-chunk-6

Figure 1: Alternative splicing event types and their respective PSI formula. PSI: Percent splice-in

Running the code

The function to compute PSI is ComputePSI. Specify "SE", "MXE", "RI", "A5SS", or "A3SS" in the EventType argument to compute the PSI for a specific event type. The options for this function are:
(1) MarvelObject Marvel object generated from CreateMarvelObject.
(2) CoverageThreshold Coverage threshold below which the PSI of the splicing event will be censored, i.e. annotated as missing (NA). Coverage defined as the total number of reads supporting both included and excluded isoforms.

# Validate and filter splicing events, compute PSI
marvel <- ComputePSI.SE(MarvelObject=marvel, CoverageThreshold=10)
## [1] "Analysing 20 splicing events"
## [1] "20 splicing events validated and quantified"
# Check validated splicing events
marvel$SpliceFeatureValidated$SE[1:5,]
##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##   event_type            gene_id gene_short_name      gene_type
## 1         SE  ENSG00000006451.8            RALA protein_coding
## 2         SE ENSG00000105568.18         PPP2R1A protein_coding
## 3         SE ENSG00000128739.22           SNRPN protein_coding
## 4         SE ENSG00000128739.22           SNRPN protein_coding
## 5         SE ENSG00000054118.15          THRAP3 protein_coding
# Check computed PSI values
marvel$PSI$SE[1:5,1:5]
##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##    ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 0.002544529 0.00000000 0.00000000 0.00000000
## 2 0.000000000 0.00000000 0.00000000 0.00000000
## 3 0.115254237 0.03184713 0.17868339 0.06486486
## 4 0.059459459 0.02250804 0.09027778 0.06989247
## 5 0.045454545 0.00000000         NA 0.00000000

Understanding the output

Two data frames are returned and saved into the following slots of the Marvel object.
(1) SpliceFeatureValidated$SE Validated splicing events.
(2) PSI$SE PSI matrix where rows represent splicing events and columns represent samples.

4. Pre-computed PSI

Users may also use their own splicing events and PSI values. For example, using pre-validated splicing events and pre-computed PSI values from external softwares or that of previously generated using MARVEL.

# Read sample metadata
path_to_file <- system.file("extdata", "SE_phenoData.txt", package="MARVEL")
df.pheno <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")

# Subset samples that passed sequencing QC
df.pheno <- df.pheno[which(df.pheno$qc.seq=="pass"), ]
df.pheno[1:5,]
##    sample.id cell.type sample.type qc.seq
## 1 ERR1562083   Unknown Single Cell   pass
## 2 ERR1562084      iPSC Single Cell   pass
## 3 ERR1562085      iPSC Single Cell   pass
## 4 ERR1562086      iPSC Single Cell   pass
## 5 ERR1562087      iPSC Single Cell   pass
# Read pre-validated splicing event metadata
df.feature.list <- list()
path_to_file <- system.file("extdata", "SE_featureDataValidated.txt", package="MARVEL")
df.feature.list[[1]] <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
names(df.feature.list) <- "SE"
df.feature.list[["SE"]][1:5, ]
##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##              gene_id gene_short_name      gene_type
## 1  ENSG00000006451.8            RALA protein_coding
## 2 ENSG00000105568.18         PPP2R1A protein_coding
## 3 ENSG00000128739.22           SNRPN protein_coding
## 4 ENSG00000128739.22           SNRPN protein_coding
## 5 ENSG00000054118.15          THRAP3 protein_coding
# Read PSI file (pre-computed)
df.list <- list()
path_to_file <- system.file("extdata", "SE.txt", package="MARVEL")
df.list[[1]] <- read.table(path_to_file, sep="\t", header=TRUE, stringsAsFactors=FALSE, na.strings="NA")
names(df.list) <- "SE"
df.list[["SE"]][1:5,1:5]
##                                                                       tran_id
## 1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## 2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## 3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## 4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## 5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##    ERR1562083 ERR1562084 ERR1562085 ERR1562086
## 1 0.002544529 0.00000000 0.00000000 0.00000000
## 2 0.000000000 0.00000000 0.00000000 0.00000000
## 3 0.115254237 0.03184713 0.17868339 0.06486486
## 4 0.059459459 0.02250804 0.09027778 0.06989247
## 5 0.045454545 0.00000000         NA 0.00000000
# Create Marvel object
marvel.temp <- CreateMarvelObject(
            SplicePheno=df.pheno,                    # Sample metadata
            SpliceFeatureValidated=df.feature.list,  # Validated splicing event metadata
            PSI=df.list,                             # Pre-computed PSI matrices
            GenePheno=df.tpm.pheno,                  # Sample metadata
            GeneFeature=df.tpm.feature,              # Gene metadata
            Exp=df.tpm                               # Gene expression matrix
            )

5. Assign modalities

Principles

Percent spliced-in (PSI) for a splicing event take values between 0 and 1. An average PSI of 1 means that majority of the cells express the included isoform. Conversely, an average PSI of 0 means that majority of the cells express the excluded isoform. Finally, an average PSI of 0.5 means that cells overall express both included and excluded isoforms in roughly equal proportions (Figure 2).

Based on the PSI distribution, each splicing event can be categorized into 5 modalities: Included, excluded, bimodal, middle, multimodal. To this end, MARVEL uses fitdistr function from fitdistrplus R package to determine the modality of each splicing event. Specifically, MARVEL models each splicing event as a beta distribution and estimates the alpha and beta parameters using the maximum likelihood approach. Based on the parameters' values, each splicing event can be categorized sequentially into their respective modality as follows:
(1) Bimodal (PSI \(\approx\) 0, 1): \(\alpha\) < 0.5 | \(\beta\) < 0.5
(2) Included (PSI \(\approx\) 1): \(\alpha\) > \(\beta\)
(3) Excluded (PSI \(\approx\) 0): \(\alpha\) < \(\beta\)
(4) Middle (PSI \(\approx\) 0.5): \(\alpha\) > 1 & \(\beta\) > 1 & \(\alpha\) = \(\beta\)
(5) Multimodal (uniform): \(\alpha\) = \(\beta\) = 1

plot of chunk unnamed-chunk-9

Figure 2: (Left) Representative PSI distribution of each modality. Each red diamond respresents the average PSI value across all single cells and the expected value for the corresponding bulk-level sample (Right) Illustration of each modality definition

Herein lies the advantage of studying splicing at single-cell resolution compared to that of bulk-level. Because bulk represents the average PSI across all cells, it would be virtually impossible to distinguish splicing events with bimodal, middle, and multimodal patterns.

Sub-modalities

To provide finer resolution to the included and excluded modalities, we further stratify these two modalities into primary and dispersed, depending on their on how spread out the PSI values are (variance; see Figure 2 above). The default variance value to distinguish between these two sub-modalities is 0.001. This threshold may be customized by the user.

plot of chunk unnamed-chunk-10

Figure 3: To demonstrate the various variance values of included modality

plot of chunk unnamed-chunk-11

Figure 4: To demonstrate the various variance values of excluded

Identifying false bimodals

False bimodal splicing patterns are characterized by an imbalance in the proportion of cells at the ends of the PSI range, i.e. at 0 and 1. This may be due, but not limited, to amplification bias, rare sub-population of cells within an expected homogenous cell population, and artifacts arising from library preparation and next-generation sequencing (Buen Abad Najar et al, 2020).

To formally distinguish true from false bimodal distributions, we compared PSI distribution of splicing events obtained from RNA-sequencing and compared them to that of from qPCR and smFISH (small molecule fluorescent in situ hybridization) experimental validations. We further included true bimodal distributions identified from genes with at least 10 cDNA molecules captured as recommended previously (Buen Abad Najar et al, 2020). We included 3 independent studies for this exercise (Song et al., 2017, Trapnell et al., 2014, Linker et al., 2019).

plot of chunk unnamed-chunk-12

Figure 5: Representative examples of false and true bimodal patterns. First row indicates the gene name. Second row indicates the splicing event type. Third row represents the cell type. Notice that false bimodals tend to be funnel-shaped where there is a disproportionate amount of cells at one end compared to the other end. On the other hand, true bimodals tend to be hour-shaped with roughly equal proportion of cells at each end. iPSC: Induced pluripotent stem cells. MN: Motor neurons. NPC: Neural progenitor cells.

Running the code

The function to assign modality to each splicing event across a group of cells is AssignModality.

# Assign modality
marvel <- AssignModality(
            MarvelObject=marvel, # Marvel object
            cell.type="iPSC",    # Cell type to include for analysis
            n.cells=25,          # Min. no. of cells PSI != NA required 
            sigma.sq=0.001,      # Variance value below which sub-modality is primary,
                                                # above which sub-modality is dispersed
            bimodal.adjust=TRUE, # Detect and rectify false bimodals
            seed=1               # Ensures MLE model returns reproducible parameter values
            )
## [1] "Checking... Matrix column (sample) names match sample metadata"
## [1] "Checking... Matrix row (feature) names match feature metadata"
marvel$Modality$Results[1:5,]
##                                                                          tran_id
## SE.1    chr7:39696685:39696859:+@chr7:39697408:39697510:+@chr7:39706123:39708120
## SE.2 chr19:52190048:52190174:+@chr19:52200274:52200398:+@chr19:52201944:52202034
## SE.3 chr15:24962114:24962209:+@chr15:24967029:24967152:+@chr15:24967932:24968082
## SE.4 chr15:24962114:24962209:+@chr15:24967029:24967240:+@chr15:24967932:24968082
## SE.5    chr1:36224432:36224505:+@chr1:36224971:36225102:+@chr1:36259382:36259484
##      event_type            gene_id gene_short_name      gene_type n.cells
## SE.1         SE  ENSG00000006451.8            RALA protein_coding      80
## SE.2         SE ENSG00000105568.18         PPP2R1A protein_coding      80
## SE.3         SE ENSG00000128739.22           SNRPN protein_coding      84
## SE.4         SE ENSG00000128739.22           SNRPN protein_coding      84
## SE.5         SE ENSG00000054118.15          THRAP3 protein_coding      74
##      modality       modality.var modality.bimodal.adj
## SE.1 Excluded   Excluded.Primary     Excluded.Primary
## SE.2 Excluded   Excluded.Primary     Excluded.Primary
## SE.3 Excluded Excluded.Dispersed   Excluded.Dispersed
## SE.4 Excluded Excluded.Dispersed   Excluded.Dispersed
## SE.5 Excluded   Excluded.Primary     Excluded.Primary

Understanding the output

The results are returned to the $Modality$Results slot. (1) n.cells column indicates the no. of cells expressing the splicing event.
(2) modality column indicates the 5 basic modalities.
(3) modality.var column indicates the extended modalities as proposed by us.
(4) modality.bimodal.adj column indicates the extended modalities whose false bimodals have been identified and reclassified into either included or excluded modalities.

6. Differential splicing analysis

Principles

The aim of single-cell differential gene analysis is to determine if there is a statistically significant difference between the means of different group of cells. The non-parametric wilcoxon test and parametric t-test are suitable for comparing two groups of cells. The non-parametric Kruskal-Wallis test and parametric analysis of variance (ANOVA) are suitable for comparing two or more groups of cells. In the case of gene expression values, variance is a measure of uncertainty. The larger the variance, the lower the statistical power to detect a significant difference in means between groups of cells.

In contrast, comparing the means of the PSI distributions between groups of cells alone is insufficient. For example, bimodal, middle, and multimodal all have PSI \(\approx\) 0.5 but have clearly different PSI distributions and therefore are categorized as different modalities. Moreover, the middle modality has different variance compared to both bimodal and multimodal modalities. Hence, in splicing analysis, variance is informative.

Therefore, the aim of single-cell differential splicing analysis is to determine if the distribution of PSI values of one group of cells is significantly different from another group of cells. A suitable statistical method for comparing the PSI distribution between groups of single cell is the Kolmogorov-Smirnov test. Other statistical test options available in MARVEL are wilcoxon and t-tests.

Running the code

The function to perform differential splicing and gene expression analysis is CompareValues. Set the level argument to "splicing" or "gene" for differential splicing or gene expression analysis, respectively.

Understanding the output

The results for differential splicing or gene expression analysis are returned to $DE$PSI or DE$Gene splot, respectively.
(1) n.cells column indicates no. of cells expressing the splicing event (PSI != NA) or gene (value != 0).
(2) mean.g1 and mean.g2 columsn indicate mean values for first and second cell type, respectively, as specified in the cell.type argument.
(3) mean.diff / mean.fd = mean.g2 - mean.g1.
(4) p.val column indicates unadjusted p-values from statistical test as specified in method argument.
(5) p.val.adj column indicates adjusted p-values as specified in method.adj argument.

7. Final remarks

How MARVEL was benchmarked

We selected three datasets to orthogonally validate MARVEL. The results of our benchmarking exercise are described under Appendix.
(1) The first dataset consists of single and matched-bulk induced pluripotent stem cells, motor neurons, and neural progenitor cells. Single cells were prepared using SMARTer Ultra Low RNA cDNA Synthesis Kit and sequenced on Illumina HiSeq2000 in 100bp PE mode (Song et al., 2017).
(2) The second dataset consists of single and matched-bulk cultured myoblast harvested at 0-, 24-, 48, and 72-hrs. Single cells were prepared using SMARTer Ultra Low RNA cDNA Synthesis Kit and sequenced on Illumina HiSeq2500 in 100bp PE mode (Trapnell et al., 2014).
(3) The third dataset consists of single induced pluripotent stem cells and endoderm cells. Single cells were prepared using Smart-seq2 and sequenced on Illumina HiSeq2000 in 125bp PE mode (Linker et al., 2019).

We compared MARVEL against two other peer-reviewed single-cell splicing softwares, namely BRIE and Expedition. Notably, MARVEL includes features that were not available or limited in either BRIE or Expedition (Please refer to comparison table under Introduction). Therefore, MARVEL is a more comprehensive package compared to either BRIE or Expedition alone for single-cell splicing analysis.

It is noteworthy that although MARVEL was benchmarked using single-cell data prepared using Smart-seq2 and SMARTer protocols, MARVEL can in principle be used to quantify and perform differential splicing analysis from any Illumina short-read RNA-sequencing experiments. These include 5'- and 3'-biased datasets such as those generated using 10x Genomics.

Companion of MARVEL

Previously, we developed VALERIE for in silico validation of splicing events at single-cell resolution. VALERIE is an acronym for Visualizing ALternative splicing Events from single-cell RIbonucleic-acid (RNA)-sequencing Experiments. Users may use MARVEL to quantify splicing events and identify splicing events that are differentially expressed between groups of single cells. These significant splicing events can be visually validated using VALERIE to further narrow down true positives for downstream analysis and functional validation. Our paper on VALERIE has been published in PLoS Computational Biology (Wen et al., 2020b).

8. Appendix

Compute percent spliced-in (PSI)

plot of chunk unnamed-chunk-15

Figure 6: Correlation of single-cell PSI values for skipped-exon (SE) between BRIE, Expedition, and MARVEL. Overall correlation between the softwares are high (median > 0.80 for all comparisons across all three datasets). BRIE combines sequence features predictive of PSI together with sequencing reads information to infer PSI values. This is known as the Bayesian approach. On the other hand, Expedition and MARVEL use sequencing reads to directly compute PSI values. This is known as the frequentist approach. Therefore, correlaton between frequentist approaches (e.g. MARVEL vs Expedition) is higher compared to that of frequentist vs Bayesian approach (e.g. Expedition vs BRIE, MARVEL vs BRIE). PSI: Percent spliced-in.

plot of chunk unnamed-chunk-16

Figure 7: Correlation of single-cell average PSI values vs matched-bulk PSI values for skipped-exon (SE) for BRIE, Expedition, and MARVEL. Overall correlation for all softwares are high (median > 0.75 all two datasets). Notably, MARVEL has the highest single-cell vs matched-bulk correlation, followed by BRIE and Expedition. PSI: Percent spliced-in.

Correcting false bimodals

plot of chunk unnamed-chunk-17

Figure 8: A previous study recommended including only genes with at least 10 mRNA molecule counts for modality analysis in order to eliminate false bimodal patterns (Buen Abad Najar et al, 2020). However, using this stringent threshold removes a significant proportion of genes for modality analysis. More genes become ineligible when a higher no. of cells are specified for modality analysis. iPSC: Induced pluripotent stem cells. MN: Motor neurons. NPC: Neural progenitor cells.

plot of chunk unnamed-chunk-18

Figure 9: To distinguish between true and false bimodal patterns, we first identified a set of true and false bimodal patterns as controls (please see Assign modalities: Identifying false bimodals). We found the (Left) ratio of 3 and (Right) differences of 45% between the proportion of cells at one end over the other end are able to make this distinction.