--- title: 'Main vignette: Aberrant expression and splicing analysis' author: - name: Alexandre Segers - name: Jeroen Gilis - name: Mattias Van Heetvelde - name: Elfride De Baere - name: Lieven Clement bibliography: saseR.bib date: "23/08/2023" output: BiocStyle::html_document: toc: true toc_depth: 3 BiocStyle::pdf_document: default package: saseR abstract: | Main vignette for the saseR package. This vignette aims to provide a detailed description of an aberrant expression and splicing analysis using saseR. Also, we show how to perform differential usage analysis using adapted offsets in DESeq2 and edgeR. vignette: > %\VignetteIndexEntry{Main vignette: saseR analyses} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} library(knitr) ``` # Introduction `saseR` is a R package to perform aberrant expression and splicing analyses in bulk RNA-seq datasets. It uses adapted offsets to model aberrant splicing with the negative binomial framework, and estimates the parameters on a scalable and efficient manner. In this vignette, we first show how the different functions of the `saseR` package can be used, after which we show how to perform `differential usage` analysis with adapted offsets in conventional bulk RNA-seq methods DESeq2 and edgeR. All details about the `saseR` model, statistical tests, the use of adapted offsets to model proportions and the adapted parameter estimation are described in our preprint [@Segers2023]. In this vignette, we analyse the data from the `ASpli` package [@ASpli], to show how a workflow for aberrant expression and splicing is done. However, we note that this dataset is not specific for aberrant expression or splicing analyses. The data consists of BAM-files which will be converted into gene, bin,and junction reads, on which three different `saseR` analysis will be performed (aberrant expression and two times an aberrant splicing analysis). Next, we show how a differential usage/splicing analysis can be done using DESeq2 [@Love] and edgeR [@Robinson] with adapted offsets. # Aberrant expression and splicing workflow The package has two different categories of functions, with respectively two and three functions. First, BAM-files need to be transformed to gene, bin or junctions counts. This is done by: 1. `BamtoAspliCounts`, which transforms BAM-files to different counts using adapted functions of the `ASpli` package. The different counts are given in a `SummarizedExperiment` object in the metadata slots. 2. `convertAspli` selects the relevant counts (gene, bins or junctions) to be further used in the analyses. Then, the chosen counts can be further used to perform an aberrant expression or splicing analysis with the following functions: 3. `calculateOffsets`, which calculates the offset matrix used in the analysis. These offsets can be ordinary `edgeR` or `DESeq2` offsets when performing an aberrant expression analysis, or can be the total bin or junction counts aggregated per gene when performing an aberrant splicing analysis. 4. `saseRfindEncodingDim` estimates the optimal number of latent factors to control for unknown confounders. This number of latent factors is then further used in the final fit. 5. `saseRfit` calculates the mean expression and feature-specific dispersions for each feature in each sample, controlling for both known and unknown confounders. It calculates p-values which represent how extreme each observation is compared to its estimated distribution. Note that one can do a saseR analysis by using readily available count matrices, therefore not requiring to start from BAM-files. ## Package installation `saseR` can be installed from Bioconductor with: ```{r, eval=FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("saseR") ``` Alternatively, the development version of `saseR` can be downloaded with: ```{r, eval=FALSE} library(devtools) devtools::install_github("statOmics/saseR") ``` ## Load libraries ```{r, message=FALSE, warning=FALSE} library(saseR) suppressWarnings(library(ASpli)) library(edgeR) library(DESeq2) library(limma) library(GenomicFeatures) library(dplyr) library(pracma) library(PRROC) library(BiocParallel) library(data.table) library(igraph) ``` ## Transforming BAM to count matrices. The following data corresponds to the vignette data from the `ASpli` package [@ASpli]. The input data required for saseR are BAM-files with corresponding BAI files and a genome annotation in TxDb format. These BAM-files will be transformed to gene/bin/junction count matrices. When a count matrix is readily available, one can just use this to perform an aberrant expression or splicing analysis (see section 5 and 6) . First, we specify the path to our GTF file, which we will later transform to the TxDb format. Also, we include the pathnames of the BAM-files we want to include in the analysis. ```{r} gtfFileName <- ASpli::aspliExampleGTF() BAMFiles <- ASpli::aspliExampleBamList() gtfFileName head(BAMFiles) ``` Then, we define a targets dataframe, which contains the sample names, BAM-file pathsand experimental factors column. For the case of rare Mendelian disorders and outlier detection, we choose these experimental factors to be all equal to A. Note that the experimental factors included here will not further be used in the final analysis, but are needed to compute the count matrices with functions from `ASpli`. ```{r} targets <- data.frame( row.names = paste0('Sample',c(1:12)), bam = BAMFiles, f1 = rep("A",12), stringsAsFactors = FALSE) head(targets) ``` Next, we transform our GTF file to TxDb format, and extract the features from the annotation using the `binGenome` function from `saseR`, which is an adapted copy from `ASpli` as this function is deprecated. Note that the binGenome function can take a while. ```{r} genomeTxDb <- txdbmaker::makeTxDbFromGFF(gtfFileName) features <- binGenome(genomeTxDb, logTo = NULL) ``` Now that we have the `targets` data.frame containing the samples and BAM-file names, and the `features` extracted from the genome annotation, we can obtain the count matrices that will be used to detect aberrant expression. This can done with `BamtoAspliCounts`. Note that the counting procedure (featurecounts) can take a while. If only one type of analysis is preferred (e.g. only aberrant expression analysis on the gene counts), one can do the feature counting with a different method, and start from a count matrix at section 5 and 6. Make sure to select the right type of `libType` (`SE` for single-end reads, `PE` for paired-end reads), and the `minReadLength`, which is the minimal number of bases a read needs to be counted. ```{r} ASpliSE <- BamtoAspliCounts( features = features, targets = targets, minReadLength = 100, libType = "SE", BPPARAM = MulticoreParam(1L) ) ``` ASpliSE is a SummarizedExperiment object that contains the gene, bin and junction counts, together with different annotations in the metadata slots. We will extract these counts using the `convertASpli` function. ```{r} SEgenes <- convertASpli(ASpliSE, type = "gene") SEbins <- convertASpli(ASpliSE, type = "bin") SEjunctions <- convertASpli(ASpliSE, type = "junction") ``` ## Aberrant expression analysis We will show how to perform an aberrant expression analysis based on the gene level counts. Note that other feature counts (e.g. bin counts) can also be used to search for aberrant expression. We start from a SummarizedExperiment that contains the feature counts of interest. First, a design formula is included in the metadata slot `design`. In this example, we do not include known covariates, but only introduce an intercept. If one includes known covariates in the formula, these covariates should be included in the colData slot. ```{r} metadata(SEgenes)$design <- ~1 ``` For an aberrant expression analysis, we first retain only the genes that have sufficiently large counts. Then, we need to calculate the offsets for our analysis. For aberrant expression analysis, we recommend using `TMM` or `geommean` for edgeR and DESeq2 normalisation respectively. This results in an offsets matrix in the assays slot. Note that for aberrant splicing detection, first offsets are calculated before filtering features that have low counts. ```{r} filtergenes <- filterByExpr(SEgenes) SEgenes <- SEgenes[filtergenes,] ``` ```{r} SEgenes <- calculateOffsets(SEgenes, method = "TMM") ``` Having defined the `design` and the `offsets` of the analysis, we can continue to estimate the optimal `number of latent confounders` included in the regression framework. This is often needed when performing aberrant expression or splicing analysis to improve power. We obtain the latent factors by using `RUV` [@RUV], which performs a `singular value decomposition on the deviance residuals`. Then, we recommend using the Gavish and Donoho [@GavishDonoho] threshold for singular values (`method = GD`) to choose upon the number of latent factors included in the final regression model. This is much faster, while having similar performance compared to using a denoising autoencoder (`method = DAE`). ```{r} SEgenes <- saseRfindEncodingDim(SEgenes, method = "GD") ``` Having estimated the optimal number of latent factors to include in the regression framework, we can fit our final model, controlling for both known confounders (specified in metadata slot `design`) and the latent factors based on the deviance residuals. The final regression fit can be done with edgeR (`fit = edgeR`), but is set to default to our fast parameter estimation (`fit = fast`). This fast parameter estimation uses a overdispersed quadratic mean-variance relationship, which enables matrix multiplication while keeping parameter estimation unbiased. This is often required in the context of rare diseases that need many latent factors to control for, which leads to ordinary fitting procedures such as DESeq2 and edgeR becoming slow. If the number of confounders included in the regression model is small (known confounders + latent factors), edgeR parameter estimation could be used, although it is not expected to give an increase in performance. `analysis` is specified to `AE` to choose an aberrant expression analysis. ```{r} SEgenes <- saseRfit(SEgenes, analysis = "AE", padjust = "BH", fit = "fast") ``` This results in estimates in mean expression, feature specific dispersions and p-values calculated for each observation in each sample. Further analysing the top prioritised genes within each sample is possible using for example: ```{r} order <- apply(assays(SEgenes)$pValue, 2, order, decreasing = FALSE) topPvalsGenes <- sapply(1:ncol(SEgenes), function(x) {assays(SEgenes)$pValue[order[1:5,x],x]}) rownames(topPvalsGenes) <- NULL topGenes <- sapply(1:ncol(SEgenes), function(x) {rownames(SEgenes)[order[1:5,x]]}) significantgenes <- sapply(1:ncol(SEgenes), function(x) { rownames(SEgenes)[assays(SEgenes)$pValueAdjust[,x] < 0.05]}) head(topPvalsGenes) head(topGenes) head(significantgenes) ``` ## Aberrant splicing analysis We will show how to perform an aberrant splicing analysis based on the bin and junction level counts. Note that other feature counts can also be used if proportions of interest are to be modelled, which only requires the use of the correct offsets. We start from a SummarizedExperiment that contains the feature counts of interest. First, a design formula is included in the metadata slot `design`. In this example, we do not include known covariates, but only introduce an intercept. If one includes known covariates in the formula, these covariates should be included in the colData slot. ```{r} metadata(SEbins)$design <- ~1 metadata(SEjunctions)$design <- ~1 ``` Then, we need to calculate the offsets for our analysis. For an aberrant splicing analysis, `AS` or `proportion` should be used as method to calculate offsets that allow to model proportions. This results in an offsets matrix in the assays slot. We specificy the column in the rowData slot which is used to aggregate feature counts to obtain proper offsets (`aggregation`). This is the `locus` column for the bins analysis, and the `symbol` or `ASpliCluster` column for a junctions analysis. For the junctions analysis, it is also possible to use aggregation based on `symbol` (= gene) whenever an annotated gene is available, and switch to `ASpliCluster` when the junction is not linked to a specific gene. This is done by setting `mergeGeneASpli` equal to TRUE. Note that a new column in the rowData will be made, `locus`, which specifies the used aggregation. Note that also other rowData columns can be used to aggregate feature counts, and also a predefined offsets matrix can be included as `offsets` in the assays slot. ```{r} SEbins <- calculateOffsets(SEbins, method = "AS", aggregation = "locus") SEjunctions <- calculateOffsets(SEjunctions, method = "AS", aggregation = "symbol", mergeGeneASpli = TRUE) ``` Then, after calculating the offsets for the aberrant splicing analysis, we filter the features that have very low counts. Note that for aberrant expression analyses, first the features are filtered, followed by calculating the offsets. ```{r} filterbins <- filterByExpr(SEbins) SEbins <- SEbins[filterbins,] filterjunctions <- filterByExpr(SEjunctions) SEjunctions <- SEjunctions[filterjunctions,] ``` Having defined the `design` and the `offsets` of the analysis, we can continue to estimate the optimal `number of latent confounders` included in the regression framework. This is often needed when performing aberrant expression or splicing analysis to obtain better power. We obtain the latent factors by using `RUV` [@RUV], which performs a `singular value decomposition on the deviance residuals`. Then, we recommend using the Gavish and Donoho [@GavishDonoho] threshold for singular values (`method = GD`) to choose upon the number of latent factors included in the final regression model. This is much faster, while having similar performance compared to using a denoising autoencoder (`method = DAE`). ```{r} SEbins <- saseRfindEncodingDim(SEbins, method = "GD") SEjunctions <- saseRfindEncodingDim(SEjunctions, method = "GD") ``` Having estimated the optimal number of latent factors to include in the regression framework, we can fit our final model, controlling for both known confounders (specified in metadata slot `design`) and the latent factors based on the deviance residuals. The final regression fit can be done with edgeR (`fit = edgeR`), but is set to default to our fast parameter estimation (`fit = fast`). This fast parameter estimation uses a overdispersed quadratic mean-variance relationship, which enables matrix multiplication while keeping parameter estimation unbiased.This is often required in the context of rare diseases that need many latent factors to control for, which leads to ordinary fitting procedures such as edgeR becoming slow. If the number of confounders included in the regression model is small (known confounders + latent factors), edgeR parameter estimation could be used, although it is not expected to give an increase in performance. `analysis` is specified on `AS` to choose an aberrant splicing analysis. ```{r} SEbins <- saseRfit(SEbins, analysis = "AS", padjust = "BH", fit = "fast") SEjunctions <- saseRfit(SEjunctions, analysis = "AS", padjust = "BH", fit = "fast") ``` This results in estimates in mean expression, feature specific dispersions and p-values calculated for each observation in each sample. Further analysing the top prioritised features within each sample is possible using for example: ```{r} order <- apply(assays(SEbins)$pValue, 2, order, decreasing = FALSE) topPvalsBins <- sapply(1:ncol(SEbins), function(x) {assays(SEbins)$pValue[order[1:5,x],x]}) rownames(topPvalsBins) <- NULL topBins <- sapply(1:ncol(SEbins), function(x) {rownames(SEbins)[order[1:5,x]]}) significantBins <- sapply(1:ncol(SEbins), function(x) { rownames(SEbins)[assays(SEgenes)$pValueAdjust[,x] < 0.05]}) head(topPvalsBins) head(topBins) head(significantBins) ``` Furthermore, also aggregated p-values are calculated based on the newly formed `locus` column in the rowData slot. These are currently the minimal p-value of each feature belonging to the same aggregation group, and can be found in the metadata slot under `pValuesLocus`. ```{r} order <- apply(metadata(SEbins)$pValuesLocus, 2, order, decreasing = FALSE) topPvalsBinsAggregated <- sapply(1:ncol(SEbins), function(x) {metadata(SEbins)$pValuesLocus[order[1:5,x],x]}) rownames(topPvalsBinsAggregated) <- NULL topBinsAggregated <- sapply(1:ncol(SEbins), function(x) { rownames(metadata(SEbins)$pValuesLocus)[order[1:5,x]]}) significantBinsAggregated <- sapply(1:ncol(SEbins), function(x) { rownames(SEbins)[metadata(SEgenes)$pValuesLocus[,x] < 0.05]}) head(topPvalsBinsAggregated) head(topBinsAggregated) head(significantBinsAggregated) ``` # Differential usage/splicing using adapted offsets in DESeq2 and edgeR Here, we show how to perform a differential usage/splicing analysis, using adapted offsets in conventual bulk RNA-seq tools DESeq2/edgeR. This way, we avoid modelling both the counts and the other counts, which means that no sample-specific regression coefficients are estimated. Note that these steps are similar compared to a conventual DESeq2/edgeR analysis, where different offsets or normalisation is used. We refer to the DESeq2/edgeR vignettes for more information upon their package and functions. Also, we refer to the DEXSeq [@Anders] for more information upon the preprocessing of the annotation. ## Load libraries ```{r, message=FALSE, warning=FALSE} library(ASpli) library(DESeq2) library(DEXSeq) library(edgeR) library(dplyr) library(GenomicAlignments) library(GenomicFeatures) ``` ## Load example data We use the same example dataset as in the aberrant analyses, and are loaded from the ASpli package. This includes an annotation file and BAM-files. ```{r} gtfFileName <- aspliExampleGTF() BAMFiles <- aspliExampleBamList() ``` Next, we extract the exonic parts from the annotation file using the `GenomicFeatures` package. ```{r} genomeTxDb <- makeTxDbFromGFF(gtfFileName) flattenedAnnotation = exonicParts(genomeTxDb, linked.to.single.gene.only=TRUE ) ``` Using `summarizeOverlaps`, we count the number of overlaps with each part of exonic parts of the annotation file. This way, we obtain a SummarizedExperiment object with read counts for each bin. These bin counts can then be further used to model differential usage/splicing. Make sure to use the correct settings for `singleEnd` and `ignore.strand`. ```{r} se = summarizeOverlaps( flattenedAnnotation, BAMFiles, singleEnd=TRUE, ignore.strand=TRUE ) ``` Now that we have a count matrix, we can further specify the known covariates, whichin this case is the condition between which we will test differential bin usage. ```{r} colData(se)$condition <- factor(c(rep(c("control", "case"), each = 6))) ``` Instead of modelling differential usage/splicing by modelling both the counts and the other counts, we rather calculate the total bin-counts per gene. This will be further used as offsets in the negative binomial regression, allowing to model the proportion of bin counts over the total counts overlapping that gene. We here aggregate the read counts per gene, although the offset can be chosen flexible depending on the analysis of interest (e.g. aggregating transcript counts per gene to model differential transcript usage). ```{r} counts <- assays(se)$counts offsetsGene <- aggregate(counts, by = list("gene" = rowData(se)$gene_id), FUN = sum) offsets <- offsetsGene[match(rowData(se)$gene_id, offsetsGene$gene), ] %>% mutate("gene" = NULL) %>% as.matrix() ``` Because currently it is not possible to include offsets that are equal to 0 in DESeq2/edgeR, we have to change these offsets. Here, we change the counts that correspond to an offset of 0 (these counts are also equal to 0 in a differential usage/splicing analysis) to 1, together with the corresponding offset. It is possible to change the offsets and/or counts to a number of choice, as long as the offsets are not equal to 0. ```{r} index <- offsets == 0 counts[index] <- 1 offsets[index] <- 1 ``` We can now perform a conventual DESeq2/edgeR analysis, using the custom offsets. For DESeq2, we include the offsets on the original scale, while for edgeR, we include these on the log-scale. For the DESeq2 analysis we can use the following code: ```{r} dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData(se), rowData = rowData(se), design= ~ condition) normalizationFactors(dds) <- offsets dds <- DESeq2::estimateDispersionsGeneEst(dds) dispersions(dds) <- mcols(dds)$dispGeneEst dds <- DESeq2::nbinomWaldTest(dds) results_dds <- DESeq2::results(dds) ``` While for the edgeR analysis we can use the following code: ```{r} DGE <- DGEList(counts = counts) DGE$offset <- log(offsets) design <- model.matrix(~condition, data = colData(se)) DGE <- edgeR::estimateDisp(DGE, design = design) fitDGE <- edgeR::glmFit(DGE, design= design) results_DGE <- edgeR::glmLRT(fitDGE, coef = 2) ``` These results now have an interpretation in terms of proportions, with the interpretation dependent on the offsets used. # Session info ```{r} sessionInfo() ``` # References