---
title: 'Main vignette: Aberrant expression and splicing analysis'
author:
- name: Alexandre Segers
- name: Jeroen Gilis
- name: Mattias Van Heetvelde
- name: Elfride De Baere
- name: Lieven Clement
bibliography: saseR.bib
date: "23/08/2023"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
  BiocStyle::pdf_document: default
package: saseR
abstract: |
  Main vignette for the saseR package. This vignette aims to provide a detailed 
  description of an aberrant expression and splicing analysis using saseR. 
  Also, we show how to perform differential usage analysis using adapted 
  offsets in DESeq2 and edgeR.
vignette: >
  %\VignetteIndexEntry{Main vignette: saseR analyses}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
library(knitr)
```

# Introduction

`saseR` is a R package to perform aberrant expression and splicing analyses 
in bulk RNA-seq datasets. It uses adapted offsets to model aberrant splicing
with the negative binomial framework, and estimates the parameters on a 
scalable and efficient manner. In this vignette, we first show how the 
different functions of the `saseR` package can be used, after which we show 
how to perform `differential usage` analysis with adapted offsets in 
conventional bulk RNA-seq methods DESeq2 and edgeR.

All details about the `saseR` model, statistical tests, the use of adapted 
offsets to model proportions and the adapted parameter estimation are 
described in our preprint [@Segers2023].

In this vignette, we analyse the data from the `ASpli` package [@ASpli],
to show how a workflow for aberrant expression and splicing is done. 
However, we note that this dataset is not specific for aberrant expression or 
splicing analyses. The data consists of BAM-files which will be converted into 
gene, bin,and junction reads, on which three different `saseR` analysis will 
be performed (aberrant expression and two times an aberrant splicing 
analysis). Next, we show how a differential usage/splicing analysis can be 
done using DESeq2 [@Love] and edgeR [@Robinson] with adapted offsets.


# Aberrant expression and splicing workflow

The package has two different categories of functions,
with respectively two and three functions.

First, BAM-files need to be transformed to gene, bin or junctions counts. This
is done by:

1. `BamtoAspliCounts`, which transforms BAM-files to different counts using 
adapted functions of the `ASpli` package. The different counts are given in 
a `SummarizedExperiment` object in the metadata slots. 

2. `convertAspli` selects the relevant counts (gene, bins or junctions) to 
be further used in the analyses.

Then, the chosen counts can be further used to perform an aberrant expression
or splicing analysis with the following functions:

3. `calculateOffsets`, which calculates the offset matrix used in the analysis.
These offsets can be ordinary `edgeR` or `DESeq2` offsets when performing an
aberrant expression analysis, or can be the total bin or junction counts 
aggregated per gene when performing an aberrant splicing analysis. 

4. `saseRfindEncodingDim` estimates the optimal number of latent factors to
control for unknown confounders. This number of latent factors is then further
used in the final fit.

5. `saseRfit` calculates the mean expression and feature-specific dispersions 
for each feature in each sample, controlling for both known and unknown 
confounders. It calculates p-values which represent how extreme each 
observation is compared to its estimated distribution. 

Note that one can do a saseR analysis by using readily available count 
matrices, therefore not requiring to start from BAM-files.


## Package installation
`saseR` can be installed from Bioconductor with:

```{r, eval=FALSE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("saseR")
```

Alternatively, the development version of `saseR` can be downloaded with:
 
```{r, eval=FALSE}
library(devtools)
devtools::install_github("statOmics/saseR")
```


## Load libraries

```{r, message=FALSE, warning=FALSE}
library(saseR)
suppressWarnings(library(ASpli))
library(edgeR)
library(DESeq2)
library(limma)
library(GenomicFeatures)
library(dplyr)
library(pracma)
library(PRROC)
library(BiocParallel)
library(data.table)
library(igraph)
```

## Transforming BAM to count matrices.

The following data corresponds to the vignette data from the `ASpli`
package [@ASpli].

The input data required for saseR are BAM-files with corresponding BAI files 
and a genome annotation in TxDb format. These BAM-files will be transformed to 
gene/bin/junction count matrices. When a count matrix is readily available,
one can just use this to perform an aberrant expression or splicing analysis
(see section 5 and 6) .


First, we specify the path to our GTF file, which we will later transform to
the TxDb format. Also, we include the pathnames of the BAM-files we want to
include in the analysis.

```{r}
gtfFileName <- ASpli::aspliExampleGTF()
BAMFiles <- ASpli::aspliExampleBamList()

gtfFileName
head(BAMFiles)
```

Then, we define a targets dataframe, which contains the sample names, BAM-file 
pathsand experimental factors column. For the case of rare Mendelian disorders 
and outlier detection, we choose
these experimental factors to be all equal to A. Note that the experimental 
factors included here will not further be used in the final analysis, but are
needed to compute the count matrices with functions from `ASpli`. 

```{r}
targets <- data.frame( 
  row.names = paste0('Sample',c(1:12)),
  bam = BAMFiles,
  f1 = rep("A",12),
  stringsAsFactors = FALSE)

head(targets)
```
Next, we transform our GTF file to TxDb format, and extract the features from
the annotation using the `binGenome` function from `saseR`, which is an adapted copy from `ASpli` as this function is deprecated. Note that the 
binGenome function can take a while.

```{r}
genomeTxDb <- txdbmaker::makeTxDbFromGFF(gtfFileName)
features <- binGenome(genomeTxDb, logTo = NULL)
```

Now that we have the `targets` data.frame containing the samples and BAM-file 
names, and the `features` extracted from the genome annotation, we can obtain
the count matrices that will be used to detect aberrant expression. This can 
done with `BamtoAspliCounts`. Note that the counting procedure (featurecounts) 
can take a while. If only one type of analysis is preferred (e.g. only aberrant
expression analysis on the gene counts), one can do the feature counting with a
different method, and start from a count matrix at section 
5 and 6. Make sure to
select the right type of `libType` (`SE` for single-end reads, `PE` for 
paired-end reads), and the `minReadLength`, which is the minimal number of
bases a read needs to be counted.

```{r}
ASpliSE <- BamtoAspliCounts(
                    features = features,
                    targets = targets,
                    minReadLength = 100,
                    libType = "SE",
                    BPPARAM = MulticoreParam(1L)
                    )
```

ASpliSE is a SummarizedExperiment object that contains the gene, bin and 
junction counts, together with different annotations in the metadata slots. We
will extract these counts using the `convertASpli` function. 

```{r}
SEgenes <- convertASpli(ASpliSE, type = "gene")
SEbins <- convertASpli(ASpliSE, type = "bin")
SEjunctions <- convertASpli(ASpliSE, type = "junction")
```


## Aberrant expression analysis
We will show how to perform an aberrant expression analysis based on the gene
level counts. Note that other feature counts (e.g. bin counts) can also be used
to search for aberrant expression. We start from a SummarizedExperiment that
contains the feature counts of interest.

First, a design formula is included in the metadata slot `design`. In this 
example, we do not include known covariates, but only introduce an intercept.
If one includes known covariates in the formula, these covariates should be 
included in the colData slot. 

```{r}
metadata(SEgenes)$design <- ~1
```

For an aberrant expression analysis, we first retain only the genes that have
sufficiently large counts. 
Then, we need to calculate the offsets for our analysis. For aberrant 
expression analysis, we recommend using `TMM` or `geommean` for edgeR and 
DESeq2 normalisation respectively. This results in an offsets matrix in the 
assays slot. 

Note that for aberrant splicing detection, first offsets are calculated before
filtering features that have low counts.


```{r}
filtergenes <- filterByExpr(SEgenes)
SEgenes <- SEgenes[filtergenes,]
```


```{r}
SEgenes <- calculateOffsets(SEgenes, method = "TMM")
```

Having defined the `design` and the `offsets` of the analysis, we can continue
to estimate the optimal `number of latent confounders` included in the 
regression framework. This is often needed when performing aberrant expression 
or splicing analysis to improve power. We obtain the latent factors by using 
`RUV` [@RUV], which performs a 
`singular value decomposition on the deviance residuals`. Then, 
we recommend using the Gavish and Donoho [@GavishDonoho] threshold for singular 
values (`method = GD`) to choose upon the number of latent factors included in
the final regression model. This is much faster, while having similar 
performance compared to using a denoising autoencoder (`method = DAE`). 

```{r}
SEgenes <- saseRfindEncodingDim(SEgenes, method = "GD")
```

Having estimated the optimal number of latent factors to include in the 
regression framework, we can fit our final model, controlling for both known
confounders (specified in metadata slot `design`) and the latent factors 
based on the deviance residuals. The final regression fit can be done with 
edgeR (`fit = edgeR`), but is set to default to our fast parameter estimation
(`fit = fast`). This fast parameter estimation uses a overdispersed quadratic
mean-variance relationship, which enables matrix multiplication while keeping
parameter estimation unbiased. This is often required in the context of rare 
diseases that need many latent factors to control for, which leads to ordinary
fitting procedures such as DESeq2 and edgeR becoming slow. If the number of 
confounders included in the regression model is small (known confounders + 
latent factors), edgeR parameter estimation could be used, although it is not 
expected to give an increase in performance. `analysis` is specified to `AE` 
to choose an aberrant expression analysis.


```{r}
SEgenes <- saseRfit(SEgenes,
                    analysis = "AE",
                    padjust = "BH",
                    fit = "fast")
```

This results in estimates in mean expression, feature specific dispersions
and p-values calculated for each observation in each sample. Further analysing
the top prioritised genes within each sample is possible using for example:

```{r}
order <- apply(assays(SEgenes)$pValue, 2, order, decreasing = FALSE)

topPvalsGenes <- sapply(1:ncol(SEgenes), 
                   function(x) {assays(SEgenes)$pValue[order[1:5,x],x]})
rownames(topPvalsGenes) <- NULL

topGenes <- sapply(1:ncol(SEgenes), 
                   function(x) {rownames(SEgenes)[order[1:5,x]]})

significantgenes <- sapply(1:ncol(SEgenes), 
              function(x) {
                  rownames(SEgenes)[assays(SEgenes)$pValueAdjust[,x] < 0.05]})


head(topPvalsGenes)
head(topGenes)
head(significantgenes)
```

## Aberrant splicing analysis

We will show how to perform an aberrant splicing analysis based on the bin and
junction level counts. Note that other feature counts can also be used if 
proportions of interest are to be modelled, which only requires the use of the
correct offsets. We start from a SummarizedExperiment that contains the 
feature counts of interest.

First, a design formula is included in the metadata slot `design`. In this 
example, we do not include known covariates, but only introduce an intercept.
If one includes known covariates in the formula, these covariates should be 
included in the colData slot. 

```{r}
metadata(SEbins)$design <- ~1
metadata(SEjunctions)$design <- ~1

```

Then, we need to calculate the offsets for our analysis. For an aberrant 
splicing analysis, `AS` or `proportion` should be used as method to calculate 
offsets that allow to model proportions. This results in an offsets matrix in 
the assays slot. We specificy the column in the rowData slot which is used
to aggregate feature counts to obtain proper offsets (`aggregation`). This is 
the `locus` column for the bins analysis, and the `symbol` or `ASpliCluster` 
column for a junctions analysis. For the junctions analysis, it is also 
possible to use aggregation based on `symbol` (= gene) whenever an annotated
gene is available, and switch to `ASpliCluster` when the junction is not linked
to a specific gene. This is done by setting `mergeGeneASpli` equal to TRUE. 
Note that a new column in the rowData will be made, `locus`, which specifies
the used aggregation. 

Note that also other rowData columns can be used to aggregate feature counts, 
and also a predefined offsets matrix can be included as `offsets` in the
assays slot.


```{r}
SEbins <- calculateOffsets(SEbins, 
                            method = "AS", 
                            aggregation = "locus")

SEjunctions <- calculateOffsets(SEjunctions, 
                            method = "AS", 
                            aggregation = "symbol",
                            mergeGeneASpli = TRUE)

```

Then, after calculating the offsets for the aberrant splicing analysis, we
filter the features that have very low counts. Note that for aberrant 
expression analyses, first the features are filtered, followed by calculating 
the offsets.

```{r}
filterbins <- filterByExpr(SEbins)
SEbins <- SEbins[filterbins,]

filterjunctions <- filterByExpr(SEjunctions)
SEjunctions <- SEjunctions[filterjunctions,]
```

Having defined the `design` and the `offsets` of the analysis, we can continue
to estimate the optimal `number of latent confounders` included in the 
regression framework. This is often needed when performing aberrant expression 
or splicing analysis to obtain better power. We obtain the latent factors by 
using `RUV` [@RUV], which performs a 
`singular value decomposition on the deviance residuals`. Then, we recommend 
using the Gavish and Donoho [@GavishDonoho] threshold for 
singular values (`method = GD`) to choose upon the number of
latent factors included in the final regression model. This is much faster, 
while having similar performance compared to using a denoising autoencoder 
(`method = DAE`). 

```{r}
SEbins <- saseRfindEncodingDim(SEbins, method = "GD")
SEjunctions <- saseRfindEncodingDim(SEjunctions, method = "GD")

```

Having estimated the optimal number of latent factors to include in the 
regression framework, we can fit our final model, controlling for both known
confounders (specified in metadata slot `design`) and the latent factors 
based on the deviance residuals. The final regression fit can be done with 
edgeR (`fit = edgeR`), but is set to default to our fast parameter estimation
(`fit = fast`). This fast parameter estimation uses a overdispersed quadratic
mean-variance relationship, which enables matrix multiplication while keeping
parameter estimation unbiased.This is often required in the context of rare 
diseases that need many latent factors to control for, which leads to ordinary
fitting procedures such as edgeR becoming slow. If the number of confounders
included in the regression model is small (known confounders + latent factors),
edgeR parameter estimation could be used, although it is not expected to give 
an increase in performance. `analysis` is specified on `AS` to choose an 
aberrant splicing analysis.


```{r}
SEbins <- saseRfit(SEbins,
                    analysis = "AS",
                    padjust = "BH",
                    fit = "fast")

SEjunctions <- saseRfit(SEjunctions,
                    analysis = "AS",
                    padjust = "BH",
                    fit = "fast")
```

This results in estimates in mean expression, feature specific dispersions
and p-values calculated for each observation in each sample. Further analysing
the top prioritised features within each sample is possible using for example:

```{r}
order <- apply(assays(SEbins)$pValue, 2, order, decreasing = FALSE)

topPvalsBins <- sapply(1:ncol(SEbins), 
                   function(x) {assays(SEbins)$pValue[order[1:5,x],x]})

rownames(topPvalsBins) <- NULL

topBins <- sapply(1:ncol(SEbins), 
                   function(x) {rownames(SEbins)[order[1:5,x]]})

significantBins <- sapply(1:ncol(SEbins), 
                   function(x) {
                    rownames(SEbins)[assays(SEgenes)$pValueAdjust[,x] < 0.05]})

head(topPvalsBins)
head(topBins)
head(significantBins)
```

Furthermore, also aggregated p-values are calculated based on the newly formed 
`locus` column in the rowData slot. These are currently the minimal p-value
of each feature belonging to the same aggregation group, and can be found in
the metadata slot under `pValuesLocus`.

```{r}
order <- apply(metadata(SEbins)$pValuesLocus, 2, order, decreasing = FALSE)

topPvalsBinsAggregated <- sapply(1:ncol(SEbins), 
                   function(x) {metadata(SEbins)$pValuesLocus[order[1:5,x],x]})
rownames(topPvalsBinsAggregated) <- NULL

topBinsAggregated <- sapply(1:ncol(SEbins), 
                   function(x) {
                     rownames(metadata(SEbins)$pValuesLocus)[order[1:5,x]]})

significantBinsAggregated <- sapply(1:ncol(SEbins), 
              function(x) {
                  rownames(SEbins)[metadata(SEgenes)$pValuesLocus[,x] < 0.05]})

head(topPvalsBinsAggregated)
head(topBinsAggregated)
head(significantBinsAggregated)
```


# Differential usage/splicing using adapted offsets in DESeq2 and edgeR

Here, we show how to perform a differential usage/splicing analysis, using
adapted offsets in conventual bulk RNA-seq tools DESeq2/edgeR. This way, we 
avoid modelling both the counts and the other counts, which means that no
sample-specific regression coefficients are estimated. Note that these steps
are similar compared to a conventual DESeq2/edgeR analysis, where different
offsets or normalisation is used. We refer to the DESeq2/edgeR vignettes for
more information upon their package and functions. Also, we refer to the DEXSeq 
[@Anders] for more information upon the preprocessing of the annotation.

## Load libraries

```{r, message=FALSE, warning=FALSE}
library(ASpli)
library(DESeq2)
library(DEXSeq)
library(edgeR)
library(dplyr)
library(GenomicAlignments)
library(GenomicFeatures)
```

## Load example data

We use the same example dataset as in the aberrant analyses, and are loaded
from the ASpli package. This includes an annotation file and BAM-files.

```{r}
gtfFileName <- aspliExampleGTF()
BAMFiles <- aspliExampleBamList()
```

Next, we extract the exonic parts from the annotation file using the 
`GenomicFeatures` package. 

```{r}
genomeTxDb <- makeTxDbFromGFF(gtfFileName)
flattenedAnnotation = exonicParts(genomeTxDb, 
                                  linked.to.single.gene.only=TRUE )
```

Using `summarizeOverlaps`, we count the number of overlaps with each part of 
exonic parts of the annotation file. This way, we obtain a SummarizedExperiment
object with read counts for each bin. These bin counts can then be further used 
to model differential usage/splicing. Make sure to use the correct settings
for `singleEnd` and `ignore.strand`.

```{r}
se = summarizeOverlaps(
    flattenedAnnotation,
    BAMFiles, 
    singleEnd=TRUE, 
    ignore.strand=TRUE )
```

Now that we have a count matrix, we can further specify the known covariates, 
whichin this case is the condition between which we will test differential 
bin usage.

```{r}
colData(se)$condition <- factor(c(rep(c("control", "case"), each = 6)))

```

Instead of modelling differential usage/splicing by modelling both the counts
and the other counts, we rather calculate the total bin-counts per gene. This
will be further used as offsets in the negative binomial regression, allowing
to model the proportion of bin counts over the total counts overlapping that 
gene. We here aggregate the read counts per gene, although the offset can be 
chosen flexible depending on the analysis of interest (e.g. aggregating 
transcript counts per gene to model differential transcript usage).

```{r}
counts <- assays(se)$counts
offsetsGene <- aggregate(counts,
                     by = list("gene" = rowData(se)$gene_id), FUN = sum) 
offsets <- offsetsGene[match(rowData(se)$gene_id, offsetsGene$gene), ] %>% 
                     mutate("gene" = NULL) %>% 
                     as.matrix() 
```

Because currently it is not possible to include offsets that are equal to 0 in
DESeq2/edgeR, we have to change these offsets. Here, we change the counts that
correspond to an offset of 0 (these counts are also equal to 0 in a 
differential usage/splicing analysis) to 1, together with the corresponding 
offset. It is possible to change the offsets and/or counts to a number of 
choice, as long as the offsets are not equal to 0.

```{r}
index <- offsets == 0
counts[index] <- 1
offsets[index] <- 1

```

We can now perform a conventual DESeq2/edgeR analysis, using the custom 
offsets. For DESeq2, we include the offsets on the original scale, while for 
edgeR, we include these on the log-scale. For the DESeq2 analysis we can use 
the following code:

```{r}
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData(se),
                              rowData = rowData(se),
                              design= ~ condition)
normalizationFactors(dds) <- offsets

dds <- DESeq2::estimateDispersionsGeneEst(dds)
dispersions(dds) <- mcols(dds)$dispGeneEst
dds <- DESeq2::nbinomWaldTest(dds)
results_dds <- DESeq2::results(dds)


```

While for the edgeR analysis we can use the following code:


```{r}
DGE <- DGEList(counts = counts)
DGE$offset <- log(offsets)
design <- model.matrix(~condition, data = colData(se))

DGE <- edgeR::estimateDisp(DGE, design = design)
fitDGE <- edgeR::glmFit(DGE, design= design)
results_DGE <- edgeR::glmLRT(fitDGE, coef = 2)

```

These results now have an interpretation in terms of proportions, with the 
interpretation dependent on the offsets used.

# Session info

```{r}
sessionInfo()
```

# References