--- title: 'Interoptability between MAST and SingleCellExperiment-derived packages.' author: Andrew McDavid date: "`r Sys.Date()`" bibliography: mast-interopt.bib package: MAST output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Interoptability between MAST and SingleCellExperiment-derived packages} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- # Introduction As a `SingleCellExperiment`-derived package, `MAST` can easily be inserted into workflows with packages such as `scran`, `scater`, `zinbwave`, `SCnorm` and others. Moreover, subclassing SingleCellExperiment/SummarizedExperiment provides a flexible abstraction for the `assay` that contains the actual expression data. It can use sparse `Matrix` and HDF5 as backends to save memory. To use MAST with such packages, you just need to upcast the `SingleCellExperiment` to MAST's subclass `SingleCellAssay` with the function `SceToSingleCellAssay` that handles the coercion and checks the object for validity. Going the other direction, generally `SingleCellAssay`s should work in packages that use `SingleCellExperiment`, but if in doubt you could down-cast with `as(sca, 'SingleCellExperiment')`. ## Log-transformation is expected in MAST The main gotcha in all this is that some SingleCellExperiment-derived packages assume integer counts have been provided, while MAST assumes that log-transformed approximately scale-normalized data is provided. We find that MAST performs best with log-transformed, scale-normalized data that has been thresholded, such as $\log_2(\text{transcripts per million} + 1)$. We address this by: - testing for log-like data for objects up-cast to `SingleCellAssay` - explicitly naming the slot of the `assay` containing such putatively log-like data - by default operating on the slot with such log-like data ## Examples In what follows, we show an example of using `scater` to plot some QC metrics, `SCnorm` to normalize data, and, and conversion to a `Seurat` object. # From MAST to Scater Scater @scater is a package that provides functions for QC, normalization and visualization of single cell RNAseq data. ```{r init, results = 'hide'} library(MAST) knitr::opts_chunk$set(message = FALSE,error = FALSE,warning = FALSE) data(maits, package='MAST') unlog <- function(x) ceiling(2^x - 1) sca_raw = FromMatrix(t(maits$expressionmat), maits$cdat, maits$fdat) assays(sca_raw)$counts = unlog(assay(sca_raw)) assayNames(sca_raw) ``` Here we make an object with assays `counts` and `et`. By default, `MAST` will operate on the `et` assay, but scran wants count-like data for some of its QC. The `et` data are log2 + 1 transcripts per million (TPM), as output by RSEM. We could specify the assay name at creation with `sca_raw = FromMatrix(list(logTPM = t(maits$expressionmat)), maits$cdat, maits$fdat)` or rename an object that contains appropriately transformed data with `assayNames(sca_raw) = c('logTPM', 'counts')`. Before calling `scater` functionality, you might pause to consider if some features should belong in special `control` sets, such as mitochrondial genes, or spike-ins. ```{r scaterQC,results='hide'} library(scater) sca_raw = addPerCellQC(sca_raw) plotColData(sca_raw, y="detected", x="total") ``` Evidently some features were filtered, so not all cells contain 1 million counts. ```{r} sca_raw = runPCA(sca_raw, ncomponents=5, exprs_values = 'et') plotReducedDim(sca_raw, dimred = 'PCA', colour_by = 'condition') ``` We can also run a PCA. ## From scater to MAST Since scater uses `SingleCellExperiment` objects, the only here consideration is making sure `MAST` can find log-like data, and possibly thresholding the data. ```{r} example_sce = mockSCE() example_sce = logNormCounts(example_sce) sca = SceToSingleCellAssay(example_sce) ``` Here we coerce `example_sce` to be a SingleCellAssay object. ```{r} zlm( ~ Treatment, sca = sca, exprs_value = 'logcounts') ``` We test for differential expression with regards to `Treatment` and explicitly indicate the `logcounts` slot will be used. Methods in MAST will operate on the default slice returned by `assay`, which has been over-ridden to return log-like data: the default slice is the first assay whose name, as given by `assayNames(x)`, matches the first element in the sequence `c('thresh', 'et', 'Et', 'lCount', 'logTPM', 'logCounts', 'logcounts')`. So in the case of `sca`, even if `exprs_value` was not specified, the `logcounts` slot would have been used, even though it comes second in `assayNames(sca)`: ```{r} assayNames(sca) ``` ## Sparse matrix and HDF5 support ```{r} library(Matrix) sca_sparse = FromMatrix( exprsArray = list(et = Matrix(t(maits$expressionmat), sparse = TRUE)), cData = maits$cdat, fData = maits$fdat) class(assay(sca_sparse)) regular_time = system.time(zlm( ~ condition, sca = sca_raw[1:100,])) sparse_time = system.time(zlm( ~ condition, sca = sca_sparse[1:100,])) ``` There is no complication to providing a sparse matrix. ```{r} library(DelayedArray) library(HDF5Array) hd5_dat = as(t(maits$expressionmat), 'HDF5Array') DelayedArray::seed(hd5_dat) ``` Write `sc_example_counts` to disk as an `HDF5Array` ```{r} sca_delay = FromMatrix( exprsArray = list(et = hd5_dat), cData = maits$cdat, fData = maits$fdat) class(assay(sca_delay)) hd5_time = system.time(zlm( ~ condition, sca = sca_delay[1:100,])) ``` Nor is there a complication to using HDF5-backed stores. ```{r} knitr::kable(data.frame(method = c('Dense', 'Sparse', 'HDF5'), 'user time(s)' =c( regular_time[1], sparse_time[1], hd5_time[1]), check.names = FALSE)) ``` Dense storage is generally fastest, followed by the sparse storage. HDF5 is often slowest, but if your data doesn't fit in memory, you don't really have any other choice. The linear models underlying MAST don't have any special provision for big $n$ data, and will tend to linearly (or worse) in the number of cells. So performance may be an issue even if they data do fit in memory. # MAST and ZINB-wave ```{r zinbwave} library(zinbwave) feature_var = apply(assay(sca_raw), 1, var) sca_top500 = sca_raw[rank(-feature_var)<=500,] zw = zinbwave(Y = sca_top500, X = '~1', which_assay = 'counts', K = 2, normalizedValues = TRUE) ``` Run zinbwave. To speed things, we take the top 500 most variable genes. ```{r, message=FALSE} rd = data.frame(reducedDim(zw, 'PCA'), reducedDim(zw, 'zinbwave'), colData(zw)) GGally::ggpairs(rd, columns = c('PC1', 'PC2', 'W1', 'W2'), mapping = aes(color = condition)) ``` ## Using MAST to characterizing genes that drive the factors ```{r, results = 'hide'} colData(zw) = cbind(colData(zw), reducedDim(zw, 'zinbwave')) zw = SceToSingleCellAssay(zw) zz = zlm(~W1 + W2, sca = zw, exprs_values = 'et') ``` ```{r, results = 'asis'} ss = summary(zz) knitr::kable(print(ss)) ``` These are log-fold changes in the top few changes associated with factors 1 and 2. ```{r} library(dplyr) library(data.table) top5 = ss$datatable %>% filter(component=='logFC', contrast %like% 'W') %>% arrange(-abs(z)) %>% head(n=5) %>% left_join(rowData(zw) %>% as.data.table()) dat = zw[top5$primerid,] %>% as('data.table') dat = dat[,!duplicated(colnames(dat)),with = FALSE] plt = ggplot(dat, aes(x=W1, color = condition)) + geom_point() + facet_wrap(~symbolid) ``` ```{r} plt + aes(y = et) ``` Expression on "Et" scale ($\log_2( TPM + 1)$) ```{r} plt + aes(y = normalizedValues) ``` Normalized expression from factor model # References