--- title: 'Interoptability between MAST and SingleCellExperiment-derived packages.' author: Andrew McDavid date: "`r Sys.Date()`" package: MAST output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Interoptability between MAST and SingleCellExperiment-derived packages} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- # Introduction As a SingleCellExperiment-derived package, `MAST` can easily be inserted into workflows with packages such as `scran`, `scater`, `zinbwave`, `SCnorm` and others. The main gotcha is packages that assume integer counts vs log-transformed, or log-transformed, approximately scale-normalized data. We find that MAST performs best with log-transformed, scale-normalized data that has been thresholded, such as $\log_2(\text{transcripts per million} + 1)$. We address this by: - testing for log-like data for objects constructed in MAST - explicitly naming the slot of the `assay` containing such putatively log-like data - by default operating on the slot with such log-like data In objects that were constructed in other packages, we ... In what follows, we show an example of using `scater` to plot some QC metrics, `SCnorm` to normalize data, and, and conversion to a `Seurat` object. # From MAST to Scater Scater (citation) is a package that ... ```{r, results = 'hide'} library(MAST) knitr::opts_chunk$set(message = FALSE,error = FALSE,warning = FALSE) data(maits, package='MAST') unlog <- function(x) ceiling(2^x - 1) sca_raw = FromMatrix(t(maits$expressionmat), maits$cdat, maits$fdat) assays(sca_raw)$counts = unlog(assay(sca_raw)) assayNames(sca_raw) ``` Here we make an object with assays `counts` and `et`. By default, `MAST` will operate on the `et` assay, but scran wants count-like data for some of its QC. The `et` data are log2 + 1 transcripts per million (TPM), as output by RSEM. We could specify the assay name at creation with `sca_raw = FromMatrix(list(logTPM = t(maits$expressionmat)), maits$cdat, maits$fdat)` or rename an object that contains appropriately transformed data with `assayNames(sca_raw) = c('logTPM', 'counts')`. Before calling `scater` functionality, you might pause to consider if some features should belong in special `control` sets, such as mitochrondial genes, or spike-ins. ```{r scaterQC,results='hide'} library(scater) sca_raw = calculateQCMetrics(sca_raw) plotRowData(sca_raw, x = 'log10_mean_counts', 'pct_dropout_by_counts') plotColData(sca_raw, y="total_features_by_counts", x="total_counts") ``` Evidently some features were filtered, so not all cells contain 1 million counts. We can tell these were rare features based on the inverse relationship between `total_counts` and `total_features_by_counts`: the most complex libraries (with the greatest numer of features) are missing the most counts. ```{r} sca_raw <- runPCA(sca_raw, ncomponents=5, exprs_values = 'et') plotReducedDim(sca_raw, use_dimred = 'PCA', colour_by = 'condition') ``` We can also run a PCA. ## From scater to MAST ```{r} data(sc_example_counts) ``` # WIP: From MAST to ZINB-wave <!-- ```{r zinbwave} --> <!-- library(zinbwave) --> <!-- feature_var = apply(assay(sca_raw), 1, var) --> <!-- sca_top500 = sca_raw[rank(-feature_var)<=500,] --> <!-- zw <- zinbwave(Y = sca_top500, X = '~condition', which_assay = 'counts') --> <!-- ``` --> <!-- Run zinbwave. To speed things, we take the --> <!-- top 500 most variable genes. --> <!-- ```{r} --> <!-- ggpairs(data.frame(zw = reducedDim(zw), colData(zw)), columns = c('zw.PC1', 'zw.PC2', 'ngeneson'), mapping = aes(color = condition)) --> <!-- ``` --> <!-- ## Using MAST to characterizing genes that drive the factors --> <!-- ```{r, results = 'hide'} --> <!-- assays(zw)[[1]] <- log2(assay(zw)+1) --> <!-- colData(zw) <- cbind(colData(zw), zw=as.data.frame(reducedDim(zw))) --> <!-- zw <- zw %>% as('SingleCellAssay') --> <!-- zz <- zlm(~W1+W2, sca=zw) --> <!-- ``` --> <!-- ```{r, results = 'asis'} --> <!-- ss <- summary(zz) --> <!-- knitr::kable(print(ss)) --> <!-- ``` --> <!-- These are log-fold changes in the top few changes associated with factors 1 and 2. --> <!-- ```{r} --> <!-- top5 <- ss$datatable %>% filter(component=='logFC', contrast %like% 'W') %>% arrange(-abs(z)) %>% head(n=5) %>% left_join(rowData(zw) %>% as.data.table()) --> <!-- plt <- ggplot(zw[top5$primerid,] %>% as('data.table'), aes(x=W1, color = condition)) + geom_point() + facet_wrap(~symbolid) --> <!-- ``` --> <!-- ```{r} --> <!-- plt + aes(y = et) --> <!-- ``` --> <!-- Expression on "Et" scale ($\log_2( TPM + 1)$) --> <!-- ```{r} --> <!-- plt + aes(y = normalizedValues) --> <!-- ``` --> <!-- Normalized expression from factor model -->