--- title: "Introduction to `tpSVG`" author: - name: Boyi Guo affiliation: "Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{intro_to_tpSVG} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.small = TRUE, # Reduce figure size fig.retina = NULL # Reduce figure size ) ``` The objective of `tpSVG` is to detect spatially variable genes (SVG) when analyzing spatially-resolved transcriptomics data. This includes both unsupervised features where there's not additional information is supplied besides the (normalized) gene counts and spatial coordinates, but also the spatial variation explained besides some covariates, such as tissue anatomy or possibly cell type composition. Compared to previous SVG detection tools, `tpSVG` provides a scalable solution to model the gene expression as counts instead of logarithm-transformed counts. While log transformation provides convenience to model the spatial gene expression by mapping count data to the continuous domain, hence enabling well-understood Gaussian models, log transformation distorts low expressed genes counts and create bias populating high-expressed genes. For example, the rank of genes based on their effect size are commonly used for dimensional reduction, or its input. Hence, estimating gene ranking correctly is very important. Gaussian models, exemplified with `nnSVG`, often dissociates the mean-variance relationship which is commonly assumed for counts data, and hence often prioritizes the highly expressed genes over the lowly expressed genes. In the figure below, we saw that `nnSVG` is susceptible to such mean-rank relationship, meaning highly expressed genes are often ranked highly. In contrast, the proposed `tpSVG` with Poisson distribution is not susceptible to this mean-rank relationship when examining the [DLPFC dataset](https://pubmed.ncbi.nlm.nih.gov/33558695/). ![Model counts data with `tpSVG` avoids mean-rank relationship.](../inst/figures/mean_rank.png) # Installation ## GitHub You can install the development version of tpSVG from [GitHub](https://github.com/boyiguo1/tpSVG) with: ```{r install_github, eval=FALSE} # install.packages("devtools") devtools::install_github("boyiguo1/tpSVG") ``` ## Bioconductor (pending) The package is currently submitted to Bioconductor for [review](https://github.com/Bioconductor/Contributions/issues/3264). Once the package is accepted by Bioconductor, you can install the latest release version of `tpSVG` from Bioconductor via the following code. Additional details are shown on the Bioconductor page. ```{r install_bioc, eval = FALSE} if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("tpSVG") ``` The latest development version can also be installed from the `devel` version of Bioconductor or from GitHub following ```{r install_bioc_dev,eval = FALSE} BiocManager::install(version='devel') ``` # Modeling spatially resolved gene expression using `tpSVG` In the following section, we demonstrate the how to use `tpSVG` with a [`SpatialExperiment`](https://bioconductor.org/packages/release/bioc/html/SpatialExperiment.html) object. ## Load Packages To run the demonstration, there a couple of necessary packages to load. We will use the data set in [`STexampleData`](https://bioconductor.org/packages/STexampleData), which contains a 10x Visium dataset. Specifically, we will find one 10xVisium sample collected from the dorso lateral prefrontal region of a postmortem human brain from [`STexampleData`](https://bioconductor.org/packages/release/data/experiment/html/STexampleData.html) package, indexed by brain number of "151673" from [Maynard, Collado-Torres _et al_. (2021)](https://www.nature.com/articles/s41593-020-00787-0). For more information, please see the vignettes of [`STexampleData`](https://bioconductor.org/packages/release/data/experiment/vignettes/STexampleData/inst/doc/STexampleData_overview.html) ```{r load_packages, message = FALSE} library(tpSVG) library(SpatialExperiment) library(STexampleData) # Example data library(scuttle) # Data preprocess ``` ## Data processing Before running the analysis using `tpSVG`, we will preprocess the data, which includes 1) calculating normalization factor, or equivalently library size; 2) down-sizing the number of genes to 6 to reduce running time. These preprocessing step may not be necessary if real world data analysis. ```{r data_preprocess, message = FALSE} spe <- Visium_humanDLPFC() spe <- spe[, colData(spe)$in_tissue == 1] spe <- logNormCounts(spe) # Normalization factor head(spe$sizeFactor) # Equivalently, library size spe$total <- counts(spe) |> colSums() # Down-sizing genes for faster computation idx <- which( rowData(spe)$gene_name %in% c("MOBP", "PCP4", "SNAP25", "HBB", "IGKC", "NPY") ) spe <- spe[idx, ] ``` ## Modeling raw counts with Poisson model The following example demonstrates how to model raw gene counts data with `tpSVG`. The model fitting is simple, following `stats::glm` syntax to follow any distributional assumption via the argument `family`. The only key point to mention here is the model needs to account for any technical variation due to the gene profiling procedure. To account for such techincal variation, we use `offset` term in the model. In the following example, we use the commonly used library size as the normalizing factor, and hence set `offset = log(spe$total)` to account for the techinical variation in the data. Equivalently, it is also possible/encouraged to use `offset = log(spe$sizeFactor)`, where `spe$sizeFactor` is calculated during `logNormCounts` and is a linear function of the library size, i.e. `spe$total`. Note: it is very important to use the log function in the `offset` making sure the data scale is conformable. ```{r pois_eg} set.seed(1) spe_poisson <- tpSVG( spe, family = poisson, assay_name = "counts", offset = log(spe$total) # Natural log library size ) rowData(spe_poisson) ``` ## Gaussian model for log-transformed normalized data `tpSVG` provides flexibility regarding the distributional assumption. If interested, it is possible to model log-transformed count data using Gaussian distribution. ```{r, eval = FALSE} spe_gauss <- tpSVG( spe, family = gaussian(), assay_name = "logcounts", offset = NULL ) ``` ## Covariate-adjusted Model It is scientifically interesting to understand if and how much additional spatial variation in gene expression when accounting some known biology. For example, if known anatomy is accounted for in the model, is there any additional spatial variation in the gene expression which can be informative to any unknown biology. Statistically, this question is known as covariate adjustment, where the known biology is quantified and accounted for in a model. To address this question, `tpSVG` allows introducing covariates in the model via the argument `X`, where `X` takes a vector of any kind, including categorical variables. The frist step is to remove any missing data in the dataset, specifically as the covariate. This can be done via `complete.cases`. ```{r} # Check missing data idx_complete_case <- complete.cases(spe$ground_truth) # If multiple covariates # idx_complete_case <- complete.cases(spe$ground_truth, spe$cell_count) # Remove missing data spe <- spe[, idx_complete_case] # Create a design matrix x <- spe$ground_truth spe_poisson_cov <- tpSVG( spe, X = x, family = poisson, assay_name = "counts", offset = log(spe$total) # Natural log library size ) ``` ### image-based SRT in `SpatialExperiment` (e.g. `SpatialFeatureExperiment`) `tpSVG` can be also used to model image-based SRT data. We use the seqFISH data from [Lohoff and Ghazanfar _et al_. (2020)](https://www.nature.com/articles/s41587-021-01006-2) to demonstrate `tpSVG`. Specifically, we use the curated example data in [`STexampleData`](https://bioconductor.org/packages/release/data/experiment/html/STexampleData.html) package. For more information, please see the vignettes of [`STexampleData`](https://bioconductor.org/packages/release/data/experiment/vignettes/STexampleData/inst/doc/STexampleData_overview.html) ```{r load_seqFISH} library(STexampleData) spe <- seqFISH_mouseEmbryo() spe ``` The example data set contains `351` genes for `11026` genes. To make the demonstration computationally feasible, we down-size the number of genes to 1. The average computation times for 11026 cells is roughly 2 minutes. ```{r model_seqFISH} # Calculate "library size" spe$total <- counts(spe) |> colSums() # Down-size genes idx_gene <- which( rowData(spe)$gene_name %in% c("Sox2") ) library(tpSVG) # Poisson model tp_spe <- tpSVG( input = spe[idx_gene,], family = poisson(), offset = log(spe$total), assay_name = "counts") rowData(tp_spe) ``` # Session Info ```{r session} sessioninfo::session_info() ```