---
title: "SpotClean adjusts for spot swapping in spatial transcriptomics data"
authors: "Zijian Ni and Christina Kendziorski"
package: SpotClean
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document:
        toc: true
vignette: >
    %\VignetteIndexEntry{SpotClean}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}  
---


```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
set.seed(1)
```


# Introduction

Spatial transcriptomics (ST), named [Method of the Year 2020](https://www.nature.com/articles/s41592-020-01033-y) by *Nature Methods* in 2020, is a powerful and widely-used experimental method for profiling genome-wide gene expression across a tissue. In a typical ST experiment, fresh-frozen (or FFPE) tissue is sectioned and placed onto a slide containing spots, with each spot containing millions of capture oligonucleotides with spatial barcodes unique to that spot. The tissue is imaged, typically via Hematoxylin and Eosin (H&E) staining. Following imaging, the tissue is permeabilized to release mRNA which then binds to the capture oligonucleotides, generating a cDNA library consisting of transcripts bound by barcodes that preserve spatial information. Data from an ST experiment consists of the tissue image coupled with RNA-sequencing data collected from each spot. A first step in processing ST data is tissue detection, where spots on the slide containing tissue are distinguished from background spots without tissue. Unique molecular identifier (UMI) counts at each spot containing tissue are then used in downstream analyses.

Ideally, a gene-specific UMI at a given spot would represent expression of that gene at that spot, and spots without tissue would show no (or few) UMIs. This is not the case in practice. Messenger RNA bleed from nearby spots causes substantial contamination of UMI counts, an artifact we refer to as spot swapping. On average, we observe that more than 30% of UMIs at a tissue spot did not originate from this spot, but from other spots contaminating it. Spot swapping confounds downstream inferences including normalization, marker gene-based annotation, differential expression and cell type decomposition.

We developed **SpotClean** to adjust for the effects of spot swapping in ST experiments. SpotClean is able to measure the per-spot contamination rates in observed data and decontaminate gene expression levels, thus increases the sensitivity and precision of downstream analyses. Our package `SpotClean` is built based on 10x Visium spatial transcriptomics experiments, currently the most widely-used commercial protocol, providing functions to load raw spatial transcriptomics data from 10x Space Ranger outputs, decontaminate the spot swapping effect, estimate contamination levels, visualize expression profiles and spot labels on the slide, and connect with other widely-used packages for further analyses. SpotClean can be potentially extended to other spatial transcriptomics data as long as the gene expression data in both tissue and background regions are available. `SpotClean` is compatible with the [`SpatialExperiment`](https://bioconductor.org/packages/release/bioc/html/SpatialExperiment.html) class and supports downstream analyses via [`Seurat`](https://satijalab.org/seurat/).

As a computational tool for analyzing spatial transcriptomics data, `SpotClean` has been submitted to [Bioconductor](https://www.bioconductor.org/), making the R package well-documented and well-maintained for improved reproducibility and user experience.

# Quick Start

## Installation

Install from Bioconductor:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SpotClean")

```

Load package after installation:

```{r}
library(SpotClean)
library(S4Vectors)
```

## Short Demo

Here we quickly demonstrate the general SpotClean workflow on the built-in example data. A step-by-step illustration can be found in the next section.

Note: codes in this chunk are only for illustrative purpose and are not runnable. Runnable codes can be found in the next section.

```{r, eval=FALSE}
# Not run

# Load 10x Visium data
mbrain_raw <- read10xRaw("/path/to/matrix/folder")
mbrain_slide_info <- read10xSlide("/path/to/tissue/csv", 
                                  "/path/to/tissue/image", 
                                  "/path/to/scale/factor")

# Visualize raw data
mbrain_obj <- createSlide(count_mat = mbrain_raw, 
                          slide_info = mbrain_slide_info)
visualizeSlide(slide_obj = mbrain_obj)
visualizeHeatmap(mbrain_obj,rownames(mbrain_raw)[1])

# Decontaminate raw data
decont_obj <- spotclean(mbrain_obj)

# Visualize decontaminated gene
visualizeHeatmap(decont_obj,rownames(mbrain_raw)[1])

# Visualize the estimated per-spot contamination rate
visualizeHeatmap(decont_obj,metadata(decont_obj)$contamination_rate, 
                 logged = FALSE, legend_title = "contamination rate",
                 legend_range = c(0,1))

# (Optionally) Transform to Seurat object for downstream analyses
seurat_obj <- convertToSeurat(decont_obj,image_dir = "/path/to/spatial/folder")
```

## Working with the `SpatialExperiment` class

[`SpatialExperiment`](https://bioconductor.org/packages/release/bioc/html/SpatialExperiment.html) is another widely used S4 class for storing ST data, including the count matrix, cell and gene-level metadata, and tissue image. `spotclean()` can be directly applied to `SpatialExperiment` objects constructed from `SpatialExperiment::read10xVisium()`, which reads in the raw data from 10x Visium Space Ranger output. The visualization functions in our package currently do not support `SpatialExperiment` class, but there are alternative options such as [`ggspavis`](https://bioconductor.org/packages/release/bioc/html/ggspavis.html).

```{r, eval=F}
# Not run
library(SpatialExperiment)
slide_obj <- read10xVisium(samples = "/path/to/spaceranger/output/", 
                         data = "raw") # must specify data = "raw"
decont_obj <- spotclean(slide_obj)
str(assays(decont_obj)$decont)
```

## Running Speed

The computational speed is related to the size and structure of input datasets, mainly driven by the number of tissue spots. SpotClean does not require parallel computation, and thus does not use up too many CPU or memory resources. As a reference, SpotClean running on a medium-size dataset (around 30,000 genes and 2,000 tissue spots) under default gene filtering takes less than 15 minutes.

## Situations you should think twice about before applying SpotClean

- **Too many tissue spots (not enough background spots)** While the observed data is a single matrix with a fixed number of columns (spots), the number of unknown parameters is proportional to the number of tissue spots. In the extreme case where all spots are covered by tissue, we have more unknown parameters than observed data values. In this case the contaminated expressions are confounded with true expressions, and SpotClean estimation becomes unreliable. We recommend that the input data have at least 25% spots not occupied by tissue, so that SpotClean has enough information from background spots to estimate contamination.

- **Lowly-expressed genes** Lowly-expressed genes typically contain relatively less information and relatively more noise than highly-expressed genes. SpotClean by default only keeps highly-expressed or highly-variable genes for decontamination (or both). It can be forced to run on manually-specified lowly-expressed genes. However, even in this case, expression for the lowly-expressed genes is typically not changed very much. Given the high sparsity in most lowly expressed genes, there is not enough information available to confidently reassign UMIs in most cases. However, we do not filter genes by sparsity because there can be interesting genes highly concentrated in a small tissue region. In cases like this, SpotClean is effective at adjusting for spot swapping in these regions.  If the defaults are not appropriate, users can either adjust the expression cutoffs or manually specify genes to decontaminate.

- **Inference based on sequencing depth** SpotClean reassigns bled-out UMIs to their tissue spots of origin which changes the estimated sequencing depth of tissue spots after decontamination, since most estimations of sequencing depth rely on total expressions at every spot. As a result, decontamination can be considered as another type of normalization and might conflict with existing sequencing depth normalization methods.

## Recommended applications

SpotClean leads to improved estimates of expression by correcting for spot swapping. In other words, SpotClean reduces noise by enhancing signal in highly expressed regions and reducing measured signal in unexpressed regions. Consequently, SpotClean will improve the accuracy of marker gene-based inferences including tissue type annotation, cell type decomposition, integration with single-cell RNA-seq data, and associated downstream analyses.  SpotClean also improves the identification of spatially variable and differentially expressed (DE) genes.  We note that in some cases, the p-values associated with known DE genes may increase slightly following SpotClean due to increased variability within the spot groups (i.e. truly expressed regions become more highly expressed; and signal is removed from unexpressed regions).
 
SpotClean will not alter clusters substantially in most datasets given that clusters are largely determined by relatively few highly expressed genes. While clusters may become slightly better defined, in most cases we do not see big differences in the number of clusters and/or relationships among clusters after applying SpotClean.

# Detailed Steps

## Load count matrix and slide information from 10x Space Ranger output

Currently, the most widely-used spatial transcriptomics protocol is 10x Visium. Public 10x datasets can be found [here](https://support.10xgenomics.com/spatial-gene-expression/datasets/). In this vignette, we illustrate the usage of `SpotClean` using the [V1_Adult_Mouse_Brain](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Adult_Mouse_Brain) Visium data. 

Two parts of 10x Space Ranger output files are required as input to `SpotClean`: the raw gene-by-spot count matrix, and the slide metadata. In this example, you can download and unzip the [Feature / cell matrix (raw)](https://cf.10xgenomics.com/samples/spatial-exp/1.0.0/V1_Adult_Mouse_Brain/V1_Adult_Mouse_Brain_raw_feature_bc_matrix.tar.gz) and [Spatial imaging data](https://cf.10xgenomics.com/samples/spatial-exp/1.0.0/V1_Adult_Mouse_Brain/V1_Adult_Mouse_Brain_spatial.tar.gz) from [V1_Adult_Mouse_Brain](https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Adult_Mouse_Brain). You will get a folder `raw_feature_bc_matrix` containing `barcodes.tsv.gz`, `features.tsv.gz`, `matrix.mtx.gz`, and a folder `spatial` containing `aligned_fiducials.jpg`, `detected_tissue_image.jpg`, `tissue_hires_image.png`, `tissue_lowres_image.png`, `tissue_positions_list.csv`, `scalefactors_json.json`. The former folder contains the raw gene-by-spot count matrix, and the latter contains slide metadata like slide images and spot IDs and positions.

`SpotClean` provides functions `read10xRaw()` and `read10xSlide()` to directly read these 10x Space Ranger output files and generate a gene-by-spot count matrix and slide metadata in R. Here for the purpose of demonstration, we have provided the count matrix and slide metadata with our package. The count matrix is a built-in example object `mbrain_raw`, which is a subset of the gene-by-spot count matrix from `read10xRaw()`, containing the top 100 highest expressed genes. The slide metadata, located at `extdata/V1_Adult_Mouse_Brain_spatial`, contains the three basic files (`tissue_positions_list.csv`, `tissue_lowres_image.png`, `scalefactors_json.json`) to run SpotClean and other associated packages in this vignette. 

```{r}
# load count matrix
data(mbrain_raw)
str(mbrain_raw)

# read spatial metadata
spatial_dir <- system.file(file.path("extdata",
                                     "V1_Adult_Mouse_Brain_spatial"),
                           package = "SpotClean")
list.files(spatial_dir)

mbrain_slide_info <- read10xSlide(tissue_csv_file=file.path(spatial_dir,
                                       "tissue_positions_list.csv"),
             tissue_img_file = file.path(spatial_dir,
                                       "tissue_lowres_image.png"),
             scale_factor_file = file.path(spatial_dir,
                                       "scalefactors_json.json"))
str(mbrain_slide_info)
```

## Create the slide object

In the following, we will show the `SpotClean` workflow using the built-in example data. 

We combine count matrix and slide metadata together as one single slide object with class `SummarizedExperiment`:

```{r}
slide_obj <- createSlide(mbrain_raw, mbrain_slide_info)
slide_obj
```

As shown above, the raw count matrix is stored in the `raw` assay slot, and slide and image information are stored in `slide` and `grob` metadata slots. 

## Visualize the slide object

Our package provides multiple visualization functions in the 2-D slide space. Function `visualizeSlide()` shows the input slide imaging file (if given) in R:

```{r,fig.width=5, fig.height=5}
visualizeSlide(slide_obj)
```

Function `visualizeLabel()` shows the spot labels. You can specify the column name of character labels in the `slide` metadata, or manually provide a vector of character labels corresponding to each spot. For example, we can plot their tissue/background labels, which has been pre-stored in the input slide information:

```{r,fig.width=5, fig.height=4}
visualizeLabel(slide_obj,"tissue")
```


Function `visualizeHeatmap()` draws a heatmap of values at every spot in the 2-D slide space. Similar to `visualizeLabel()`, you can specify the column name of numerical values in the `slide` metadata, or manually provide a vector of numerical values corresponding to each spot. For example, we can plot the total UMI counts in every spot,:

```{r,fig.width=5, fig.height=4}
metadata(slide_obj)$slide$total_counts <- Matrix::colSums(mbrain_raw)
visualizeHeatmap(slide_obj,"total_counts")
```

You can also provide a certain gene name appearing in the raw count matrix in input slide object to `visualizeHeatmap()`. For example, the expression of the Mbp gene can be visualized:

```{r,fig.width=5, fig.height=4}
visualizeHeatmap(slide_obj,"Mbp")
```

`visualizeLabel()` and `visualizeHeatmap()` both support manual label/value inputs, subsetting spots to plot, title and legend name modification. `visualizeHeatmap()` also supports different color palettes (rainbow vs. viridis) and log-scaling options. These visualization functions return `ggplot2` objects, which can be further modified by users.

## Decontaminate the data

`spotclean()` is the main function for performing decontamination. It takes the slide object of raw data as input together with some parameters for controlling optimization and convergence, and returns a slide object with decontaminated gene expressions and other model-related parameters and statistics appending to the slide information. Detailed parameter explanations can be found by running `?spotclean`. Here we set `maxit=10` and `candidate_radius=20` to save computation time. In practice, `spotclean()` by default evaluates a series of candidate radii and automatically chooses the best one. The default maximum number of iterations is 30, which can be extended if convergence has not been reached.

```{r}
decont_obj <- spotclean(slide_obj, maxit=10, candidate_radius = 20)
```

Check the structure of output slide object:

```{r}
decont_obj
names(metadata(decont_obj))
```

The metadata now contains more information including parameter estimates from the SpotClean model and measurements of contamination levels. 

We can visualize the Mbp gene expressions after 10 iterations of decontamination:

```{r,fig.width=5, fig.height=4}
visualizeHeatmap(decont_obj,"Mbp")
```

## Estimate contamination levels in observed data

Our model is able to estimate the proportion of contaminated expression at each tissue spot (i.e. expression at a tissue spot that originated from a different spot due to spot swapping):

```{r}
summary(metadata(decont_obj)$contamination_rate)
```

This indicates around 30% of UMIs at a tissue spot in the observed data came from spot swapping contamination, averaging across all tissue spots.

## ARC score

We also provide another subjective estimation of contamination level, called the ambient RNA contamination (ARC) score. It can be calculated using function `arcScore()`, and is also part of the decontamination outputs. Intuitively, the ARC score is a conserved lower bound of the proportion of contamination in observed tissue spots. The ARC score can also be applied in droplet-based single-cell data to estimate ambient RNA contamination when replacing background spots with empty droplets. Details can be found by running `?arcScore`.

```{r}
arcScore(slide_obj)
```

This indicates at least 5% expressions in observed data came from spot swapping contamination.

## Convert to Seurat object for downstream analyses

`convertToSeurat()` can be used to convert our slide object to a Seurat spatial object. Note that Seurat requires input of the spatial folder. 

```{r}
seurat_obj <- convertToSeurat(decont_obj,image_dir = spatial_dir)
```

# Session Information

```{r}
sessionInfo()
```

# Citation

`SpotClean` can be cited at:

Ni, Z., Prasad, A., Chen, S. et al. SpotClean adjusts for spot swapping in spatial transcriptomics data. *Nat Commun* 13, 2971 (2022). https://doi.org/10.1038/s41467-022-30587-y