---
title: "stPipe: A flexible and streamlined pipeline for processing sequencing-based spatial transcriptomics data"
author: "Yang Xu"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
runtime: shiny
vignette: >
  %\VignetteIndexEntry{stPipe: A flexible and streamlined pipeline for processing sequencing-based spatial transcriptomics data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction to `stPipe` package
`stPipe` is a package designed to process sequencing-based spatial transcriptomics (sST) data generated by different platforms, including 10X Visium, BGI Stereo-seq, and Slide-seq/Curio-seeker. `stPipe` consists of two major components. The first is data pre-processing with paired-end FASTQ files as input and a gene count matrix with matched spatial information as output. The second component takes the pre-processing results as input and performs quality control, downstream object construction and generates an HTML report for overall summary.

## Platform introduction 1: 10X Visium
10X Genomics' Visium platform is one of the most widely adopted methods for sST, offering high-throughput capabilities and compatibility with diverse tissue types. Visium is developed based on the original spatial transcriptomics technology via positioning histological sections on arrayed reverse transcription primers with unique positional barcodes. Visium supports both fresh frozen (FF) and formalin-fixed, paraffin-embedded (FFPE) tissues. Tissue processing can be performed manually or via the automated CytAssist system. Two distinct mRNA capture methodologies are available: probe-based and poly-A-based methods. Visium offers two capture area configurations: 6.5 x 6.5 mm area comprising 4,992 spots and an 11 x 11 mm area containing 14,336 spots. Each spot is 55 µm in diameter, with a center-to-center distance of 100 µm. Recent available Visium HD provides an enhanced spatial resolution of 2 µm without any gaps between.

## Platform introduction 2: Slide-seq / Curio-seeker
Slide-seq uses a method whereby RNA from tissue sections is transferred onto a surface embedded with DNA-barcoded 10 µm beads. The random distribution of beads across the surface results in variable distances between barcodes. Notably, the spatial barcode in Slide-seq is divided into two segments by a linker sequence. Slide-seq V2 modifies the library generation, bead synthesis, and array indexing steps to improve the mRNA capture sensitivity by approximately 10-fold over the original Slide-seq V1 method. Slide-seq has been commercialized as Curio Seeker by Curio Bioscience. 

## Platform introduction 3: BGI Stereo-seq
Based on DNA nanoball (DNB) technology, BGI Stereo-seq (SpaTial Enhanced REsolution Omics-sequencing) resolves the sequencing data via 1 x 1 cm RNA capture unit on the STOmics-GeneExpression Chip. The original DNA nanoball sequencing amplifies small fragments of genomic DNA into DNA nanoballs via rolling circle replication with fluorescent nucleotides binding to complementary nucleotides that are polymerized to anchor sequences bound to known sequences on the DNA template. In the refined Stereo-seq protocol, DNBs are generated by rolling circle amplification of circularized DNA oligonucleotides, forming spherical structures. Each DNB carries a 25-nt coordinate identity (CID) used to identify its spatial position during RNA capture and sequencing. Then the spatial location of each DNB is recorded with diameter of 0.2um and a center-to-center distance of 0.5um which allows for sub-cellular single-cell resolution data. 

## Table Summary for mainstream sST platforms
```{r, summary, echo=FALSE}
table_data <- data.frame(
  Protocol = c("Visium", "Visium HD", "Slide-seq/Curio-seeker", "Stereo-seq", "Curio-seeker"),
  Resolution_micron = c(50, 2, 10, 10, 10),
  Spot_distance_micron = c(100, "0 (no gaps)", 10, 0.5, 10),
  Typical_array_size = c("0.65cm x 0.65cm / 1.1cm x 1.1cm", 
                         "0.65cm x 0.65cm / 1.1cm x 1.1cm", 
                         "/", 
                         "1cm x 1cm; (max: 13.2cm x 13.2cm)", 
                         "0.3cm x 0.3cm / 1cm x 1cm")
)

knitr::kable(table_data, format = "markdown", col.names = c("Protocol", "Resolution (micron)", "Spot distance (micron)", "Typical array size"))
```
# figure
## Case Study 1: Preprocessing 10X Visium data
The sample data used for this study is from a FFPE mouse spleen performed using 10x Visium's probe based assay. It has been downsampled for computing efficiency.
All files can be found in the `extdata` folder of the `stPipe` package. The original data comes from the `SpatialBenchVisium` project and is available under GEO accession number GSE254652, detailed in [Spotlight on 10x Visium: a multi-sample protocol comparison of spatial technologies](https://www.biorxiv.org/content/10.1101/2024.03.13.584910v1.abstract). 

## Getting started

We begin by filling in specific details of this particular sample in the config file. Note it is important to read the explanation inside the sample config file as we will be using it for the majority of `stPipe` implemented functions.

`stPipe` implements different strategies to cope with corresponding platform features, hence the config file needs to be filled in correctly before running `stPipe`. This will be further explained in the following sections.

```{r, setup, message=FALSE}
library(stPipe)
config <- system.file("config", "config_stPipe.yml", package = "stPipe")
yaml_content <- readLines(config)
cat(yaml_content, sep = "\n")
```

## Organising config file & data for `stPipe::Run_ST`

The implemented `Run_ST` function streamlines multiple steps into one cohesive process, including FASTQ or BAM file reformatting where barcodes and UMI information are extracted and incorporated into the read headers, read alignment, barcode demultiplexing, and gene counting. `stPipe` performs different strategies for probe-based and polyA-based protocols, the former one utilizes probe set version to construct annotation files such as fa and GFF3 files while the latter one relies the annotation files specified by the user. 

The read structure for 10X Visium has the 8 bp long spatial barcode starting at position 6 and the 6 bp long UMI sequence starting at the first position. So the read structure will be : `list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0, ul=6)`. `bs1=-1, bl1=0` means we don't have an index in read 1 so we set a negative value as its start position and give it zero length. `bs2=6, bl2=8` means we have an index in read two which starts at position 6 in the read and is 8 bases in length. `us=0, ul=6` means we have a UMI at the start of read 2 which is 6 bases long. **NOTE**: we use a zero based index system, so the indexing of the sequence starts at zero.

We begin by creating a temporary folder to download the demo data.

```{r, input}
# FASTQ file path:
data_directory <- file.path(tempdir(), "stPipe_demo_data")
if (!dir.exists(data_directory)) {
  dir.create(data_directory, recursive = TRUE)
}
options(timeout = 600)
R1_url <- "https://zenodo.org/records/14920583/files/stPipe_demo_R1.fq.gz?download=1"
R2_url <- "https://zenodo.org/records/14920583/files/stPipe_demo_R2.fq.gz?download=1"
R1_path <- file.path(data_directory, "stPipe_demo_R1.fq.gz")
R2_path <- file.path(data_directory, "stPipe_demo_R2.fq.gz")
download.file(R1_url, R1_path)
download.file(R2_url, R2_path)
config <- system.file("config", "config_stPipe.yml", package = "stPipe")
config <- yaml::read_yaml(config)
config$data_directory <- data_directory
```

We then create a temporary directory to store results before we're ready to run Run_ST and adjust some config information for this sample. **NOTE**: In most functions there will be a flag called `show.config` which simply prints all config information and helps you make sure that everything is configured correctly and reduces the risk of encountering issues later.

```{r, yaml, eval=TRUE}
output_directory <- file.path(tempdir(), "stPipe_output")
if (!dir.exists(output_directory)) {
  dir.create(output_directory, recursive = TRUE)
}
config$output_directory <- output_directory
demo_config_path <- tempfile(fileext = ".yml")
yaml::write_yaml(config, demo_config_path)
```

Now `stPipe` is ready to run.

```{r, Run_ST,eval=TRUE}
Run_ST(config = demo_config_path, show.config = FALSE)
```

## Organising config file for `stPipe::Run_Loc_Match`

The `Run_Loc_Match` function is responsible for pairing the gene expression matrix with match spatial locations. The key input in the config file is **technology_version** which specifies the used platform. In further segmentation, each platform has its own argument: Visium needs **visium_coordinates** to match internally stored spatial coordinates system and **image_path** is optional for pixel computation; Slide-seq/Curio-seeker needs **bead_location** which should be a `.csv` file with three columns, where the first column has the spatial barcode sequence, the second column contains the X coordinate, and the third column contains the Y coordinate to match bead location across the spatial dimension. Stereo-seq only requires **technology_version**.

```{r, Run_Loc_Match, eval=TRUE}
st.result <- Run_Loc_Match(config = demo_config_path, pixel = FALSE, show.config = FALSE)
head(st.result$matched_data)
```

## Organising config file for `stPipe::Run_QC`

At this stage, the pre-processing work is almost done. But under specific case, you might want to do some QC filtering, for example, to clean up background noise in 10X Visium scenario. The `Run_QC` function helps with this purpose.

```{r, qc, eval=TRUE}
# adjust some parameter and update the config file
config$qc_per <- c("0.4_0.6")
yaml::write_yaml(config, demo_config_path)
```

Under this filtering criteria, more than 65% of the original spots are lost which might be concerning.

```{r, Run_QC, eval=TRUE, warning=FALSE}
qc.results <- Run_QC(config = demo_config_path, matched.data = st.result$matched_data, gene.matrix = st.result$gene_count, show.config = FALSE)
dim(qc.results$filtered_spatial_coords)[1]/dim(st.result$matched_data)[1]
```

## Organising config file for `stPipe::Run_Clustering`

Once you are satisfied with the data, the `Run_Clustering` function can be utilized to perform some preliminary clustering analysis, including UMAP and T-SNE. The clustering function itself prints message every 50 iteration, to avoid long printout only 15 are shown.

```{r, Run_Clustering, eval=TRUE}
output <- capture.output(
  Cluster.results <- Run_Clustering(
    gene.count = st.result$gene_count,
    matched.data = st.result$matched_data,
    num_clusters = 5
  )
)
cat(output[seq_along(1:15)], sep = "\n")
```

## Organising config file for `stPipe::Run_Interactive`

Previously the `Run_QC` function serves as a step to get rid of spatial locations with low quality reads, but sometimes you might want to select a specific spatial region and perform further analysis. In this case, consider to use the `Run_Interactive` function which provides you with an interactive R shiny interface for personalized selection.

```{r, Run_Interactive, eval=TRUE}
 Run_Interactive(
   matched_data = st.result$matched_data,
   clustering_result = Cluster.results$tsne_df,
   background_img = NULL,
   reduction_method = "tsne",
   point_size = 1
)
```

## Organising config file for `stPipe::Run_Visualization`

`Run_Visualization` is the step where you want to check how the data looks like after a series of pre-processing steps. This function contains two flags for visualization, `Vis.spatial` and `Vis.read` corresponds to spatial visualization and sequencing visualization respectively. The former one contains UMI heatmap in raw value and log format. The latter includes barcode demultiplexing statistics, distribution of UMI duplication numbers, and alignment statistics.

```{r, Run_Visualization, eval=TRUE}
Vis.results <- Run_Visualization(
  matched.data = st.result$matched_data,
  config = demo_config_path,
  Vis.spatial = TRUE,
  Vis.read = TRUE,
  show.config = FALSE
)
```

***Barcode demultiplexing statistics*** this can suggest the percentage of reads that uniquely match to the spatial barcodes, as well as the unmatched proportion of reads and their alignment rates to introns and exons.

```{r, demultiplex_plot, eval=TRUE, fig.height=7, fig.width=7}
Vis.results$demultiplex_plot
```

***Mapping statistics***  this plot shows the mapping quality of sample. In this demo case, the 10X Visium data is generated as probe-based, hence either mapped to exon or unaligned will be observed. For polyA-based protocols, mixed variable should be expected. 

```{r, mapping_plot, eval=TRUE, fig.height=7, fig.width=7}
Vis.results$mapping_plot
```

***Spatial UMI Plot*** this is plotted as both raw and log UMI count format, suggesting the UMI spatial distribution across the captured tissue.

```{r, UMI_plot, echo=FALSE, fig.show='hold', fig.height=5, fig.width=5}
Vis.results$raw_UMI_plot
Vis.results$log_UMI_plot
```

## Summary report and downstream object

After all the previous work, a HTML report in RMarkDown format can be generated to summarize your data and save it for further documentation archive purposes. The `Run_HTML` function can help with this. **NOTE**: You will need to include ALL previous outputs to ensure the successful operation of this function.

Upstream pre-processing is never the end of exploring data in research, the `Run_Create_Obj` function is designed for further downstream analysis, catering for different pipelines such as `r BiocStyle::CRANpkg("Seurat")`, `r BiocStyle::Githubpkg("scverse/squidpy")`, and `r BiocStyle::Biocpkg("SpatialExperiment")` supported bioconductor packages.

```{r, seu_obj}
library(Seurat)
library(SeuratObject)
seu <- Run_Create_Obj(
   gene.matrix = st.result$gene_count,
   matched.data = st.result$matched_data,
   obj.type = 'Seurat',
   tech = 'Visium'
 )
seu
```


<details>
<summary>Session Info</summary>
```{r, sessionInfo}
sessionInfo()
```