---
package: SpatialCPie

title: Spatial transcriptomics cluster analysis with SpatialCPie

author:
- name: Joseph Bergenstråhle
  affiliation:
    Science for Life Laboratory, Department of Gene Technology, School of
    Biotechnology, Royal Institute of Technology (KTH), SE-106 91 Solna, Sweden
- name: Ludvig Bergenstråhle
  affiliation:
    Science for Life Laboratory, Department of Gene Technology, School of
    Biotechnology, Royal Institute of Technology (KTH), SE-106 91 Solna, Sweden
- name: Joakim Lundeberg
  affiliation:
    Science for Life Laboratory, Department of Gene Technology, School of
    Biotechnology, Royal Institute of Technology (KTH), SE-106 91 Solna, Sweden

vignette: >
  %\VignetteIndexEntry{SpatialCPie}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

output: BiocStyle::html_document
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "# >",
    dev = "jpeg"
)
```

Notice to users using OSX with R-devel
============

SpatialCPie depends on the ggiraph package, for which their is no r-devel version available at the moment. 
This means that SpatialCPie will not work on OSX with R-devel. 

Introduction
============

SpatialCPie is an R package designed to facilitate cluster evaluation for
Spatial Transcriptomics (ST) data by providing intuitive visualizations that
display the relationship between clusters in order to guide the user during
cluster identification, selection and further downstream applications

Usage example
=============

```{r, include = FALSE}
library(rlang)
library(dplyr)

# Patch SpatialCPie::runCPie so that it returns a mock output object instead of
# starting the web server
assignInNamespace(
    ns = "SpatialCPie",
    x = "runCPie",
    value = function(
        counts,
        image = NULL,
        spotCoordinates = NULL,
        margin = "spot",
        resolutions = 2:4,
        assignmentFunction = function(k, x) kmeans(x, centers =  k)$cluster,
        scoreMultiplier = 1.0
    ) {
        data <- SpatialCPie:::.preprocessData(
            counts,
            coordinates = spotCoordinates,
            margin = margin,
            resolutions = resolutions,
            assignmentFunction = assignmentFunction
        )
        arrayPlots <- lapply(
            setNames(
                nm = levels(data$scores$resolution) %>%
                    purrr::keep(~ . > 1)
            ),
            function(r) {
                scores <- data$scores %>% filter(.data$resolution == r)
                p <- SpatialCPie:::.arrayPlot(
                    scores = scores %>%
                        select(.data$spot, .data$name, .data$score),
                    coordinates = data$coordinates,
                    image =
                        if (!is.null(image) && !is.null(data$coordinates))
                            grid::rasterGrob(
                                image,
                                width  = grid::unit(1, "npc"),
                                height = grid::unit(1, "npc"),
                                interpolate = TRUE
                            )
                        else NULL,
                    scoreMultiplier = scoreMultiplier,
                    spotOpacity = if (is.null(image)) 1.0 else 0.7
                ) +
                    ggplot2::guides(
                        fill = ggplot2::guide_legend(title = "Cluster")
                    ) +
                    ggplot2::scale_fill_manual(
                        values = data$colors,
                        labels = unique(scores$cluster)
                    )
            }
        )
        list(
            clusters = data$assignments %>% select(-.data$name),
            clusterGraph = SpatialCPie:::.clusterGraph(
                data$assignments,
                data$means,
                data$featureName,
                transitionLabels = TRUE,
                transitionThreshold = 0.05
            ) +
                ggplot2::scale_color_manual(values = data$colors),
            arrayPlot = arrayPlots
        )
    }
)
```

```{r}
library(SpatialCPie)
set.seed(42)
```

Input data
----------

In this example, we will use a downsampled dataset from an experiment on the
human heart.
We begin by loading the count data[^data_pipeline]:

```{r}
counts <- read.table(
    system.file("extdata", "counts.tsv", package = "SpatialCPie"),
    sep = "\t",
    check.names = FALSE
)
counts[1:5, 1:5]
```

[^data_pipeline]: Generated by the [ST
pipeline](https://github.com/SpatialTranscriptomicsResearch/st_pipeline)

To overlay the data on the tissue, we also need to load the tissue image and its
corresponding spot data[^spot_pipeline], which specifies the pixel coordinates
of each spot:

```{r}
tissue <- jpeg::readJPEG(
    system.file("extdata", "he_image.jpg", package = "SpatialCPie")
)
spots <- parseSpotFile(
    system.file("extdata", "spot_data.tsv", package = "SpatialCPie")
)
head(spots)
```

[^spot_pipeline]: Generated by the [ST spot
detector](https://github.com/SpatialTranscriptomicsResearch/st_spot_detector).

Preprocessing
-------------

Typically, it's good to conduct some data filtering prior to the analysis:
This could include removing spots that are outside of the tissue or removing
spots or genes that have a low number of reads.

Since the spot file only contains the spots that are under the tissue, we can
use it to subset the counts:

```{r}
counts <- counts[, which(colnames(counts) %in% rownames(spots))]
```

Let's also remove all spots and genes that have less than 20 reads in total:

```{r}
repeat {
    d <- dim(counts)
    counts <- counts[rowSums(counts) >= 20, colSums(counts) >= 20]
    if (all(dim(counts) == d)) {
        break
    }
}
```

Our data has already been normalized.
If you are using raw count data, it is usually a good idea to normalize it before proceeding.
We have found `SCTransform` in the [Seurat](https://satijalab.org/seurat/) package to work well for Spatial Transcriptomics data.

Computing cluster assignments
-----------------------------

The cluster assignments will be calculated during the loading of the gadget. The cluster assignments are labels assigned to each spot over different *cluster resolutions*, where we use the terminology "(cluster) resolution k" to refer to
a partitioning of the spots into $k$ clusters.

By default, clustering is performed with `base::kmeans` at resolutions `2:4`.
The arguments `resolutions` and `assignmentFunction` to `runCPie` can be used to customize these defaults.
For example, to instead cluster over resolutions 3 and 5 with `cluster::pam`, call `runCPie` as follows:

```{r, eval=FALSE, echo=TRUE}
runCPie(
    counts,
    resolutions = c(3, 5),
    assignmentFunction = function(k, x) cluster::pam(x, k = k)$clustering
)
```

Use `?runCPie` for a complete list of function options.

Visualization
-------------

We are now ready to use the gadget to visualize the data:

```{r}
result <- runCPie(
    counts,
    image = tissue,
    spotCoordinates = spots
)
```

This opens up the gadget window, which has two main elements: the **cluster graph** and the **spatial array plots**.
The cluster plot is interactive and by selecting rows of nodes, the corresponding spatial array plots are also displayed. These can be viewed by scrolling down. The information contained in these plots will be discussed in the next section. There are a number of parameters that can be changed within the gadget which affects the visual representation.

To exit the gadget, press the "Done" button.
The plots in the gadget and the cluster assignments are stored in the output
object, which has the following structure:

```{r}
str(result, max.level = 1)
```

### Cluster plot

```{r}
result$clusterGraph
```

The cluster graph is a visual representation of how the spatial features
transition from clusters of lower resolution to clusters of higher resolution.
The edge opacities show the proportion of spots that transition to or from a
given cluster while node radii display the size of each cluster.

When launching the gadget, spots are relabeled so as to minimize the number of
crossovers (label switches) between resolutions in the data.
Moreover, cluster color labels are selected so that dissimilar clusters are
farther away in color space.

### Spatial array plots

```{r}
result$arrayPlot$`4`
```

The spatial array plots show the spatial locations of the clusters.
The pie chart proportions are based on the following scoring scheme.

#### Spot-based clustering

In the case of spot-based clustering (`margin = "spot"` in the call to
`runCPie`, which is the default), the score for cluster $k$ in spot $s$ is
computed as:

$$
\mbox{score}(s, k) = \mbox{exp}\left(
    - \mbox{RMSD}\left(x_s, \mbox{mean}({\{x_{s'}\}_{s' \in C(k)}})\right)
\right),
$$

where $x_i$ is the gene expression vector of spot $i$, $C(k)$ is the set of
spots in cluster $k$, and $\mbox{RMSD}(a, b)$ is the root-mean-square deviation
between gene vectors $a$ and $b$.

#### Gene-based clustering

In the case of gene-based clustering (`margin = "gene"` in the call to
`runCPie`), the score is computed as:

$$
\mbox{score}(s, k) = \mbox{mean}\left(
    {\{\tilde{x}_{sg}\}_{g \in C(k)}}
\right),
$$

where $\tilde{x}_{sg}$ is a log-normalized relative count value for gene $g$ in
spot $s$.

#### The score multiplier

The score can be tuned by a *score multiplier*, $\lambda$, to either enhance or
decrease the contrast between clusters.
The final (visualized) score is an exponentiation of $\mbox{score}$:

$$
\mbox{score'}(s, k) = \mbox{score}(s, k) ^ \lambda
$$

Note that the definition implies that it is possible to (practically) hard-label
the spots by setting $\lambda$ to a high value.
This follows since the relative score for the highest scoring cluster will tend
towards $1$ as $\lambda\to\infty$.
For example, setting $\lambda=10.0$ yields

```{r, echo=FALSE, message=FALSE}
result <- runCPie(
    counts,
    image = tissue,
    spotCoordinates = spots,
    scoreMultiplier = 10.0
)
result$arrayPlot$`4`
```

Sub-clustering
--------------

After a "global" analysis of the entire tissue section, it is often interesting
to sub-cluster parts of the tissue.
This can be achieved via cluster selection in the gadget and subsequent
sub-sampling of the clusters that the user wants to re-cluster further.

As a practical example, from the array plot above, we can see that clusters 1
and 4 in resolution 4 appear in mostly the same spatial areas and, from the
cluster graph, we can also see that they are mostly derived from the same
lower-resolution cluster.
Therefore, it could be interesting to sub-cluster the spots assigned to these
clusters:

```{r}
library(dplyr)
subcluster <-
    result$clusters %>%
    filter(resolution == 4, cluster %in% c(1, 4)) %>%
    `$`("unit")
head(subcluster)
```

```{r}
subclusterResult <- runCPie(
    counts = counts[, subcluster],
    image = tissue,
    spotCoordinates = spots
)
subclusterResult$clusterGraph
```

The cluster graph indicates that we appear to pick up three different clusters of
stromal cells, as there is a clear separation between the clusters up to and
including resolution 3.
In contrast, clusters in resolution 4 show a pattern of shared ancestry,
indicating that cluster assignments are becoming more ambiguous.

```{r}
subclusterResult$arrayPlot$`3`
```