---
title: "Reproducing the original dandelion method/paper"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
    number_sections: true
knitr:
  opts_chunk:
    dev: 'png'
    fig.align: 'left'
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Reproducing the original dandelion method/paper}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error = FALSE, message = FALSE, warning = FALSE)
library(BiocStyle)
```

In this vignette, we will demonstrate how to perform TCR trajectory analysis 
starting from data that has already been processed by `dandelion` in python. 
This is to demonstrate that the original method/results in the `dandelion` 
paper can be reproduced.

## Load the required libraries

We will also load `scRepertoire` and `scater` for the analysis.
```{r, message=FALSE, warning=FALSE}
library(dandelionR)
library(scRepertoire)
library(scater)
```

## Load data
First, we will load the demo data. This is a down-sampled dataset from
[Suo et al 2024](https://www.nature.com/articles/s41587-023-01734-7).

It contains 10,000 cells with the TCR information and the dimensionality 
reduced data (scVI) that we need for this tutorial. The gene expression matrix 
is not required for this tutorial so it is not included in the demo data. We 
will show in a separate tutorial how to start with data from `scRepertoire`.
```{r, warning = FALSE}
data(sce_vdj)
```
We will set the seed so that the plots and results are consistent.
```{r}
set.seed(100)
```

## Filter the data

To begin, we will filter the data and extract the TCR information so that we 
can construct pseudobulks. 

Because the `colData` of this single-cell object is populated with the TCR 
information from `dandelion` in python (through this method: `dandelion` -> 
`anndata` -> `anndata2ri`, which essentially converts an `AnnData` object in 
python to `SingleCellExperiment` in R), we can directly use the 
`setupVdjPseudobulk` function to extract the TCR information and construct 
the pseudobulks.

Here, we also need to specify the `allowed_chain_status` to keep to the 
relevant contigs. Default for `allowed_chain_status` is `NULL`, which will 
keep all contigs. In the standard R workflow that starts from `scRepertoire`, 
we will assume that all the QC and filtering has already been handled by 
`scRepertoire`.

```{r, warning = FALSE}
sce_vdj <- setupVdjPseudobulk(sce_vdj,
    already.productive = FALSE,
    allowed_chain_status = c(
        "Single pair", "Extra pair",
        "Extra pair-exception", "Orphan VDJ",
        "Orphan VDJ-exception"
    )
)
```

We can visualise the UMAP of the filtered data.
```{r, warning = FALSE}
plotUMAP(sce_vdj, color_by = "anno_lvl_2_final_clean")
```

## Milo object and neighbourhood graph construction

We will use miloR to create the pseudobulks based on the gene expression data. 
The goal is to construct a neighbourhood graph with many neighbors with which 
we can sample the representative neighbours to form the objects.
```{r, warning = FALSE}
library(miloR)
traj_milo <- Milo(sce_vdj)
milo_object <- buildGraph(traj_milo, k = 50, d = 20, reduced.dim = "X_scvi")
milo_object <- makeNhoods(milo_object, reduced_dims = "X_scvi", d = 20)
```
### Construct UMAP on milo neighbor graph

We can visualise this milo object using UMAP.
```{r, warning = FALSE}
milo_object <- miloUmap(milo_object)
```
```{r}
plotUMAP(milo_object,
    color_by = "anno_lvl_2_final_clean",
    dimred = "UMAP_knngraph"
)
```

## Construct pseudobulked VDJ feature space

Next, we will construct the pseudobulked VDJ feature space using the 
neighbourhood graph constructed above. We will also run PCA on the 
pseudobulked VDJ feature space.
```{r}
pb.milo <- vdjPseudobulk(milo_object, col_to_take = "anno_lvl_2_final_clean")

pb.milo <- runPCA(pb.milo, assay.type = "Feature_space")
```

We can visualise the PCA of the pseudobulked VDJ feature space.
```{r}
plotPCA(pb.milo, color_by = "anno_lvl_2_final_clean")
```

## TCR trajectory inference using Absorbing Markov Chain

In the original `dandelion` python package, the trajectory inference is done 
using the `palantir` package. Here, we implement the absorbing markov chain 
approach in dandelionR to infer the trajectory, leveraging on `destiny` for 
diffusion map computation.

### Define root and branch tips
```{r}
library(SingleCellExperiment)

# extract the PCA matrix
pca <- t(as.matrix(reducedDim(pb.milo, type = "PCA")))
# define the CD8 terminal cell as the top-most cell and CD4 terminal cell as
# the bottom-most cell
branch.tips <- c(which.max(pca[2, ]), which.min(pca[2, ]))
names(branch.tips) <- c("CD8+T", "CD4+T")
# define the start of our trajectory as the right-most cell
root <- which.max(pca[1, ])
```

### Construct diffusion map
```{r, warning = FALSE}
library(destiny)
# Run diffusion map on the PCA
dm <- DiffusionMap(t(pca), n_pcs = 50, n_eigs = 10)
```

### Compute diffussion pseudotime on diffusion map
```{r}
dif.pse <- DPT(dm, tips = c(root, branch.tips), w_width = 0.1)
```

```{r, message=FALSE}
# the root is automatically called DPT + index of the root cell
DPTroot <- paste0("DPT", root)
# store pseudotime in milo object
pb.milo$pseudotime <- dif.pse[[DPTroot]]
# set the colours for pseudotime
pal <- colorRampPalette(rev((RColorBrewer::brewer.pal(9, "RdYlBu"))))(255)
plotPCA(pb.milo, color_by = "pseudotime") +
    scale_colour_gradientn(colours = pal)
```

### Markov chain construction on the pseudobulk VDJ feature space

```{r}
pb.milo <- markovProbability(
    milo = pb.milo,
    diffusionmap = dm,
    terminal_state = branch.tips,
    root_cell = root,
    pseudotime_key = "pseudotime"
)
```

### Visualising branch probabilities

With the Markov chain probabilities computed, we can visualise the branch 
probabilities towards CD4+ or CD8+ T-cell fate on the PCA plot.
```{r, message=FALSE}
plotPCA(pb.milo, color_by = "CD8+T") + scale_color_gradientn(colors = pal)
plotPCA(pb.milo, color_by = "CD4+T") + scale_color_gradientn(colors = pal)
```

## Transfer

The next step is to project the pseudotime and the branch probability 
information from the pseudobulks back to each cell in the dataset. If the cell 
do not belong to any of the pseudobulk, it will be removed. If a cell belongs 
to multiple pseudobulk samples, its value should be calculated as a weighted 
average of the corresponding values from each pseudobulk, where each weight is 
inverse of the size of the pseudobulk.

### Project pseudobulk data to each cell
```{r}
cdata <- projectPseudotimeToCell(milo_object, pb.milo, branch.tips)
```

### Visualise the trajectory data on a per cell basis
```{r, message=FALSE}
plotUMAP(cdata, color_by = "anno_lvl_2_final_clean", dimred = "UMAP_knngraph")
plotUMAP(cdata, color_by = "pseudotime", dimred = "UMAP_knngraph") +
    scale_color_gradientn(colors = pal)
plotUMAP(cdata, color_by = "CD4+T", dimred = "UMAP_knngraph") +
    scale_color_gradientn(colors = pal)
plotUMAP(cdata, color_by = "CD8+T", dimred = "UMAP_knngraph") +
    scale_color_gradientn(colors = pal)
```

And that's it! We have successfully inferred the trajectory of the T-cells in 
this dataset!

## Session info

```{r, warning = FALSE}
sessionInfo()
```