---
title: "Overview of the condiments workflow"
author: "Hector Roux de Bézieux"
date: '`r format(Sys.time(), "%d %B , %Y")`'
bibliography: ref.bib
output: 
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{The condiments workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r packages, include=F}
library(knitr)
opts_chunk$set(
  fig.pos = "!h", out.extra = "",
  fig.align = "center"
)
```

# Initial pre-processing
## Generating a synthetic dataset 
We will use a synthetic dataset to illustrate the functionalities of the _condiments_ package. We start directly with a dataset where the following steps are assumed to have been run:

+ Obtaining count matrices for each setting (i.e. each condition).
+ Integration and normalization between the conditions.
+ Reduced Dimension Estimations
+ (Clustering)

```{r, warning=FALSE, message=F}
# For analysis
library(condiments)
library(slingshot)
# For data manipulation
library(dplyr)
library(tidyr)
# For visualization
library(ggplot2)
library(RColorBrewer)
library(viridis)
set.seed(2071)
theme_set(theme_classic())
```

```{r}
data("toy_dataset", package = "condiments")
df <- toy_dataset$sd
```

As such, we start with a matrix `df` of metadata for the cells: coordinates in a reduced dimension space `(Dim1, Dim2)`, a vector of conditions assignments `conditions` (A or B) and a lineage assignment.

## Vizualisation

We can first plot the cells on the reduced dimensions

```{r}
p <- ggplot(df, aes(x = Dim1, y = Dim2, col = conditions)) +
  geom_point() +
  scale_color_brewer(type = "qual")
p
```

We can also visualize the underlying skeleton structure of the two conditions.

```{r}
p <- ggplot(df, aes(x = Dim1, y = Dim2, col = conditions)) +
  geom_point(alpha = .5) +
  geom_point(data = toy_dataset$mst, size = 2) +
  geom_path(data = toy_dataset$mst, aes(group = lineages), size = 1.5) +
  scale_color_brewer(type = "qual") + 
  facet_wrap(~conditions) +
  guides(col = FALSE)
p
```

# Differential Topology
## Exploratory analysis

We can then compute the __imbalance score__ of each cell using the *imbalance_score* function.

```{r}
scores <- imbalance_score(Object = df %>% select(Dim1, Dim2) %>% as.matrix(),
                          conditions = df$conditions)
df$scores <- scores$scores
df$scaled_scores <- scores$scaled_scores
```

There are two types of scores. The raw score is computed on each cell and looks at the condition distribution of its neighbors compared the the overall distribution. The size of the neighborhood can be set using the `k` argument, which specify the number of neighbors to consider. Higher values means more local imbalance. 

```{r}
ggplot(df, aes(x = Dim1, y = Dim2, col = scores)) +
  geom_point() +
  scale_color_viridis_c(option = "C")
```

The local scores are quite noisy so we can then use local smoothers to smooth the scores of individual cells. The smoothness is dictated by the `smooth` argument. Those smoothed scores were also computed using the `imbalance_score` function.

```{r}
ggplot(df, aes(x = Dim1, y = Dim2, col = scaled_scores)) +
  geom_point() +
  scale_color_viridis_c(option = "C")
```

As could be guessed from the original plot, the bottom lineage shows a lot of imbalance while the top one does not. The imbalance score can be used to check:
+ If the integration has been successful. In general, some regions should be balanced
+ To identify the regions of imbalance for further analyses.

## Trajectory Inference

The first step of our workflow is to decide whether or not to infer the trajectories separately or not. On average, it is better to infer a common trajectory, since a) this allow for a wider range of downstream analyses, and b) more cells are used to estimate the trajectory. However, the condition effect might be strong enough to massively disrupt the differentiation process, which would require fitting separate trajectories. 

__slingshot__[@Street2018a] relies on a reduced dimensionality reduction representation of the data, as well as on cluster labels. We can visualize those below: 

```{r}
ggplot(df, aes(x = Dim1, y = Dim2, col = cl)) +
  geom_point()
```
The __topologyTest__ assess the quality of the common trajectory inference done by slingshot and test whether we should fit a common or separate trajectory. This test relies on repeated permutations of the conditions followed by trajectory inference so it can take a few seconds.

```{r}
rd <- as.matrix(df[, c("Dim1", "Dim2")])
sds <- slingshot(rd, df$cl)
## Takes ~1m30s to run
top_res <- topologyTest(sds = sds, conditions = df$conditions)
knitr::kable(top_res)
```

The test clearly fails to reject the null that we can fit a common trajectory so we can continue with the `sds` object. This will facilitate downstream analysis. For an example of how to proceed if the __topologyTest__ reject the null, we invite the user to refer to [relevant case study used in our paper](https://hectorrdb.github.io/condimentsPaper/articles/KRAS.html).

We can thus visualize the trajectory

```{r}
ggplot(df, aes(x = Dim1, y = Dim2, col = cl)) +
  geom_point() +
  geom_path(data =  slingCurves(sds, as.df = TRUE) %>% arrange(Order),
            aes(group = Lineage), col = "black", size = 2)
```

# Differential Progression

Even though we can fit a common trajectory, it does not mean that the cells will differentiate similarly between the conditions. The first question we can ask is: for a given lineage, are cells equally represented along pseudotime between conditions? 

```{r}
psts <- slingPseudotime(sds) %>%
  as.data.frame() %>%
  mutate(cells = rownames(.),
         conditions = df$conditions) %>%
  pivot_longer(starts_with("Lineage"), values_to = "pseudotime", names_to = "lineages")
```

## Visualization

```{r, warning=FALSE}
ggplot(psts, aes(x = pseudotime, fill = conditions)) +
  geom_density(alpha = .5) +
  scale_fill_brewer(type = "qual") +
  facet_wrap(~lineages) +
  theme(legend.position = "bottom")
```

The pseudotime distributions are identical across conditions for the first lineage but there are clear differences between the two conditions in the second lineage. 

## Testing for differential progression

To test for differential progression, we use the __progressionTest__. The test can be run with `global = TRUE` to test when pooling all lineages, or `lineages = TRUE` to test every lineage independently, or both. Several tests are implemented in the __progressionTest__. function. Here, we will use the default, the custom KS test [@smirnov1939estimation]. 

```{r}
prog_res <- progressionTest(sds, conditions = df$conditions, global = TRUE, lineages = TRUE)
knitr::kable(prog_res)
```

As expected, there is a global difference over all lineages, which is driven by differences of distribution across lineage 2 (i.e. the bottom one).

# Differential fate selection

Even though we can fit a common trajectory, it does not mean that the cells will differentiate similarly between the conditions. The first question we can ask is: for a given lineage, are cells equally between the two lineages for the two conditions? 

## Vizualisation

Visualizing differences 2D distributions can be somewhat tricky. However, it is important to note that the sum of all lineage weights should sum to 1. As such, we can only plot the weights for the first lineage.

```{r}
df$weight_1 <- slingCurveWeights(sds, as.probs = TRUE)[, 1]
ggplot(df, aes(x = weight_1, fill = conditions)) +
  geom_density(alpha = .5) +
  scale_fill_brewer(type = "qual") +
  labs(x = "Curve weight for the first lineage")
```

The distribution has tri modes, which is very often the case for two lineages:
+ A weight around 0 represent a cell that is mostly assigned to the other lineage (i.e. lineage 2 here)
+ A weight around .5 represent a cell that is equally assigned to both lineages, i.e. before the bifurcation. 
+ A weight around 1 represent a cell that is mostly assigned to this lineage (i.e. lineage 1 here)

In condition A, we have many more cells with a weight of 0 and, since those are density plots, fewer cells with weights around .5 and 1. Visually, we can guess that cells in condition B differentiate preferentially along lineage 1.

## Testing for differential fate selection

To test for differential fate selection, we use the __fateSelectionTest__. The test can be run with `global = TRUE` to test when pooling all pairs of lineages, or `pairwise = TRUE` to test every pair independently, or both. Here, there is only one pair so the options are equivalent. Several tests are implemented in the __fateSelectionTest__. function. Here, we will use the default, the classifier test[@Lopez-Paz2016].

```{r}
set.seed(12)
dif_res <- fateSelectionTest(sds, conditions = df$conditions, global = FALSE, pairwise = TRUE)
knitr::kable(dif_res)
```

As could be guessed from the plot, we have clear differential fate selection.

# Differential Expression

The workflow above focus on global differences, looking at broad patterns of differentiation. While this is a necessary first step, gene-level information is also quite meaningful. 

To do so requires __tradeSeq__[@VandenBerge2020] > 1.3.0. Considering that we have a count matrix `counts`, the basic workflow is:s

```{r, eval = FALSE}
library(tradeSeq)
sce <- fitGAM(counts = counts, sds = sds, conditions = df$conditions)
cond_genes <- conditionTest(sds)
```

For more details on fitting the smoothers, we refer users to [the tradeSeq website](http://statomics.github.io/tradeSeq) and to the accompanying [Bioconductor workshop](https://kstreet13.github.io/bioc2020trajectories/articles/workshopTrajectories.html#differential-expression-1).

# Conclusion

For both of the above procedures, it is important to note that we are making multiple comparisons (in this case, 5). The p-values we obtain from these tests should be corrected for multiple testing, especially for trajectories with a large number of lineages.

That said, trajectory inference is often one of the last computational methods in a very long analysis pipeline (generally including gene-level quantification, gene filtering / feature selection, and dimensionality reduction). Hence, we strongly discourage the reader from putting too much faith in any p-value that comes out of this analysis. Such values may be useful suggestions, indicating particular features or cells for follow-up study, but should generally not be treated as meaningful statistical quantities.

# Session Info

```{r}
sessionInfo()
```

# References