---
title: "Making comparisons for differential abundance using contrasts"
author: "Mike Morgan"
date: "27/01/2022"
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
package: miloR
vignette: |
  %\VignetteIndexEntry{Using contrasts for differential abundance testing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = FALSE
)
```


```{r setup, message=FALSE, warning=FALSE}
library(miloR)
library(SingleCellExperiment)
library(scater)
library(scran)
library(dplyr)
library(patchwork)
library(MouseThymusAgeing)
library(scuttle)
```


# Introduction

We have seen how Milo uses graph neighbourhoods to model cell state abundance differences in an experiment when comparing 2 groups. However, when our experiment has 
collected data from more than 2 groups, we are often interested in testing between 2 specific groups. We can focus our analysis on a 2 group comparison, and 
still make use of all of the data for things like dispersion estimation, by using _contrasts_. For an in-depth treatment of contrasts we recommend users refer to the `limma` 
and `edgeR` threads on Biostars and the Bioconductor community forum. Here I will give an overview of how to use contrasts in the context of a Milo analysis.

# Load data

We will use the `MouseThymusAgeing` data package as there are multiple groups that we can compare.

```{r}
thy.sce <- MouseSMARTseqData() # this function downloads the full SCE object
thy.sce <- logNormCounts(thy.sce)
thy.sce
```

# Define cell neighbourhoods

```{r, fig.height=4.1, fig.width=10.5}
thy.sce <- runUMAP(thy.sce) # add a UMAP for plotting results later

thy.milo <- Milo(thy.sce) # from the SCE object
reducedDim(thy.milo, "UMAP") <- reducedDim(thy.sce, "UMAP")

plotUMAP(thy.milo, colour_by="SubType") + plotUMAP(thy.milo, colour_by="Age")
```

These UMAPs show how the different thymic epithelial cell subtypes and cells from different aged mice are distributed across our single-cell data set. Next 
we build the KNN graph and define neighbourhoods to quantify cell abundance across our experimental samples.

```{r}
# build the KNN graph
thy.milo <- buildGraph(thy.milo, k = 11, d = 40)
thy.milo <- makeNhoods(thy.milo, prop = 0.2, k = 11, d=40, refined = TRUE, refinement_scheme="graph") # make nhoods using graph-only as this is faster
colData(thy.milo)$Sample <- paste(colData(thy.milo)$SortDay, colData(thy.milo)$Age, sep="_")
thy.milo <- countCells(thy.milo, meta.data = data.frame(colData(thy.milo)), samples="Sample") # make the nhood X sample counts matrix
``` 


```{r}
plotNhoodSizeHist(thy.milo)
```

# Differential abundance testing with contrasts

Now that we have the pieces in place for DA testing, we can demonstrate how to use contrasts. We will use these contrasts to explicitly define which groups will be 
compared to each other.

```{r}
thy.design <- data.frame(colData(thy.milo))[,c("Sample", "SortDay", "Age")]
thy.design <- distinct(thy.design)
rownames(thy.design) <- thy.design$Sample
## Reorder rownames to match columns of nhoodCounts(milo)
thy.design <- thy.design[colnames(nhoodCounts(thy.milo)), , drop=FALSE]
table(thy.design$Age)
```
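
The reordering step above relies on indexing a data frame by row name. As a minimal sketch of that idiom, the chunk below uses a made-up count matrix and sample table (`toy.counts` and `toy.meta` are hypothetical and unrelated to the thymus data) to show the rows of the design being put in the same order as the columns of the counts.

```{r}
# hypothetical nhood x sample counts; note the column order S3, S1, S2
toy.counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("nh1", "nh2"), c("S3", "S1", "S2")))

# hypothetical sample metadata, in a different order to the count columns
toy.meta <- data.frame(Sample = c("S1", "S2", "S3"),
                       Age = c("1wk", "4wk", "16wk"))
rownames(toy.meta) <- toy.meta$Sample

# index by row name so that the design rows follow the count columns
toy.meta <- toy.meta[colnames(toy.counts), , drop = FALSE]
identical(rownames(toy.meta), colnames(toy.counts))
```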

To demonstrate the use of contrasts we will fit the model to the whole data set, but we will only compare sequential pairs of time points. I'll start with week 1 vs. 
week 4 to illustrate the syntax.

```{r}
rownames(thy.design) <- thy.design$Sample
contrast.1 <- c("Age1wk - Age4wk") # the syntax is <VariableName><ConditionLevel> - <VariableName><ControlLevel>

# we need to use the ~ 0 + Variable expression here so that we have all of the levels of our variable as separate columns in our model matrix
da_results <- testNhoods(thy.milo, design = ~ 0 + Age, design.df = thy.design, model.contrasts = contrast.1,
                         fdr.weighting="graph-overlap", norm.method="TMM")
table(da_results$SpatialFDR < 0.1)
```

This calculates a log fold change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions. 
Here, `r sum(da_results$SpatialFDR < 0.1)` neighbourhoods are DA at a spatial FDR of 10%.

You will notice that the syntax for the contrasts is quite specific. Each term starts with the name of the column variable that contains the different group levels, in this case 
the `Age` variable, followed immediately by the level name. We then define the comparison as `level1 - level2`. To understand this syntax we need to consider what we are concretely 
comparing. In this case we are asking what is the ratio of the average cell count at week 1 compared to the average cell count at week 4, where the averaging is across the replicates. The 
reason we express this as a difference rather than a ratio is because we are dealing with the _log_ fold change, and the log of a ratio is the difference of the logs.
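
As a small illustration of where names such as `Age1wk` come from, the sketch below builds a model matrix from a toy design and then checks that a difference of logs equals the log of a ratio. The `toy.*` objects and the average counts are invented for illustration and are not taken from the thymus data.

```{r}
# hypothetical design with 2 replicates per age group
toy.design <- data.frame(Age = factor(rep(c("1wk", "4wk", "16wk"), each = 2),
                                      levels = c("1wk", "4wk", "16wk")))

# ~ 0 + Age drops the intercept, so every level gets its own column
toy.model <- model.matrix(~ 0 + Age, data = toy.design)
colnames(toy.model) # column names follow the pattern <VariableName><Level>

# made-up average cell counts for one neighbourhood
toy.mean.1wk <- 40
toy.mean.4wk <- 10

# the difference of the logs is the log of the ratio
toy.lfc.diff <- log2(toy.mean.1wk) - log2(toy.mean.4wk)
toy.lfc.ratio <- log2(toy.mean.1wk / toy.mean.4wk)
c(difference = toy.lfc.diff, ratio = toy.lfc.ratio)
```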

We can also pass multiple comparisons at the same time, for instance if we wished to compare each sequential pair of time points. This will give us a better intuition behind 
how to use contrasts to compare multiple groups.

```{r}
contrast.all <- c("Age1wk - Age4wk", "Age4wk - Age16wk", "Age16wk - Age32wk", "Age32wk - Age52wk")
# contrast.all <- c("Age1wk - Age4wk", "Age16wk - Age32wk")

# this is the code called by `testNhoods`; makeContrasts() is provided by limma, which is attached along with edgeR
model <- model.matrix(~ 0 + Age, data=thy.design)
mod.contrast <- makeContrasts(contrasts=contrast.all, levels=model)

mod.contrast
```

This shows the contrast matrix, with one row per level of `Age` and one column per comparison. If we want to test each of these comparisons separately then we need to 
pass them one at a time to `testNhoods`, and then apply an additional multiple testing correction to the spatial FDR values.
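
To give some intuition for how a contrast matrix is used, the sketch below writes a small one by hand and applies it to a set of per-group coefficients. This is a simplified stand-in rather than the internals of `edgeR`: the matrix is typed out manually instead of coming from `makeContrasts`, and the coefficient values are made up.

```{r}
# hand-written contrast matrix: rows are levels, columns are comparisons
toy.contrasts <- cbind("Age1wk - Age4wk" = c(1, -1, 0),
                       "Age4wk - Age16wk" = c(0, 1, -1))
rownames(toy.contrasts) <- c("Age1wk", "Age4wk", "Age16wk")
toy.contrasts

# made-up log-abundance coefficients for a single neighbourhood
toy.coef <- c(Age1wk = 5, Age4wk = 3, Age16wk = 2)

# each contrast is a weighted sum of the coefficients
toy.lfc <- drop(t(toy.contrasts) %*% toy.coef)
toy.lfc
```

Each column of weights sums to zero, which is what makes the result a comparison between groups rather than an absolute abundance.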

```{r}
contrast1.res <- testNhoods(thy.milo, design=~0+ Age, design.df=thy.design, fdr.weighting="graph-overlap", model.contrasts = contrast.all)
head(contrast1.res)
```

Passing this matrix of contrasts performs a quasi-likelihood F-test over all 4 contrasts, hence a single p-value and spatial FDR are returned. A log fold change is returned for 
each contrast of the `Age` variable, which gives 1 log fold change column per contrast; this is the default behaviour of `glmQLFTest` in the `edgeR` package, 
which is what Milo uses for hypothesis testing. In general, and to avoid confusion, we recommend testing each contrast separately if these are the comparisons 
of interest, as shown below.

```{r}
# compare weeks 4 and 16, with week 16 as the reference (the syntax is <ConditionLevel> - <ControlLevel>).
cont.4vs16.res <- testNhoods(thy.milo, design=~0+ Age, design.df=thy.design, fdr.weighting="graph-overlap", model.contrasts = c("Age4wk - Age16wk"))
head(cont.4vs16.res)
```

Now we have a single logFC which compares nhood abundance between week 4 and week 16. As the plots below show, the LFC estimates should be the same as before, but the SpatialFDR will be different.

```{r, fig.height=4, fig.width=7.5}
par(mfrow=c(1, 2))
plot(contrast1.res$logFC.Age4wk...Age16wk, cont.4vs16.res$logFC,
     xlab="4wk vs. 16wk LFC\nmultiple contrast", ylab="4wk vs. 16wk LFC\nsingle contrast")

plot(contrast1.res$SpatialFDR, cont.4vs16.res$SpatialFDR,
     xlab="Spatial FDR\nmultiple contrast", ylab="Spatial FDR\nsingle contrast")

```
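
When several contrasts are tested separately, the spatial FDR values from each test need a further correction for the number of contrasts. The sketch below shows one simple and conservative option, a Bonferroni-style scaling across contrasts. The choice of Bonferroni is an assumption made for illustration, not a Milo recommendation, and the spatial FDR values are made up.

```{r}
# made-up spatial FDR values: 3 neighbourhoods (rows) x 2 separately tested contrasts (columns)
toy.fdr <- cbind("Age1wk - Age4wk" = c(0.01, 0.20, 0.04),
                 "Age4wk - Age16wk" = c(0.03, 0.60, 0.50))

# scale by the number of contrasts tested and cap the result at 1
toy.n.contrasts <- ncol(toy.fdr)
toy.fdr.adj <- pmin(toy.fdr * toy.n.contrasts, 1)
toy.fdr.adj

# number of neighbourhoods passing a 10% threshold for each contrast
colSums(toy.fdr.adj < 0.1)
```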


Contrasts are not limited to these simple pair-wise comparisons; we can also group levels together for comparisons. For instance, imagine we want to know 
how the cell counts in the week 1 mice differ _compared to all other time points_.

```{r}
model <- model.matrix(~ 0 + Age, data=thy.design)
ave.contrast <- c("Age1wk - (Age4wk + Age16wk + Age32wk + Age52wk)/4")
mod.contrast <- makeContrasts(contrasts=ave.contrast, levels=model)

mod.contrast
```
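
The weights in an averaged contrast of this kind can also be worked through by hand. The sketch below uses made-up coefficients for a single neighbourhood (the `toy.*` values are hypothetical) to show that the weighted sum is the week 1 coefficient minus the mean of the other four.

```{r}
# made-up log-abundance coefficients for one neighbourhood
toy.coef5 <- c(Age1wk = 6, Age4wk = 3, Age16wk = 2, Age32wk = 2, Age52wk = 1)

# weights implied by "Age1wk - (Age4wk + Age16wk + Age32wk + Age52wk)/4"
toy.weights <- c(Age1wk = 1, Age4wk = -1/4, Age16wk = -1/4, Age32wk = -1/4, Age52wk = -1/4)
sum(toy.weights) # the weights sum to zero

# weighted sum of coefficients vs. week 1 minus the mean of the other time points
toy.ave.lfc <- sum(toy.weights * toy.coef5)
toy.by.hand <- toy.coef5[["Age1wk"]] - mean(toy.coef5[c("Age4wk", "Age16wk", "Age32wk", "Age52wk")])
c(weighted = toy.ave.lfc, by.hand = toy.by.hand)
```

Because the coefficients are on a log scale, averaging them corresponds to a geometric rather than an arithmetic mean of the underlying abundances.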
In the contrast matrix from `makeContrasts` we can see that we have taken the average effect over the other time points. We can now run this contrast using `testNhoods`.

```{r}
da_results <- testNhoods(thy.milo, design = ~ 0 + Age, design.df = thy.design, model.contrasts = ave.contrast, fdr.weighting="graph-overlap")
table(da_results$SpatialFDR < 0.1)
head(da_results)
```

In this comparison there are `r sum(da_results$SpatialFDR < 0.1)` DA nhoods, which we can visualise on a superimposed single-cell UMAP.

```{r, fig.width=10, fig.height=4.5}
thy.milo <- buildNhoodGraph(thy.milo)

plotUMAP(thy.milo, colour_by="SubType") + plotNhoodGraphDA(thy.milo, da_results, alpha=0.1) +
  plot_layout(guides="auto" )
```

In these side-by-side UMAPs we can see that there is an enrichment of the Perinatal cTEC and Proliferating TEC populations in the 1 week old mice compared to 
the other time points.

For a more extensive description of the uses of contrasts please take a look at the `r BiocStyle::Biocpkg("edgeR")` documentation.


<details>
  <summary>**Session Info**</summary>
  
```{r}
sessionInfo()
```

</details>