---
title: "Decoding T- and B-cell receptor repertoires with ClustIRR"
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{Decoding T- and B-cell receptor repertoires with ClustIRR}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include=FALSE}
knitr::opts_chunk$set(comment = "", warning = FALSE, message = FALSE)
```

```{r}
library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
theme_set(new = theme_bw(base_size = 10))
library(ggrepel)
library(patchwork)
options(digits=2)
```

# Introduction

Adaptive immunity employs diverse immune receptor repertoires (IRRs; B-
and T-cell receptors) to combat evolving pathogens including viruses,
bacteria, and cancers. Receptor diversity arises through V(D)J
recombination - combinatorial assembly of germline genes generating
unique sequences. Each naive lymphocyte consequently expresses a
distinct receptor, enabling broad antigen recognition.

B cells engage antigens directly via BCR complementarity-determining
regions (CDRs), while T cells recognize peptide-MHC complexes through
TCR CDRs. Antigen recognition triggers clonal expansion, producing
effector populations that mount targeted immune responses.

High-throughput sequencing (HT-seq) enables tracking of TCR/BCR
community dynamics across biological conditions (e.g.,
pre-/post-treatment), offering insights into responses to immunotherapy
and vaccination. However, two key challenges complicate this approach:

1.  **Extreme diversity and privacy**: TCRs/BCRs are highly
    personalized, with incomplete sampling particularly problematic in
    clinical settings where sample volumes are limited. Even
    comprehensive sampling reveals minimal repertoire overlap between
    individuals.

2.  **Similar TCRs/BCRs recognize similar antigens**

This vignette introduces `r Biocpkg("ClustIRR")`, a computational method
that addresses these challenges by: (1) Identifying
specificity-associated receptor communities through sequence clustering,
and (2) Applying Bayesian models to quantify differential community
abundance across conditions.

# Installation

`r Biocpkg("ClustIRR")` is freely available as part of Bioconductor,
filling the gap that currently exists in terms of software for
quantitative analysis of IRRs.

To install `r Biocpkg("ClustIRR")` please start R and enter:

```{r, eval=FALSE}
if(!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ClustIRR")
```

# ClustIRR algorithm

```{r graphic, echo = FALSE, fig.align="left", out.width='100%'}
knitr::include_graphics("../inst/extdata/logo.png")
```

## Input

The main input of `r Biocpkg("ClustIRR")` is a repertoire (`s`), which
should be provided as data.frame. The rows in the data.frame correspond
to **clones** (clone = group of cells derived from a common parent cell
by clonal expansion). We use the following data from each clone:

-   **CDR3 amino acid sequences** from one/both chains (e.g.
    CDR3$\alpha$ and CDR3$\beta$ from TCR$\alpha\beta$s).
-   **Clone size**, which refers to the frequency of cells that belong
    to the clone.

In a typical scenario, the user will have more than one repertoire (see
workflow). For instance, the user will analyze longitudinal repertoire
data, i.e., two or three repertoires taken at different time points; or
across different biological conditions.

Let's look at dataset `D1` that is provided within
`r Biocpkg("ClustIRR")`. `D1` contains three TCR$\alpha\beta$
repertoires: $a$, $b$, and $c$ and their metadata: `ma`, `mb` and `mc`.

```{r}
data("D1", package = "ClustIRR")
str(D1)
```

Extract the data.frames for each TCR repertoire and their metadata:

We will merge three TCR repertoires into the data.frame tcr_reps.

```{r}
tcr_reps <- rbind(D1$a, D1$b, D1$c)
```

We will do the same for the metadata

```{r}
meta <- rbind(D1$ma, D1$mb, D1$mc)
```

## Algorithm

`r Biocpkg("ClustIRR")` performs the following steps.

1.  Compute similarities between T-cell clones within each TCR
    repertoire

2.  Construct a graph from each TCR repertoire

3.  Construct a joint similarity graph ($J$)

4.  Detect communities in $J$

5.  Analyze Differential Community Occupancy (DCO)

    a\. Between individual TCR repertoires with model $M$

    b\. Between groups of TCR repertoires from biological conditions
    with model $M_h$

6.  Inspect results

## **Step 1.** Compute TCR clone similarities in a repertoire with `cluster_irr`

`r Biocpkg("ClustIRR")` aims to quantify the similarity between pairs of
TCR clones based on the similarities of their CDR3s sequences. For this
it employs Basic Local-Alignment Search Tool (BLAST) via the R-package
`r CRANpkg("blaster")`. Briefly, a protein database is constructed from
all CDR3 sequences, and each CDR3 sequence is used as a query. This
enables fast sequence similarity searches. Furthermore, only CDR3
sequences matches with $\geq$ 70% sequence identity to the query are
retained. This step reduces the computational and memory requirements,
without impacting downstream community analyses, as CDR3 sequences with
lower typically yield low similarity scores.

For matched CDR3 pair, an alignment score ($\omega$) is computed using
BLOSUM62 substitution scores with gap opening penalty of -10 and gap
extension penalty of -4. $\omega$ is the sum of substitution scores and
gap penalties in the alignment. Identical or highly similar CDR3
sequence pairs receive large positive $\omega$ scores, while dissimilar
pairs receive low or negative $\omega$. To normalize $\omega$ for
alignment length, `r Biocpkg("ClustIRR")` computes
$\bar{\omega} = \omega/l$, where $l$ is the alignment length yielding
normalized alignment score $\bar{\omega}$. This normalization, also used
in iSMART (Zhang, 2020), ensures comparability across CDR3 pairs of
varying lengths.

`r Biocpkg("ClustIRR")` also computes alignment scores for the CDR3
**core** regions ($\omega^c$ and $\bar{\omega}^c$). The CDR3 core,
representing the central loop region with high antigen-contact
probability (Glanville, 2017), is generated by trimming three residues
from each end of the CDR3 sequence. Comparing $\bar{\omega}^c$ and
$\bar{\omega}$ allows assessment of whether sequence similarity is
concentrated in the core or flanking regions.

## **Step 2-3.** Construct TCR repertoire graphs and join them into $J$

Next, `r Biocpkg("ClustIRR")` builds a graph for each TCR repertoire.
The graphs have *nodes* and *weighted edges*:

-   nodes: clones from each TCR repertoire. Each clone attribute (clone
    size, CDR3 sequences, etc.) is provided as node attribute
-   edges: undirected edges connecting pairs of nodes based on
    CDR3$\alpha$ and CDR3$\beta$ similarity scores ($\bar{\omega}$ and
    $\bar{\omega}^c$) in each TCR repertoire (computed in step 1.)

Then the graphs are joined together: edges between TCR clones from
different TCR repertoires are computed using the same procedure outlined
in step 1. The joint graph $J$ is stored as an `igraph` object.

## Run steps 1-3 with `clustirr`

Step 1. involves calling the function `clust_irr` which returns an S4
object of class `clust_irr`.

```{r}
cl <- clustirr(s = tcr_reps, meta = meta, control = list(gmi = 0.7))
```

The output **complex** list with three elements:

1.  `graph` = the joint graph $J$

2.  `clust_irrs` = list with S4 `clust_irr` objects one for each TCR
    repertoire

3.  `multigraph` = logical value which tell us whether $J$ contains
    multiple or a single TCR repertoire

### Inspect the content of `clust_irrs`

We can look at the similarity scores between CDR3 sequences of TCR
repertoire `a`. Each row is a pair of CDR3 sequences from the repertoire
annotated with the following metrics:

-   `max_len`: length of the longer CDR3 sequence in the pair
-   `max_clen`: length of the longer CDR3 core sequence in the pair
-   `weight`: $\omega$ = BLOSUM62 score of the **complete** CDR3
    alignment
-   `cweight`= $\omega_c$: BLOSUM62 score of CDR3 **cores**
-   `nweight` = $\bar{\omega}$: normalized `weight` by `max_len`
-   `ncweight` = $\bar{\omega}_c$: normalized `cweight` by `max_clen`

The results for CDR3$\alpha$ sequence pairs from repertoire `a`:

```{r}
kable(head(cl$clust_irrs[["a"]]@clust$CDR3a), digits = 2)
```

... the same table for CDR3$\beta$ sequence pairs from repertoire `a`:

```{r}
kable(head(cl$clust_irrs[["a"]]@clust$CDR3b), digits = 2)
```

Notice that very similar CDR3$\alpha$ or CDR3$\beta$ pairs have high
normalized alignment scores ($\bar{\omega}$). We have similar tables for
repertoire `b` and `c`. These are used internally to construct graphs in
the next step.

### Inspect graphs with `plot_graph`

We can use the function `plot_graph` for interactive visualization of
relatively small graphs.

The function `clust_irr` performs automatic annotation of TCR clones
based on multiple databases (DBs), including: VDJdb, TCR3d, McPAS-TCR.
Each TCR clone recieves two types of annotations (per chain and per
database):

-   **Ag_species**: the name of the antigen species recognized by the
    clone's CDR3

-   **Ag_gene**: the name of the antigen gene recognized by clone's CDR3

Hence, in the plot we can highlight TCR clones that recognize certain
antigenic species or genes (see dropdown menu), or use the hoovering
function to look at the CDR3 sequences of nodes.

Do this now!

```{r}
plot_graph(cl, select_by = "Ag_species", as_visnet = TRUE)
```

Notice that even in this small case study--where each TCR repertoire
contains **only 500 TCR clones**--we observe that the **joint graph is
complex**! This complexity makes qualitative analyses impractical. To
address this, `r Biocpkg("ClustIRR")` focuses on **quantitative
analyses** (see the next steps).

### You can evaluate $J$ with igraph

We can use igraph functions to inspect various properties of the graph
in `cl$graph`. For instance, below, we extract the edge attributes and
visualize the distributions of the edge attributes `ncweight` and
`nweight` for all CDR3$\alpha$ and CDR3$\beta$ sequence pairs.

```{r, fig.width=6, fig.height=4.5}
# data.frame of edges and their attributes
e <- igraph::as_data_frame(x = cl$graph, what = "edges")
```

```{r, fig.width=5, fig.height=3.5}
ggplot(data = e)+
  geom_density(aes(ncweight, col = chain))+
  geom_density(aes(nweight, col = chain), linetype = "dashed")+
  xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
  theme(legend.position = "top")
```

Can you guess why we observe **trimodal** distributions?

-   *Top mode: weights of identical CDR3 sequence pairs from the
    different TCR repertoires*

-   *Middle mode: weights of CDR3 sequences with same lengths*

-   *Bottom mode: weights of CDR3 sequences with different lengths -\>
    scores penalized by gap cost*

## **Step 4.** community detection with `detect_communities`

`r Biocpkg("ClustIRR")` employs graph-based community detection (GCD)
algorithms, such as Louvain, Leiden or InfoMap, to identify
**communities** of nodes that have high density of edges among each
other, and low density of edges with nodes outside the community.

First, the similarity score between T-cell clones $i$ and $j$ is defined
as the average CDR3$\alpha$ and CDR3$\beta$ alignment scores:

```{=tex}
\begin{align}
\omega(i,j) = \dfrac{{\bar{\omega}}_\alpha + {\bar{\omega}}_\beta}{2}
\end{align}
```
where $\bar{\omega}_\alpha$ and $\bar{\omega}_\beta$ are the alignment
scores for the CDR3$\alpha$ and CDR3$\beta$, respectively. If a chain is
missing, its alignment score is set to 0.

The user has the following options:

-   `algorithm`: "leiden" (default) "louvain", or "infomap"
-   `resolution`: GCD resolution = 1 (default)
-   `iterations`: number of clustering iterations (default = 100) to
    ensure robust results
-   `weight`: "ncweight" (default) or "nweight"
-   `chains`: "CDR3a" or "CDR3b" or c("CDR3a", "CDR3b")

```{r}
gcd <- detect_communities(graph = cl$graph, 
                          algorithm = "leiden",
                          metric = "average",
                          resolution = 1,
                          iterations = 100,
                          weight = "ncweight",
                          chains = c("CDR3a", "CDR3b"))
```

### Inspecting the outputs of `detect_communities`

The function `detect_communities` generates a complex output. Lets
investigate its elements:

```{r}
names(gcd)
```

The main element is `community_occupancy_matrix`, which contains the
number of T-cells in each community (row) and repertoire (column). Here
we have three repertoires (three columns) and about 260 communities.
This matrix is the main input of the function `dco` (step 5.), to detect
differences in the community occupancy between repertoires.

```{r}
dim(gcd$community_occupancy_matrix)
```

```{r}
head(gcd$community_occupancy_matrix)
```

### Visualizing community abundance matrices with `get_honeycombs`

```{r, fig.width=5, fig.height=5}
honeycomb <- get_honeycombs(com = gcd$community_occupancy_matrix)
```

In the **honeycomb plots** shown below, several communities (black dots)
appear **far** from the diagonal. This indicates that these communities
contain more cells in repertoires b and c (y-axes in panels A and B)
compared to repertoire a (x-axes in panels A and B). Meanwhile, the same
points are generally close to the diagonal in panel C but remain
slightly more abundant in repertoire c (y-axis) compared to b (x-axis).

In step 5, `r Biocpkg("ClustIRR")` will provide a quantitative answer to
the question: *Which communities are differentially abundant between
pairs of repertoires?*

Importantly, the color of the hexagons encodes the density of
communities in the 2D scatterplots: **dark hexagons** indicate high a
frequency of communities, while **light hexagons** represent sparsely
populated regions.

```{r, fig.width=10, fig.height=2.5}
wrap_plots(honeycomb, nrow = 1)+
    plot_annotation(tag_levels = 'A')
```

Also see `community_summary`. In the data.frame `wide` we provide
community summaries in each community across all TCR repertoires,
including:

-   `clones_a`, `clone_b`, `clone_c`, `clones_n`: the frequency of
    clones in the community coming from repertoire `a`, `b`, `c` and in
    total (n)
-   `cells_a`, `cells_b`, `cells_c`, `cells_n`: the frequency of cell in
    the community coming from repertoires `a`, `b`, `c` and in total (n)
-   `w`: average inter-clone similarity
-   `ncweight_CDR3a/b`, `nweight_CDR3a/b`: raw and normalized similarity
    for CDR3$\alpha$ and CDR3$\beta$ sequences
-   `n_CDR3a`, `n_CDR3b`: number of edges between CDR3$\alpha$ and
    CDR3$\beta$ sequences

```{r}
kable(head(gcd$community_summary$wide), digits = 1, row.names = FALSE)
```

In the data.frame `tall` we provide community and repertoire summaries
in each row.

```{r}
kable(head(gcd$community_summary$tall), digits = 1, row.names = FALSE)
```

Node-specific (TCR clone-specific) summaries are provided in
`node_summary`

```{r}
kable(head(gcd$node_summary), digits = 1, row.names = FALSE)
```

### Special functions: decoding communities with `decode_community`

Community with ID=8 contains 42 TCR clones. These are connected based on
their CDR3$\beta$ sequences.

```{r}
ns_com_8 <- gcd$node_summary[gcd$node_summary$community == 8, ]

table(ns_com_8$sample)
```

We can extract it and visualize it using igraph:

```{r, fig.width=6, fig.height=4}
par(mai = c(0,0,0,0))
plot(subgraph(graph = gcd$graph, vids = which(V(gcd$graph)$community == 8)))
```

Furthermore, we can "decompose" this graph using `decode_community`
based on the attributes of the edges (`edge_filter`) and nodes
(`node_filter`).

```{r}
# we can pick from these edge attributes
edge_attr_names(graph = gcd$graph)
```

The following edge-filter will instruct ClustIRR to keep edges with with
edge attributes: nweight $>=$ 8 **AND** ncweight $>=$ 8

```{r}
edge_filter <- rbind(data.frame(name = "nweight", value = 8, operation = ">="),
                     data.frame(name = "ncweight", value = 8, operation = ">="))
```

```{r}
# we can pick from these node attributes
vertex_attr_names(graph = gcd$graph)
```

The following node-filter will instruct ClustIRR to retain edges between
nodes that have shared node attributed with respect to **ALL** of the
following node attributes:

```{r}
node_filter <- data.frame(name = c("TRBV", "TRAV"))
```

```{r}
c8 <- decode_communities(community_id = 8, 
                         graph = gcd$graph,
                         edge_filter = edge_filter,
                         node_filter = node_filter)
```

```{r, fig.width=6, fig.height=4}
par(mar = c(0, 0, 0, 0))
plot(c8$community, vertex.size = 10)
```

```{r}
kable(as_data_frame(x = c8$community, what = "vertices")
      [, c("name", "component_id", "CDR3b", "TRBV", 
           "TRBJ", "CDR3a", "TRAV", "TRAJ")],
      row.names = FALSE)
```

... or we can summarize the properties of each component in the next
table as rows with:

-   component id, community id (=8 in this example because this is what
    we selected)
-   mean edge weights (for all core and complete CDR3 pairs)
-   number of nodes, number of edges and expected number edges if the
    component were a clique
-   diameter of the component

```{r}
kable(c8$component_stats, row.names = FALSE)
```

## **Step 5.** differential community occupancy (DCO) with `dco`

Do we see **expanding** or **contracting** communities in a given
repertoire?

We employ a Bayesian model to quantify the relative abundance
(occupancy) of individual communities in each repertoire.

The model output for repertoire $a$ is the parameter vector
$\beta^a=\beta^a_1,\beta^a_2,\ldots,\beta^a_k$, which describes the
effect of repertoire $a$ on the relative occupancy in each community.

Given two repertoires, $a$ and $b$, we can quantify the differential
community occupancy (DCO):

```{=tex}
\begin{align}
\delta^{a-b}_i = \beta^a_i - \beta^b_i
\end{align}
```
```{r}
d <- dco(community_occupancy_matrix = gcd$community_occupancy_matrix,
         mcmc_control = list(mcmc_warmup = 300,
                             mcmc_iter = 600,
                             mcmc_chains = 2,
                             mcmc_cores = 1,
                             mcmc_algorithm = "NUTS",
                             adapt_delta = 0.9,
                             max_treedepth = 10))
```

## **Step 6.** Inspect results

### Visualizing the distribution of $\beta$ with `get_beta_violins`

```{r}
beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
                                 beta = d$posterior_summary$beta,
                                 ag_species = NULL,
                                 db = "vdjdb",
                                 db_dist = 0,
                                 chain = "both")
```

```{r, fig.width=5, fig.height=3}
beta_violins$violins
```

### Compare $\beta$s of clones specific for *CMV*, *EBV*, *flu* or *MLANA*?

```{r}
beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
                                 beta = d$posterior_summary$beta,
                                 ag_species = c("CMV", "EBV", "Influenza"),
                                 ag_genes = c("MLANA"),
                                 db = "vdjdb",
                                 db_dist = 0,
                                 chain = "both")
```

```{r, fig.width=9, fig.height=7}
patchwork::wrap_plots(beta_violins$violins, ncol = 2)
```

### Compare $\beta$ between repertoires with `get_beta_scatterplot`

```{r}
beta_scatterplots <- get_beta_scatterplot(node_summary = gcd$node_summary,
                                          beta = d$posterior_summary$beta,
                                          ag_species = c("CMV"),
                                          db = "vdjdb",
                                          db_dist = 0,
                                          chain = "both")
```

```{r, fig.width=14, fig.height=10}
patchwork::wrap_plots(beta_scatterplots$scatterplots[[1]], ncol = 3)
```

### Posterior predictive checks

Before we can start interpreting the model results, we have to make sure
that the model is valid. One standard approach is to check whether our
model can retrodict the observed data (community occupancy matrix) which
was used to fit model parameters.

General idea of posterior predictive checks:

1.  fit model based on data $y$
2.  simulate new data $\hat{y}$
3.  compare $y$ and $\hat{y}$

`r Biocpkg("ClustIRR")` provides $y$ and $\hat{y}$ of each repertoire,
which we can visualize with ggplot:

```{r, fig.width=6, fig.height=2.5}
ggplot(data = d$posterior_summary$y_hat)+
  facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
                col = "darkgray", width=0)+
  geom_point(aes(x = y_obs, y = mean), size = 0.8)+
  xlab(label = "observed y")+
  ylab(label = "predicted y (and 95% HDI)")
```

### Differential community abundance results $\rightarrow$ par. $\delta$

Given two repertoires, $a$ and $b$, we can quantify the differential
community occupancy (DCO):

```{=tex}
\begin{align}
\delta^{a-b}_i = \beta^a_i - \beta^b_i
\end{align}
```
Importantly, `r Biocpkg("ClustIRR")` always computes both contrasts ($a$
vs. $b$ and $b$ vs. $a$): $\delta^{a-b}_i$ and $\delta^{b-a}_i$.

```{r, fig.width=7, fig.height=6}
ggplot(data = d$posterior_summary$delta)+
    facet_wrap(facets = ~contrast, ncol = 2)+
    geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), 
                  col = "lightgray", width = 0)+
    geom_point(aes(x = community, y = mean), shape = 21, fill = "black", 
               stroke = 0.4, col = "white", size = 1.25)+
    theme(legend.position = "top")+
    ylab(label = expression(delta))+
    scale_x_continuous(expand = c(0,0))
```

Given two repertoires, $a$ and $b$, `r Biocpkg("ClustIRR")` also
quantifies absolute differences in community probabilities:

```{=tex}
\begin{align}
\epsilon^{a-b} = \mathrm{softmax}(\alpha + \beta^a) - 
\mathrm{softmax}(\alpha + \beta^b)
\end{align}
```
Again, both contrasts are computed: $\epsilon^{a-b}_i$ and
$\epsilon^{b-a}_i$.

```{r, fig.width=7, fig.height=6}
ggplot(data = d$posterior_summary$epsilon)+
    facet_wrap(facets = ~contrast, ncol = 2)+
    geom_errorbar(aes(x = community, y = mean, ymin = L95, ymax = H95), 
                  col = "lightgray", width = 0)+
    geom_point(aes(x = community, y = mean), shape = 21, fill = "black", 
               stroke = 0.4, col = "white", size = 1.25)+
    theme(legend.position = "top")+
    ylab(label = expression(epsilon))+
    scale_x_continuous(expand = c(0,0))
```

```{r, echo=FALSE, include=FALSE}
rm(a, b, cl_a, cl_b, cl_c, meta_a, meta_b, meta_c, d, e, g, gcd)
```

## Conclusion: you can also use **custom** community occupancy matrix for DCO!

The function `dco` takes as its main input a community occupancy matrix.
This enables users who are accustomed to using complementary algorithm
for detecting specificity groups, such as, GLIPH, TCRdist3, GIANA, and
iSMART, to skip steps 1-4 of the `r Biocpkg("ClustIRR")` workflow, and
to proceed with analysis for DCO.