---
title: "User Manual: IgGeneUsage"
author: "SK"
date: "June 25, 2019"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{User Manual: IgGeneUsage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = FALSE, 
                      warning = FALSE, 
                      message = FALSE)
```


```{r}
require(IgGeneUsage)
require(knitr)
require(ggplot2)
require(ggforce)
require(gridExtra)
require(ggrepel)
require(rstan)
require(reshape2)
rstan_options(auto_write = TRUE)
```


# Introduction
Decoding the properties of immune repertoires is key in understanding the 
response of adaptive immunity to challenges such as viral infection. One 
important property is biases in Ig gene usage between biological conditions 
(e.g. healthy vs turmor). Yet, most analyses for differential gene usage are
performed qualitatively, or with inadequate statistical methods. Here we 
introduce IgGeneUsage, a computational tool for the analysis of differential 
gene usage.


# Input
The input to IgGeneUsage is a data.frame object with usage frequencies for each 
gene of a repertoire that belongs to a particular biological condition. The
usage data.frame has the following 4 columns:

  1. sample_id: identifier of the repertoire (e.g. Patient-1)
  2. condition: identifier of the condition to which each repertoire 
  belongs (e.g. healthy or tumor)
  3. gene_name: specific gene name (e.g. IGHV1-10 or family TRVB1)
  4. gene_usage_count: numeric (count) of usage related to columns 1-3

The sum of all gene usage counts (column 4) for a given repertoire should be 
equal to the total gene usage in that repertoire. 


# Method
IgGeneUsage transforms the provided input in the following way. Given $R$ 
repertoires, each having $G$ genes, IgGeneUsage generates a gene usage matrix 
$Y^{R \times G}$. Row sums in $Y$ define the total usage in each repertoire 
($N$). The design variable $X$ is set to $X = 1$ for repertoires that belong to 
the first condition, and $X = -1$ otherwise.

For the analysis of DGU between two biological conditions, we designed the 
following Bayesian model ($M$) for zero-inflated beta-binomial regression. This
model can fit over-dispersed gene usage data. The immune repertoire data is also
not exhaustive, which leads to misdetection of genes that are systematically 
rearranged at low probability. The zero-inflated component of our model 
accounts for this:

\begin{align}
p(Y_{ij} \mid M) = \begin{cases} 
\kappa + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{i}, \theta_{ij}, \phi 
\right), & \text{if $Y_{ij}$ = 0} \\
(1 - \kappa) \operatorname{BB}\left(Y_{ij} \mid N_{i}, \theta_{ij}, \phi 
\right), & \text{if $Y_{ij}$ > 0}
\end{cases}\\
\theta_{ij} = \operatorname{logit^{-1}}\left(\alpha_{j}+\beta_{ij}X_{i}\right)\\
\beta_{ij}\sim\operatorname{Normal}\left(\gamma_{j},\gamma_{\sigma} \right)\\
\gamma_{j}\sim\operatorname{Normal}\left(\hat{\gamma},\hat{\gamma}_{\sigma}
\right) \\
\alpha_{j}\sim\operatorname{Normal}\left(\hat{\alpha},\hat{\alpha}_{\sigma}
\right) \\
\hat{\gamma} \sim \operatorname{Normal}\left(0, 5\right) \\
\hat{\alpha} \sim \operatorname{Normal}\left(0, 10\right) \\
\gamma_{\sigma}, \hat{\gamma}_{\sigma}, \hat{\alpha}_{\sigma} \sim 
\operatorname{Cauchy^{+}}\left(0, 1\right) \\
\phi \sim \operatorname{Exponential}\left(\tau\right) \\
\tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\
\kappa \sim \operatorname{Beta}\left(1, 3\right)
\end{align}

Model $M$ legend:

  * $i$ and $j$: index of different repertoires and genes, respectively 
  * $\kappa$: zero-inflation probability
  * $\theta$: probability of gene usage
  * $\phi$: dispersion
  * $\alpha$: intercept/baseline gene usage
  * $\beta$: slope/within-repertoire DGU coefficient
  * $\gamma$, $\gamma_{\sigma}$: slope/gene-specific DGU coefficient; 
  standard deviation
  * $\hat{\gamma}$, $\hat{\gamma}_{\sigma}$: mean and standard deviation of 
  the population of gene-specific DGU coefficients
  * $\hat{\alpha}$, $\hat{\alpha}_{\sigma}$: mean and standard deviation of 
  the population of gene-specific baseline usages
  * $\operatorname{BB}$: beta-binomial probability mass function (pmf)
  * $\operatorname{Normal}$: normal probability density function (pdf)
  * $\operatorname{Cauchy^{+}}$: half-Cauchy pdf
  * $\operatorname{Exponential}$: exponential pdf
  * $\operatorname{Gamma}$: gamma pdf
  * $\operatorname{Beta}$: beta pdf
  * $\operatorname{logit^{-1}}$: inverse logistic function

In the output of IgGeneUsage, we report the mean effect size ($\gamma$) and its 
95% highest density interval (HDI). Genes with $\gamma \neq 0$ (e.g. if 95% HDI 
of $\gamma$ excludes 0) are most likely to experience differential usage. 
Additionally, we report the probability of differential gene usage ($\pi$):
\begin{align}
\pi = 2 \cdot \max\left(\int_{\gamma = -\infty}^{0} p(\gamma)\mathrm{d}\gamma, 
\int_{\gamma = 0}^{\infty} p(\gamma)\mathrm{d}\gamma\right) - 1
\end{align}
with $\pi = 1$ for genes with strong differential usage, and $\pi = 0$ for 
genes with negligible differential gene usage. Both metrics are computed based
on the posterior distribution of $\gamma$, and are thus related. We find $\pi$ 
slightly easier to interpret.


# Case Study I
IgGeneUsage provides built-in datasets from studies that evaluate biases in Ig
gene usage. The dataset IGHV_HCV contains publicly available data of human 
immunoglobulin heavy chain VDJ rearrangements from a study that evaluates the 
effect of HCV infection on the human BCR repertoire[^1]. The dataset consists 
of a population of class-switched memory (CSM) B cells of 22 HCV-infected 
patients (HCV+) and 7 healthy donors (HD). Gene usage data is available for 69 
IGHV gene segments.

[^1]: Tucci, Felicia A., et al. "Biased IGH VDJ gene repertoire and clonal 
expansions in B cells of chronically hepatitis C virus–infected individuals." 
Blood 131.5 (2018): 546-557.]


## Input data
The data is already formatted such that it can directly be passed as input to
IgGeneUsage. The following steps allow you to load the data, and to inspect 
its column names and table entries.

```{r}
data("IGHV_HCV", package = "IgGeneUsage")
```


```{r}
kable(x = head(IGHV_HCV), row.names = FALSE)
```


## Data visualization
Lets visualize the gene usage with ggplot2.

```{r, fig.height = 5, fig.width = 8}
# we can compute the total number of rearrangements per sample
total.usage <- aggregate(gene_usage_count~sample_id+condition, 
                         FUN = sum, data = IGHV_HCV)
total.usage$total <- total.usage$gene_usage_count
total.usage$gene_usage_count <- NULL

# merge it with the original data
viz <- merge(x = IGHV_HCV, y = total.usage, 
             by = c("sample_id", "condition"), 
             all.x = TRUE)

# compute %
viz$gene_usage_pct <- viz$gene_usage_count/viz$total*100

# For this example lets only consider the top 45 genes
# Hint: In real analyses you should not filter any gene
top <- aggregate(gene_usage_pct~gene_name, data = viz, FUN = mean)
top <- top[order(top$gene_usage_pct, decreasing = TRUE), ]
IGHV_HCV <- IGHV_HCV[IGHV_HCV$gene_name %in% top$gene_name[seq_len(45)], ]
viz <- viz[viz$gene_name %in% top$gene_name[seq_len(45)], ]

# visualize
ggplot(data = viz)+
  geom_point(aes(x = gene_name, y = gene_usage_pct, 
                 fill = condition, shape = condition),
             position = position_dodge(width = .7), stroke = 0)+
  theme_bw(base_size = 11)+
  ylab(label = "Usage [%]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
  scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
  scale_shape_manual(name = "Condition", values = c(21, 22))
```


## Analysis of differential gene usage with IgGeneUsage
This is the main method of IgGeneUsage. The primary input is the dataset 
IGHV_HCV. Other inputs allow you to configure specific settings of the Markov 
Chain Monte Carlo (MCMC) simulation.

In this example we analyze IGHV_HCV with 2 MCMC chains (750 iterations each, 
including 250 warm-ups) in parallel using 2 CPU cores. We compute for each 
parameter of our model its 95% highest density interval (HDIs).

Important remark: you should run your analysis using the default argument 
values of the function DGU. These values have been carefully chosen to fit most 
analyses for differential gene usage of immune repertoires. If any messages or 
warnings are reported concerning the MCMC sampling, please consult the Stan 
manual[^2] and adjust the MCMC arguments accordingly. If the warnings persist, 
please submit an issue with a reproducible script at the Bioconductor support 
site or on Github[^3].

```{r}
M <- DGU(usage.data = IGHV_HCV, # input data
         mcmc.warmup = 750, # how many MCMC warm-ups per chain (default: 500)
         mcmc.steps = 2000, # how many MCMC steps per chain (default: 1,500)
         mcmc.chains = 2, # how many MCMC chain to run (default: 4)
         mcmc.cores = 1, # how many PC cores to use? (e.g. parallel chains)
         hdi.level = 0.95, # highest density interval level (de fault: 0.95)
         adapt.delta = 0.95, # MCMC target acceptance rate (default: 0.95)
         max.treedepth = 10) # tree depth evaluated at each step (default: 12)
```


## Output
The following objects are provided as part of the output of DGU:
  
  * glm.summary (main results of IgGeneUsage): quantitative summary of 
  differential gene usage  (see section 'Results')
  * test.summary: alternative quantitative summary of differential gene usage 
  estimated with two frequentist methods (the Welch's t-test and the Wilcoxon 
  signed-rank test)
  * glm: rstan ('stanfit') object of the fitted model $rightarrow$ you can use 
  this for further checks of your model (see section 'Model checking')
  * ppc.data: data on posterior predictive checks (see section 'Model checking')

```{r}
summary(M)
```


## Model checking
Do not blindly trust your model $\rightarrow$ check it extensively. You can
use the object glm for these checks.

  * Checklist of successful MCMC sampling[^2]:
      * no divergences, no excessive warnings from rstan, Rhat < 1.05, 
      high Neff
  * Checklist for valid model:
      * observed data can be replicated with the fitted model $\rightarrow$ 
      posterior predictive checks
      * leave-one-out analysis

[^2]: https://mc-stan.org/misc/warnings.html
[^3]: https://github.com/snaketron/IgGeneUsage/issues


### MCMC sampling

  * divergences, tree-depth, energy
  
```{r}
rstan::check_hmc_diagnostics(M$glm)
```

  * Rhat and Neff

```{r, fig.height = 3, fig.width = 6}
gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm),
                        rstan::stan_ess(object = M$glm),
                        nrow = 1)
```


### Posterior predictive checks: repertoire-specific
IgGeneUsage has built-in checks for repertoire-specific posterior prediction 
of gene usage. It uses the fitted model to predict the usage of each gene in 
a given repertoire (y-axis). Error bars show 95% HDI of mean posterior 
prediction. The predictions can be compared with the observed data (x-axis). 
For points near the diagnoal $\rightarrow$ accurate prediction.

The following figure shows that our model can reproduce the fitted data 
$\rightarrow$ this is an indicator of a valid model.

```{r, fig.height = 10, fig.width = 8}
ggplot(data = M$ppc.data$ppc.repertoire)+
  facet_wrap(facets = ~sample_name, ncol = 5)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_x_log10(breaks = c(1, 10, 100, 1000), 
                labels = expression(10^0, 10^1, 10^2, 10^3))+
  scale_y_log10(breaks = c(1, 10, 100, 1000), 
                labels = expression(10^0, 10^1, 10^2, 10^3))+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "Predicted usage [counts]")+
  annotation_logticks(base = 10, sides = "lb")
```


### Posterior predictive checks: overall
IgGeneUsage has built-in checks for gene-specific posterior prediction of gene 
usage within each biological condition. We show the predictions (y-axis) of 
the model, and compare them against the observed mean usage (x-axis). If the 
points are near the diagnoal $\rightarrow$ accurate prediction. Errors show 
95% HDI of mean posterior prediction.

```{r, fig.height = 4, fig.width = 6}
ggplot(data = M$ppc.data$ppc.gene)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100), col = "darkgray")+
  geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [%]")+
  ylab(label = "Predicted usage [%]")+
  scale_x_log10()+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "lb")
```


## Results
Next we present the main results of IgGeneUsage contained in the data.frame 
glm.summary. Each row of glm.summary summarizes the degree of differential 
gene usage in a specific immune gene using two metrics (es and pmax).

Legend:
  
  * es: effect size ($\gamma$ from model $M$) on differential gene usage 
  (mean, median standard error (se), standard deviation (sd), L (low boundary 
  of 95% HDI), H (high boundary of 95% HDI))
  * contrast: the direction of the effect (e.g. tumor - healthy)
  * pmax: probability of differential gene usage ($\pi$ from model $M$)

```{r}
kable(x = head(M$glm.summary), row.names = FALSE, digits = 3)
```

The effect size and $\pi$ are related. Lets visualize them for each gene (shown 
as a point). Names are shown of genes with $\pi \geq 0.9$. Dashed horizontal 
line represents null-effect ($\gamma = 0$). Notice that the gene with $\pi 
\approx 1$ also has an effect size whose 95% HDI (error bar) does not overlap 
the null-effect. The genes with high degree of differential usage are easy to 
detect with this figure.

```{r, fig.height = 4, fig.width = 6}
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)

stats <- merge(x = stats, y = M$test.summary, by = "gene_name")

ggplot(data = stats)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean))+
  geom_text_repel(data = stats[stats$pmax >= 0.9, ],
                  aes(x = pmax, y = es_mean, label = gene_fac))+
  theme_bw(base_size = 11)+
  xlab(label = expression(pi))
```


### Promising hits
Lets visualize the observed data of the genes with high probability of 
differential gene usage ($\pi \geq 0.9$). Here we show the gene usage in %.

```{r, fig.height = 3, fig.width = 6}
promising.genes <- stats$gene_name[stats$pmax >= 0.9]

ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]

ggplot()+
  geom_errorbar(data = ppc.gene, 
                aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, col = condition),
                position = position_dodge(width = .8), width = 0.75)+
  geom_point(data = viz[viz$gene_name %in% promising.genes, ],
             aes(x = gene_name, y = gene_usage_pct, col = condition),
             shape = 21, size = 1.5, fill = "black",
             position = position_jitterdodge(jitter.width = 0.15, 
                                             jitter.height = 0, 
                                             dodge.width = 0.8))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "Usage [%]")+
  xlab(label = '')
```


### Promising hits [count]
Lets also visualize the gene usage frequencies. Point size represents 
total usage in repertoire.

```{r, fig.height = 8, fig.width = 6}
promising.genes <- stats$gene_name[stats$pmax >= 0.9]

ggplot(data = viz[viz$gene_name %in% promising.genes, ])+
  facet_wrap(facets = ~gene_name, ncol = 1, scales = "free_y")+
  geom_point(aes(x = sample_id, y = gene_usage_count, fill = condition, 
                 size = total/10^3), shape = 21)+
  theme_bw(base_size = 11)+
  ylab(label = "Usage [count]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
  scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
  scale_size_continuous(name = expression("N ("*10^3*")"), range = c(1, 5))+
  ylab(label = "Usage [count]")+
  xlab(label = '')+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```


## Comparison with the Welch's t-test (T-test)
Despite the fact that the data is not normaly distributed, we performed the
analysis of differential gene usage with the T-test. Prior to using the T-test, 
we must convert the usage frequencies into proportions. This is automatically
done by IgGeneUsage (object test.summary).

Next, we compare the probabilities of differential gene usage ($\pi$) with the 
FDR corrected P-values (-log10 scale) from the T-test. Dashed lines show 
significance levels of 0.05 and 0.01. We observe few disagreements between the 
two tests. Lets inspect them in more detail.

```{r, fig.height = 5, fig.width = 8}
ggplot()+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(data = stats, col = "red", size = 2,
             aes(x = pmax, y = -log10(t.test.fdr.pvalue)))+
  geom_text_repel(data = stats[stats$pmax >= 0.5, ], 
                  aes(x = pmax, y = -log10(t.test.fdr.pvalue), 
                      label = gene_name), size = 4, 
                  min.segment.length = 0.1)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from t-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```


### Outliers: IGHV1-58 and IGHV3-72
The two genes IGHV1-58 and IGHV3-72 are prioritized by the T-test. Their $\pi$ 
estimates are however far from 1. Violation of T-test's assumptions is the most
probable cause for the disagreement. Discarding information about the sample 
size, and thus about uncertainty could be an alternative explanation. Lets
visualize the data for IGHV1-58 and IGHV3-72.

```{r, fig.height = 3, fig.width = 6}
promising.genes <- c("IGHV1-58", "IGHV3-72")

ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]

ggplot()+
  geom_errorbar(data = ppc.gene, 
                aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, col = condition),
                position = position_dodge(width = .8), width = 0.75)+
  geom_point(data = viz[viz$gene_name %in% promising.genes, ],
             aes(x = gene_name, y = gene_usage_pct, col = condition),
             shape = 21, size = 1.5, fill = "black",
             position = position_jitterdodge(jitter.width = 0.15, 
                                             jitter.height = 0, 
                                             dodge.width = 0.8))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "Gene usage [%]")+
  xlab(label = '')
```


### Outliers: IGHV1-58 and IGHV3-72 [counts]

```{r, fig.height = 5, fig.width = 6}
promising.genes <- c("IGHV1-58", "IGHV3-72")

ggplot(data = viz[viz$gene_name %in% promising.genes, ])+
  facet_wrap(facets = ~gene_name, ncol = 1, scales = "free_y")+
  geom_point(aes(x = sample_id, y = gene_usage_count, fill = condition, 
                 size = total/10^3), shape = 21)+
  theme_bw(base_size = 11)+
  ylab(label = "Usage [count]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
  scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
  scale_size_continuous(name = expression("N ("*10^3*")"), range = c(1, 5))+
  ylab(label = "Usage [count]")+
  xlab(label = '')+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```


## Comparison with the Wilcoxon signed-rank test (U-test)
The nonparametric U-test can also be used for the analysis of differential gene
usage. It assumes data with equal shape in both groups (also not met by our 
data). Prior to using the U-test, we must again convert the usage frequencies
into proportions. This is automatically done by IgGeneUsage.

Lets also compare $\pi$ with the FDR corrected P-values (-log10 scale) from 
the U-test. Dashed lines show significance levels of 0.05 and 0.01. The U-test 
finds no evidence of differential gene usage.

```{r, fig.height = 6, fig.width = 8}
ggplot()+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(data = stats, col = "red", size = 2,
             aes(x = pmax, y = -log10(u.test.fdr.pvalue)))+
  geom_text_repel(data = stats[stats$pmax >= 0.5, ], 
                  aes(x = pmax, y = -log10(u.test.fdr.pvalue), 
                      label = gene_name), size = 4, 
                  min.segment.length = 0.1)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from U-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```


# Case Study II
For our second case study we use a dataset from a study evaluating 
vaccine-induced changes in B-cell populations [^5]. The data is publicly 
provided via the R-package alakazam (version 0.2.11). The IGHV family usage is
reported in four B-cell populations (repertoires IgM, IgD, IgG and IgA) across 
two timepoints (conditions = -1 hour vs. +7 days). In this case study we 
investigate the overal effect of the vaccine on the IGHV family usage.

[^5]: Laserson U and Vigneault F, et al. High-resolution antibody dynamics of
vaccine-induced immune responses. Proc Natl Acad Sci USA. 2014 111:4928-33.

## Input data
The data is already formatted such that it can directly be passed as input to
IgGeneUsage. Lets load the data, and to inspect its content.

```{r}
data(Ig, package = "IgGeneUsage")
```

```{r}
kable(x = head(Ig), row.names = FALSE)
```


## Data visualization
Lets look at the gene usage data with ggplot2.

```{r, fig.height = 4, fig.width = 6}
# we can compute the total number of rearrangements per sample
total.usage <- aggregate(gene_usage_count~sample_id+condition,
                         FUN = sum, data = Ig)
total.usage$total <- total.usage$gene_usage_count
total.usage$gene_usage_count <- NULL

# merge it with the original data
viz <- merge(x = Ig, y = total.usage,
             by = c("sample_id", "condition"),
             all.x = TRUE)

# compute %
viz$gene_usage_pct <- viz$gene_usage_count/viz$total*100

# visualize
ggplot(data = viz)+
  geom_point(aes(x = gene_name, y = gene_usage_pct, fill = condition, 
                 shape = condition), stroke = 0, size = 3,
             position = position_jitterdodge(jitter.width = 0.25, 
                                             dodge.width = 0.75))+
  theme_bw(base_size = 11)+
  ylab(label = "Usage [%]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
  scale_shape_manual(name = "Condition", values = c(21, 22))
```


## Analysis with IgGeneUsage
Lets analyze the differential gene usage in the data Ig. We use the 
default values of most DGU arguments.

```{r}
M <- DGU(usage.data = Ig,
         mcmc.warmup = 500,
         mcmc.steps = 1500,
         mcmc.chains = 2,
         mcmc.cores = 1,
         hdi.level = 0.95,
         adapt.delta = 0.95,
         max.treedepth = 12)
```


## Model checking

### Checking MCMC sampling

  * divergences, tree-depth, energy

```{r}
rstan::check_hmc_diagnostics(M$glm)
```

  * Rhat and Neff

```{r, fig.height = 3, fig.width = 6}
gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm),
                        rstan::stan_ess(object = M$glm),
                        nrow = 1)
```


### Posterior predictive checks: repertoire-specific

```{r, fig.height = 5, fig.width = 8}
ggplot(data = M$ppc$ppc.repertoire)+
  facet_wrap(facets = ~sample_name, ncol = 4)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_x_log10(breaks = c(1, 10, 100, 1000), 
                labels = expression(10^0, 10^1, 10^2, 10^3))+
  scale_y_log10(breaks = c(1, 10, 100, 1000), 
                labels = expression(10^0, 10^1, 10^2, 10^3))+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "Predicted usage [counts]")+
  annotation_logticks(base = 10, sides = "bl")
```


### Posterior predictive checks: overall

```{r, fig.height = 4, fig.width = 6}
ggplot(data = M$ppc.data$ppc.gene)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100), col = "darkgray")+
  geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [%]")+
  ylab(label = "Predicted usage [%]")+
  scale_x_log10()+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "bl")
```


## Results
Main results:

```{r}
kable(x = M$glm.summary, row.names = FALSE, digits = 3)
```

The effect size and $\pi$ are related. Lets visualize them for each gene (shown 
as a point). There are no genes with $\pi \approx 1$ $\rightarrow$ no sufficiently
strong evidence of differential gene usage based on IgGeneUsage. 

```{r, fig.height = 4, fig.width = 6}
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)

stats <- merge(x= stats, y = M$test.summary, by = "gene_name")

ggplot(data = stats)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean))+
  geom_text_repel(data = stats, aes(x = pmax, y = es_mean, label = gene_fac))+
  theme_bw(base_size = 11)+
  xlab(label = expression(pi))+
  xlim(0, 1)
```


## Comparison with the Welch's t-test (T-test)

```{r, fig.height = 6, fig.width = 8}
ggplot()+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(data = stats, col = "red", size = 2,
             aes(x = pmax, y = -log10(t.test.fdr.pvalue)))+
  geom_text_repel(data = stats, 
                  aes(x = pmax, y = -log10(t.test.fdr.pvalue), 
                      label = gene_name), size = 4, 
                  min.segment.length = 0.1)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from t-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```


### Outlier: IGHV5

```{r, fig.height = 5, fig.width = 8}
promising.genes <- unique(M$usage.data$gene_names)

ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]

ggplot()+
  geom_errorbar(data = ppc.gene,
                aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, col = condition),
                position = position_dodge(width = .8), width = 0.75)+
  geom_point(data = viz[viz$gene_name %in% promising.genes, ],
             aes(x = gene_name, y = gene_usage_pct, col = condition),
             shape = 21, size = 1.5, fill = "black",
             position = position_jitterdodge(jitter.width = 0.15, 
                                             jitter.height = 0, 
                                             dodge.width = 0.8))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")
```


### Outliers IGHV5 (original data)

```{r, fig.height = 4, fig.width = 7}
promising.genes <- "IGHV5"

ggplot(data = viz[viz$gene_name %in% promising.genes, ])+
  facet_wrap(facets = ~gene_name, ncol = 1, scales = "free_y")+
  geom_point(aes(x = sample_id, y = gene_usage_count, fill = condition, 
                 size = total/10^3), shape = 21)+
  theme_bw(base_size = 11)+
  ylab(label = "Usage [count]")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))+
  scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
  scale_size_continuous(name = expression("N ("*10^3*")"), range = c(1, 5))+
  ylab(label = "Usage [count]")+
  xlab(label = '')+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```


## Comparison with the Wilcoxon signed-rank test (U-test)
No evidence of differential usage (at FDR level 0.05)

```{r, fig.height = 4.5, fig.width = 8}
ggplot()+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(data = stats, col = "red", size = 2,
             aes(x = pmax, y = -log10(u.test.fdr.pvalue)))+
  geom_text_repel(data = stats, 
                  aes(x = pmax, y = -log10(u.test.fdr.pvalue), 
                      label = gene_name), size = 4, 
                  min.segment.length = 0.1)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from U-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```


# Case Study III
The main purpose of using IgGeneUsage is to discover differential gene usage. 
However, the tool can be be used for all differential count analyses. In the 
third case study we explain how to detect differential usage of net CDR3 
sequence charge between repertoires belonging to two biological conditions. For 
this purpose we use CDR3 sequence data of human T-cells (TRB-chain) downloaded 
from VDJdb [^4]. 

Each CDR3 sequence is annotated to a specific viral epitope. Here we focus 
on the two viruses InfluenzaA and CMV which represent the two biological 
conditions. We consider the different data sources (publications) as different 
repertoires. To compute the net sequence charge, we consider the amino acids 
K, R and H as +1 charged, while D and E as -1 charged. Thus, we computed the 
net charge of a CDR3 sequence by adding up the individual residue charges. 

[^4]: https://vdjdb.cdr3.net/


## Input data
The data is already formatted such that it can directly be passed as input to
IgGeneUsage. Lets load the data, and to inspect its content.

```{r}
data(CDR3_Epitopes, package = "IgGeneUsage")
```

```{r}
kable(x = head(CDR3_Epitopes), row.names = FALSE)
```


## Data visualization
Lets look at the input data with ggplot2.

```{r, fig.height = 4, fig.width = 7}
# we can compute the total number of rearrangements per sample
total.usage <- aggregate(gene_usage_count~sample_id+condition,
                         FUN = sum, data = CDR3_Epitopes)
total.usage$total <- total.usage$gene_usage_count
total.usage$gene_usage_count <- NULL

# merge it with the original data
viz <- merge(x = CDR3_Epitopes, y = total.usage,
             by = c("sample_id", "condition"),
             all.x = TRUE)

# compute %
viz$gene_usage_pct <- viz$gene_usage_count/viz$total*100

# visualize
ggplot(data = viz)+
  geom_point(aes(x = as.numeric(gene_name), 
                 y = gene_usage_pct, fill = condition, 
                 shape = condition), stroke = 0, size = 2, alpha = 0.5,
             position = position_jitterdodge(jitter.width = 0.15, 
                                             dodge.width = 0.5))+
  theme_bw(base_size = 11)+
  ylab(label = "Usage [%]")+
  xlab(label = 'Net Charge')+
  theme(legend.position = "top")+
  scale_fill_manual(name = "Condition", values = c("orange", "#6666ff"))+
  scale_shape_manual(name = "Condition", values = c(21, 22))+
  scale_x_continuous(breaks = -7:7, labels = -7:7)
```


## Analysis with IgGeneUsage
Here we study differential gene usage in the data CDR3_Epitopes. We use the 
default values of most DGU arguments.

```{r}
M <- DGU(usage.data = CDR3_Epitopes, 
         mcmc.chains = 2, 
         mcmc.cores = 1)
# default DGU parameters:
# mcmc.warmup = 500
# mcmc.steps = 1,500
# mcmc.chains = 2 
# mcmc.cores = 1
# hdi.level = 0.95
# adapt.delta = 0.95
# max.treedepth = 12
```


## Model checking

### MCMC sampling

  * divergences, tree-depth, energy

```{r}
rstan::check_hmc_diagnostics(M$glm)
```

  * Rhat and Neff

```{r, fig.height = 3, fig.width = 6}
grid.arrange(rstan::stan_rhat(object = M$glm),
             rstan::stan_ess(object = M$glm),
             nrow = 1)
```


### Posterior predictive checks: repertoire-specific

```{r, fig.height = 8, fig.width = 8}
ggplot(data = M$ppc.data$ppc.repertoire)+
  facet_wrap(facets = ~sample_name, ncol = 4)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count, fill = condition),
             shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "Predicted usage [counts]")+
  scale_x_log10()+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "bl")
```


### Posterior predictive checks: overall differential usage

```{r, fig.height = 4, fig.width = 6}
ggplot(data = M$ppc.data$ppc.gene)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100), col = "darkgray")+
  geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [%]")+
  ylab(label = "Predicted usage [%]")+
  scale_x_log10()+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "bl")
```


## Results
Main results:

```{r}
kable(x = head(M$glm.summary), row.names = FALSE, digits = 3)
```

The effect size and $\pi$ are related. Lets visualize them for each gene (shown 
as a point). There are couple of net-charges groups with $\pi \approx 1$.

```{r, fig.height = 4, fig.width = 6}
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)

stats <- merge(x = stats, y = M$test.summary, by = "gene_name")

ggplot(data = stats)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean))+
  geom_text_repel(data = stats,
                  aes(x = pmax, y = es_mean, label = gene_fac))+
  theme_bw(base_size = 11)+
  xlab(label = expression(pi))+
  scale_x_continuous(limits = c(0,  1))
```


## Comparison with the Welch's t-test

```{r, fig.height = 4, fig.width = 6}
ggplot(data = stats)+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(aes(x = pmax, y = -log10(t.test.fdr.pvalue)), 
             size = 1.5, col = "red")+
  geom_text_repel(aes(x = pmax, y = -log10(t.test.fdr.pvalue), 
                label = gene_name), size = 5)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from t-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```


## Comparison with the Wilcoxon signed-rank test (U-test)

```{r, fig.height = 4, fig.width = 6}
ggplot(data = stats)+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(aes(x = pmax, y = -log10(u.test.fdr.pvalue)), 
             size = 1.5, col = "red")+
  geom_text_repel(aes(x = pmax, y = -log10(u.test.fdr.pvalue), 
                label = gene_name), size = 5)+
  xlim(0, 1)+
  ylab(label = "-log10(p-value) from U-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```


### Net charge usage [in %] groups and posterior means (and 95% HDI of means) 

```{r, fig.height = 5, fig.width = 8}
promising.genes <- M$usage.data$gene_names

ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]
ppc.gene$gene_name <- factor(x = ppc.gene$gene_name, levels = -5:6)

ggplot()+
  geom_errorbar(data = ppc.gene,
                aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, col = condition),
                position = position_dodge(width = .8), width = 0.75)+
  geom_point(data = viz[viz$gene_name %in% promising.genes, ],
             aes(x = gene_name, y = gene_usage_pct, col = condition),
             shape = 21, size = 1.5, fill = "black",
             position = position_jitterdodge(jitter.width = 0.2, 
                                             jitter.height = 0, 
                                             dodge.width = 0.75))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "Gene usage [%]")+
  xlab(label = '')
```