---
title: "animalcules"
author:
- name: Yue Zhao
  affiliation:
  - &1 Boston University School of Medicine, Boston, MA
  - &2 Bioinformatics Program, Boston University, Boston, MA
- name: Anthony Federico
  affiliation:
  - *1
  - *2
- name: Evan Johnson
  affiliation:
  - *1
  - *2
date: '`r format(Sys.Date(), "%B %e, %Y")`'
package: animalcules
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{animalcules}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options:
    chunk_output_type: console
---

```{r, include=FALSE, eval=FALSE}
knitr::opts_chunk$set(comment = "#", message = FALSE)

if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

library(devtools)

devtools::load_all(".")
library(SummarizedExperiment)
```

# Citation

**Note:** if you use animalcules in published research, please cite:

> Zhao, Y., Federico, A., Faits, T. et al. (2021)
> animalcules: interactive microbiome analytics and visualization in R
> *Microbiome*, 9(1), 1-16.
> [10.1186/s40168-021-01013-0](https://doi.org/10.1186/s40168-021-01013-0)


# Introduction

animalcules is an R package for utilizing up-to-date data analytics, 
visualization methods, and machine learning models to provide users 
an easy-to-use interactive microbiome analysis framework. It can be 
used as a standalone software package or users can explore their data 
with the accompanying interactive R Shiny application. Traditional 
microbiome analysis such as alpha/beta diversity and differential 
abundance analysis are enhanced, while new methods like biomarker 
identification are introduced by animalcules. Powerful interactive 
and dynamic figures generated by animalcules enable users to understand 
their data better and discover new insights. 

# Installation

Install the development version of the package from Bioconductor.
```{r get_package, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("wejlab/animalcules")
```

Or install the development version of the package from Github.
```{r, eval=FALSE}
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("wejlab/animalcules")
```

# Load Packages

Load the packages needed into R session.
```{r load}
library(animalcules)
library(SummarizedExperiment)
```

# Run Shiny App

This command is to start the animalcules shiny app. 
For shiny app tutorials, please go to 
[our website](https://wejlab.github.io/animalcules-docs/index.html),
and click the tab "Interactive Shiny Analysis".

The rest of the vignette is for the command line version of animalcules,
which could also be found in
[our website](https://wejlab.github.io/animalcules-docs/index.html),
and click the tab "Command Line Analysis".

```{r, eval=FALSE}
run_animalcules()
```


# Load Toy Dataset

This toy dataset contains 50 simulated samples. Throughout the vignette, 
we will use the data structure MultiAssayExperiment (MAE) as the input
to all functions.

```{r}
data_dir <- system.file("extdata/TB_example_dataset.rds", 
    package = "animalcules")
MAE <- readRDS(data_dir)
```

For users who would like to use their own dataset, please follow the
following steps:

* Use the shiny app to upload your dataset. [check shiny tutorial]
(https://wejlab.github.io/animalcules-docs/articles/rmd/upload.html)
* Go to tab 2 "Data Summary and Filtering", and click the button 
"Download Animalcules File". [check shiny tutorial]
(https://wejlab.github.io/animalcules-docs/articles/rmd/summary.html)
* Load the downloaded .rds animalcules file and rename it as MAE in R.

```{r, eval=FALSE}
data_dir <- "PATH_TO_THE_ANIMALCULES_FILE"
MAE <- readRDS(data_dir)
```

# Summary and Categorize

One of the first steps in the data analysis process involves summarizing 
the data and looking for outliers or other obvious data issues that may
cause issue with downstream analysis.

## Summary Plot (Pie Chart or Box plot)

One type of summarization plot returns either a box plot or pie chart for 
continous or categorical data respectively.

```{r}
p <- animalcules::filter_summary_pie_box(MAE,
    samples_discard = c("SRR1204622"),
    filter_type = "By Metadata",
    sample_condition = "age_s"
)
p
```

## Summary Plot (Density Plot or Bar Plot)

One type of summarization plot returns either a density plot or bar plot
for continous or categorical data respectively.

```{r}
p <- filter_summary_bar_density(MAE,
    samples_discard = c("SRR1204622"),
    filter_type = "By Metadata",
    sample_condition = "sex_s"
)
p
```

## Categorize

It is often necessary to bin continuous data into categories when 
performing analyses that require categorical input. To help ease 
this process, users can automatically categorize categorical data 
and provide custom bin breaks and labels in doing so.

```{r}
microbe <- MAE[["MicrobeGenetics"]]
samples <- as.data.frame(colData(microbe))
result <- filter_categorize(samples,
    sample_condition = "age_s",
    new_label = "AGE_GROUP",
    bin_breaks = c(0, 30, 40, 100),
    bin_labels = c("a", "b", "c")
)
head(result$sam_table)
result$plot.unbinned
result$plot.binned
```

# Visualization

A typical analysis involves visualization of microbe abundances across 
samples or groups of samples. Animalcules implements three common types 
of visualization plots including stacked bar plots, heat map, and box
plots generated with Plotly. 

## Relative Abundance Stacked Bar Plot

The stacked bar plots are used to visualize the relative abundance of 
microbes at a given taxonomical level in each sample represented 
as a single bar.

```{r}
p <- relabu_barplot(MAE,
    tax_level = "genus",
    sort_by = "conditions",
    sample_conditions = c("Disease"),
    show_legend = TRUE
)
p
```

## Relative Abundance Heatmap

The heatmap represents a sample by organisms matrix that can be visualized 
at different taxonomic levels.

```{r}
p <- relabu_heatmap(MAE,
    tax_level = "genus",
    sort_by = "conditions",
    sample_conditions = c("sex_s", "age_s")
)
p
```

## Relative Abundance Boxplot

The boxplot visualization allows users to compare the abundance of one or 
more organisms between categorical attributes.

```{r}
p <- relabu_boxplot(MAE,
    tax_level = "genus",
    organisms = c("Streptococcus", "Staphylococcus"),
    condition = "sex_s",
    datatype = "logcpm"
)
p
```

# Diversity

## Alpha Diversity Boxplot

Alpha diversity, which describes the richness and evenness of sample 
microbial community, is a vital indicator in the microbiome analysis. 
animalcules provides the interactive boxplot comparison of alpha 
diversity between selected groups of samples. Both taxonomy levels 
and alpha diversity metrics (Shannon, Gini Simpson, Inverse Simpson) 
can be changed. Users can also conduct alpha diversity statistical 
tests including Wilcoxon rank sum test, T test and Kruskal-Wallis test.  

Plot the alpha diversity boxplot between the levels in selected condition.

```{r}
alpha_div_boxplot(
    MAE = MAE,
    tax_level = "genus",
    condition = "Disease",
    alpha_metric = "shannon"
)
```

## Alpha Diversity Statistical Test

Conduct statistical test on the alpha diversity between the levels 
in selected condition.

```{r}
do_alpha_div_test(
    MAE = MAE,
    tax_level = "genus",
    condition = "Disease",
    alpha_metric = "shannon",
    alpha_stat = "T-test"
)
```

## Beta Diversity Heatmap

On the other hand, by defining distances between each sample, beta 
diversity is another key metric to look at. Users can plot the beta 
diversity heatmap by selecting different beta diversity dissimilarity
metrics including Bray-Curtis and Jaccard. Users can also conduct beta 
diversity statistical testing between groups including PERMANOVA, Wilcoxon 
rank sum test and Kruskal-Wallis. 

Plot the beta diversity heatmap with selected condition.

```{r}
diversity_beta_heatmap(
    MAE = MAE,
    tax_level = "genus",
    input_beta_method = "bray",
    input_bdhm_select_conditions = "Disease",
    input_bdhm_sort_by = "condition"
)
```

## Beta Diversity Boxplot

Plot the beta diversity boxplot within and between conditions.

```{r}
diversity_beta_boxplot(
    MAE = MAE,
    tax_level = "genus",
    input_beta_method = "bray",
    input_select_beta_condition = "Disease"
)
```

## Beta Diversity Test

Conduct statistical test on the beta diversity between the levels in 
selected condition.

```{r}
diversity_beta_test(
    MAE = MAE,
    tax_level = "genus",
    input_beta_method = "bray",
    input_select_beta_condition = "Disease",
    input_select_beta_stat_method = "PERMANOVA",
    input_num_permutation_permanova = 999
)
```
## Beta Diversity NMDS Plot

Display an NMDS plot based on the selected beta diversity analysis performed.

```{r}
diversity_beta_NMDS(
    MAE = MAE,
    tax_level = "genus",
    input_beta_method = "bray",
    input_select_beta_condition = "Disease"
)
```

# Dimensionality Reduction

## PCA

A wrapper for conduction 2D and 3D Principal Component Analysis.

```{r}
result <- dimred_pca(MAE,
    tax_level = "genus",
    color = "age_s",
    shape = "Disease",
    pcx = 1,
    pcy = 2,
    datatype = "logcpm"
)
result$plot
head(result$table)
```

## PCoA

A wrapper for conduction 2D and 3D Principal Coordinate Analysis.

```{r}
result <- dimred_pcoa(MAE,
    tax_level = "genus",
    color = "age_s",
    shape = "Disease",
    axx = 1,
    axy = 2,
    method = "bray"
)
result$plot
head(result$table)
```

## UMAP

A wrapper for conduction 2D and 3D Uniform Manifold Approximation & Projection.

```{r}
result <- dimred_umap(MAE,
    tax_level = "genus",
    color = "age_s",
    shape = "Disease",
    cx = 1,
    cy = 2,
    n_neighbors = 15,
    metric = "euclidean",
    datatype = "logcpm"
)
result$plot
```

## t-SNE

A wrapper for conduction 2D and 3D t-distributed stochastic neighbor embedding.

```{r}
# result <- dimred_tsne(MAE,
#                       tax_level="phylum",
#                       color="age_s",
#                       shape="Disease",
#                       k="3D",
#                       initial_dims=30,
#                       perplexity=10,
#                       datatype="logcpm")
# result$plot
```

# Differential Analysis

Here in animalcules, we provide a DESeq2-based differential abundance
analysis. Users can choose the target variable, covariate variable, 
taxonomy level, minimum count cut-off, and an adjusted p-value threshold. 
The analysis report will output not only the adjusted p-value and 
log2-fold-change of the microbes, but also the percentage, prevalence, 
and the group size-adjusted fold change.

```{r}
p <- differential_abundance(MAE,
    tax_level = "phylum",
    input_da_condition = c("Disease"),
    min_num_filter = 2,
    input_da_padj_cutoff = 0.5
)
p
```

# Biomarker

One unique feature of animalcules is the biomarker identification module 
built on machine learning models. Users can choose one classification 
model from logistic regression, gradient boosting machine, or random 
forest to identify a microbes biomarker list. The feature importance 
score for each microbe will be provided. To evaluate the biomarker 
performance, the ROC plot and the AUC value using cross-validation 
outputs are shown to users. Note: Results may vary each run.

## Train biomarker

```{r}
p <- find_biomarker(MAE,
    tax_level = "genus",
    input_select_target_biomarker = c("Disease"),
    nfolds = 3,
    nrepeats = 3,
    seed = 99,
    percent_top_biomarker = 0.2,
    model_name = "logistic regression"
)
# biomarker
p$biomarker

# importance plot
p$importance_plot

# ROC plot
p$roc_plot
```

# Session Info

```{r}
sessionInfo()
```