---
title: "MetaboDynamics: analyzing longitudinal metabolomics data with probabilistic models"
package: MetaboDynamics
author: "Katja Danielzik (katja.danielzik@uni-due.de)"

output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{1. MetaboDynamics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Background 
Metabolomics data from longitudinal or time-course high-throughput experiments, such as untargeted liquid chromatography-mass spectrometry (LC-MS), allows for insights into changes of metabolite abundances over time. However, most existing tools are limited to comparing only two time points or experimental conditions, and they often rely on frequentist statistical methods. Additionally, most tools can only analyze groups of metabolites at the pathway level, potentially missing more general patterns of metabolic change in sparse data.

Here, we introduce MetaboDynamics, a modular workflow of probabilistic models for the analysis of longitudinal metabolomics data. This package provides a flexible framework for analyzing longitudinal data, allowing users to model changes in metabolite abundances over time and identify patterns of interest.

Metabolomics data is often noisy and limited by few replicates due to high costs. To address these challenges, we employ a Bayesian hierarchical model that balances between under-fitting and over-fitting. This model provides a robust method for estimating mean abundances at every time point as well as abundance differences between subsequent time points, even with limited replicates.

The MetaboDynamics workflow requires abundance tables as input; the reported values can be metabolite concentrations, LC-MS peak intensities, or peak areas. The workflow can handle any number of time points, and the dynamics model requires at least three replicates. However, the functions that compare clusters and dynamics with each other do not require a minimum number of replicates and can also be applied to mean abundances calculated from, for example, two replicates.

In this vignette, we demonstrate how to use the MetaboDynamics package to analyze longitudinal metabolomics data, including data preprocessing, model fitting, and interpretation of results. We apply the package to a simulated data set and show how to visualize and interpret the results.

# Setup: load required packages
```{r setup}
library(MetaboDynamics)
library(SummarizedExperiment) # storing and manipulating simulated metabolomics data
library(ggplot2) # visualization
library(dplyr) # data handling
library(tidyr) # data handling
```

# Load data and plot data overview

The MetaboDynamics package includes a simulated data set called 
"longitudinalMetabolomics". This data set consists of 98 metabolites with three 
replicates at four time points (1-4) across three experimental conditions (A-C).
The data set is represented as a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) object, where the abundance tables of each experimental condition are stored in assays (raw abundances in "concentrations", log-transformed abundances in "log_con", and scaled (mean = 0, sd = 1) log-transformed abundances in "scaled_log"), and the metabolite names, 
KEGG IDs, experimental conditions, and clustering solutions per experimental 
condition are stored in colData and rowData. 

If you have a data frame as input and do not want to convert it to a SummarizedExperiment object,
please see the vignette "using MetaboDynamics with data frames" for guidance. 
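As a rough orientation, a long-format abundance table for this workflow could look like the sketch below (the column names are illustrative, not a strict requirement; see the data frame vignette for details):

```{r, eval=FALSE}
# illustrative long-format input: one row per metabolite, condition,
# time point and replicate (column names are examples)
abundance_table <- data.frame(
  metabolite = rep(c("ATP", "ADP"), each = 12),
  condition  = rep(rep(c("A", "B"), each = 6), times = 2),
  time       = rep(rep(1:3, each = 2), times = 4),
  replicate  = rep(1:2, times = 12),
  abundance  = rlnorm(24, meanlog = 10, sdlog = 1)
)
head(abundance_table)
```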

To load the data, execute the following code:

```{r}
data("longitudinalMetabolomics", package = "MetaboDynamics")
```
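A quick look at the object confirms the structure described above:

```{r}
longitudinalMetabolomics
assayNames(longitudinalMetabolomics)
head(colData(longitudinalMetabolomics))
```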

The next plot shows the raw metabolite abundances.

```{r,fig.wide=TRUE}
# convert to dataframe
abundances <- as.data.frame(cbind(
  as.data.frame(t(assays(longitudinalMetabolomics)$concentrations)),
  as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
  cols = seq_len(4), names_to = "time",
  values_to = "abundance"
)
ggplot(abundances, aes(x = abundance)) +
  geom_freqpoly() +
  theme_bw() +
  facet_grid(cols = vars(time), rows = vars(condition)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("raw data", "raw measurements")
```

This plot visualizes the distribution of metabolite abundances: many observations with
low abundances and few observations with higher abundances. The raw data is clearly
not normally distributed, so let's log-transform the values.
In the integrated simulated data set this is already done in the assay "log_con".
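If your own data only contains raw abundances, a log-transformed assay can be added along these lines (a minimal sketch, assuming a raw-abundance assay called "concentrations" as in the simulated data):

```{r, eval=FALSE}
# sketch: add a log-transformed assay to an existing SummarizedExperiment
# (assumes a raw-abundance assay named "concentrations"; choose the log
# base that fits your workflow)
assays(longitudinalMetabolomics)[["log_con"]] <-
  log(assays(longitudinalMetabolomics)[["concentrations"]])
```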

```{r,fig.wide=TRUE}
ggplot(abundances, aes(x = abundance)) +
  geom_density() +
  scale_x_log10() +
  theme_bw() +
  facet_grid(cols = vars(time), rows = vars(condition)) +
  ggtitle("data", "log-transformed values")
```

We can see that the metabolite abundances vary over orders of magnitude and are
normally distributed after log-transformation. 

The next plot shows the log-transformed dynamics of single metabolites. 

```{r,fig.wide=TRUE}
ggplot(abundances) +
  geom_line(aes(
    x = time, y = abundance, col = metabolite,
    group = interaction(metabolite, replicate)
  )) +
  scale_y_log10() +
  theme_bw() +
  xlab("timepoint") +
  scale_color_viridis_d() +
  theme(legend.position = "none") +
  facet_grid(rows = vars(condition)) +
  ggtitle("raw metabolite dynamics", "color=metabolite")
```

The dynamics profiles of single metabolites have intercepts (i.e. mean metabolite
abundances) that differ from each other by orders of magnitude. To be able to compare
metabolites with each other, we define dynamics as deviations at the observed
time points from the metabolite's mean abundance. For this we standardize each 
metabolite at each experimental condition to a mean of zero and a standard deviation of one. 

In the simulated data set the scaled measurements are stored in the assay "scaled_log".
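If this scaling still has to be done for your own data, a minimal sketch on a long-format data frame could look like this (column names are illustrative):

```{r, eval=FALSE}
# sketch: standardize each metabolite within each experimental condition
# to mean = 0 and sd = 1 (column names are examples)
abundances_scaled <- abundances %>%
  group_by(metabolite, condition) %>%
  mutate(m_scaled = as.numeric(scale(log(abundance)))) %>%
  ungroup()
```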

```{r,fig.wide=TRUE}
data("longitudinalMetabolomics")
# convert to dataframe
abundances <- as.data.frame(cbind(
  as.data.frame(t(assays(longitudinalMetabolomics)$scaled_log)),
  as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
  cols = seq_len(4), names_to = "time",
  values_to = "abundance"
)
ggplot(abundances) +
  geom_line(aes(
    x = time,
    y = abundance, col = metabolite,
    group = interaction(metabolite, replicate)
  )) +
  theme_bw() +
  scale_color_viridis_d() +
  xlab("timepoint") +
  theme(legend.position = "none") +
  facet_grid(rows = vars(condition)) +
  ggtitle("standardized dynamics", "color=metabolite")
```


# Model dynamics
To analyze the dynamics data, we use a Bayesian hierarchical model that accounts 
for the complexity and variability of the data and provides a flexible and robust 
framework for analysis. The model is defined as follows, with con = metabolite abundances,
m = metabolite, c = experimental condition and t = time point ID:

\begin{align*}
\log(con_{m,c,t})&\sim {\sf normal}(\mu_{m,c,t},\sigma_{m,c,t}) \\
\mu_{m,c,t}&\sim {\sf normal}(0,2) \\
\sigma_{m,c,t}&\sim {\sf exponential}(\lambda_{m,c}) \\
\lambda_{m,c}&\sim {\sf exponential}(2) 
\end{align*}

The model assumes a normal distribution of log-transformed and scaled metabolite
abundances and estimates a mean $\mu$ per metabolite and time point. The spread $\sigma$ 
of the replicate measurements is also estimated per metabolite and time point, 
but the spreads at different time points inform each other through the hierarchical
structure of the hyper-parameter $\lambda$. This partial pooling balances
over- and under-fitting.

The code below shows how to fit the Bayesian model with the function
fit_dynamics_model(). This might take on the order of 10 minutes 
per experimental condition for 98 metabolites. Here we fit the dynamics model 
for a small subset of the simulated metabolites. 

```{r,fig.wide=TRUE}
# we can hand a SummarizedExperiment object to the function
data("longitudinalMetabolomics")
# we only use a subsection of the simulated data (1 condition and subsample of
# the whole dataset) for demonstration purposes
samples <- c(
  "UMP", "Taurine", "Succinate", "Phosphocreatine", "PEP",
  "Malic acid", "L-Cystine", "CMP", "Citramalic acid", "2-Aminomuconate"
)
# only one condition
data <- longitudinalMetabolomics[, longitudinalMetabolomics$condition == "A" &
  longitudinalMetabolomics$metabolite %in% samples]

# fit model
data <- fit_dynamics_model(
  data = data, scaled_measurement = "m_scaled",
  assay = "scaled_log", time = "time",
  condition = "condition", max_treedepth = 10,
  adapt_delta = 0.95, # default 0.95
  iter = 5000,
  cores = 1,
  chains = 2 # only set to 2 for vignette, default = 4
)
```

This returns a list of model fits that are named by the experimental condition
("A","B","C"). If data is a SummarizedExperiment object the fits are stored in 
metadata(data)[["dynamic_fits"]]. 

MetaboDynamics fits Bayesian models with MCMC (Markov Chain Monte Carlo), which
yields a set of diagnostic criteria:

- rhat: R-hat convergence diagnostic (should be close to 1)
- neff: number of effective samples (should be above 100)
- divergences: number of divergent transitions (should be 0)
- max_treedepth: number of iterations that exceed the maximum treedepth (should be 0)

Hints on how to adjust parameters for fit_dynamics_model() can be found in the 
function documentation (type "?fit_dynamics_model" in the R console if 
MetaboDynamics is loaded).

With diagnostics_dynamics() we can extract all diagnostic criteria of the model fit. 
If data is a SummarizedExperiment object, the diagnostics are stored in 
metadata(data)[["diagnostics_dynamics"]] and can be visualized with the function 
plot_diagnostics(). Additionally, data frames for visual posterior predictive 
checks (PPC) are prepared, which can be visualized with the function plot_PPC().

```{r}
# extract diagnostics
data <- diagnostics_dynamics(
  data = data,
  iter = 5000, # number of iterations used for model fitting
  # the dynamic model
  chains = 2 # number of chains used for model fitting
)

plot_diagnostics(
  data = data
)
```

In this example we have zero divergent transitions, zero iterations exceeding the
maximum tree depth, Rhat values consistently below 1.01, and numbers of
effective samples comfortably above 100. All of this indicates that the
model fitting went well. 
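The underlying numbers can also be inspected directly; as mentioned above, they are stored in the object's metadata:

```{r}
# look at the structure of the stored diagnostics (one entry per condition)
str(metadata(data)[["diagnostics_dynamics"]], max.level = 1)
```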

The result of a Bayesian model (the posterior) should predict the experimental
data well. This can be checked with a posterior predictive check (PPC) in which
the experimental data (points) should lie within the predictions (violins).

```{r}
# PPCs can be accessed with
plot_PPC(
  data = data
)
```

The PPC plot indicates that the model is well-behaved and that the experimental 
data points are consistent with the predicted distribution, suggesting that the 
model is a good fit to the data.

After checking the diagnostic criteria and the PPC, we can extract the estimates from the
model fit using the estimates_dynamics() function. If data is a SummarizedExperiment 
object, the estimates are stored in metadata(data)[["estimates_dynamics"]].


```{r,fig.wide=TRUE}
# extract estimates
data <- estimates_dynamics(
  data = data, iter = 5000,
  chains = 2, condition = "condition"
)
```

This function returns two types of estimates: 1) the estimated metabolite abundance 
differences between two subsequent time points for each metabolite at each 
experimental condition, and 2) the dynamics profiles of each metabolite at each 
experimental condition.

## Differences between two time points

In the following plot the column label "A" indicates the experimental condition.
The row labels "1", "2" and "3" indicate the differences between subsequent
time points: "1" is the difference between time points 1 and 2, "2" between
time points 2 and 3, and "3" between time points 3 and 4. The metabolites are on 
the x-axis and the estimated abundance differences ("delta") are on the y-axis. 

```{r,fig.wide=TRUE}
# 1) the differences between two timepoints
plot_estimates(
  data = data,
  dynamics = FALSE
)
```

The results of Bayesian models can be interpreted intuitively: the 95% credible
interval (CrI) (95% highest density interval) indicates the range in which the
parameter (in this case the difference in abundance between time points) lies with
95% probability. That means that with 95% probability the desired value lies within
the range of the error bars. 
If the 95% CrI lies below zero, we can conclude that there is a decrease in 
abundance between the two subsequent time points (e.g. PEP between time points 3 
and 4, red error bar). If the 95% CrI lies above zero, we can conclude that there
is an increase in abundance between the two subsequent time points (e.g. 
2-Aminomuconate between time points 2 and 3). If the 95% CrI contains zero, we
cannot conclude that there is a difference in abundance between the two time points.
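This interpretation can also be applied programmatically, for example to pull out all differences whose 95% CrI excludes zero. The column names used below for the CrI bounds are placeholders; check colnames(estimates[["A"]]) or the estimates_dynamics() documentation for the names that are actually returned:

```{r, eval=FALSE}
# sketch: differences between subsequent time points whose 95% CrI
# excludes zero ("delta_lower"/"delta_higher" are placeholder names)
estimates <- metadata(data)[["estimates_dynamics"]]
estimates[["A"]] %>%
  filter(delta_lower > 0 | delta_higher < 0)
```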


## Dynamic profiles

The dynamic profiles of each metabolite can be visualized using the estimated 
mean abundances at each time point. The plot below shows the estimated mean 
abundances at each time point for each metabolite.

```{r,fig.wide=TRUE}
# 2) dynamic profiles
plot_estimates(
  data = data,
  delta_t = FALSE
)
```

The dynamic profiles can be used to identify patterns and trends in the data,
such as shifts in abundance over time, and to visualize the changes in abundance 
for each metabolite over the observed time interval. 


The estimates of the dynamics model are returned by the function estimates_dynamics() 
and can be accessed with:

```{r}
# get estimates
estimates <- metadata(data)[["estimates_dynamics"]]
# only show first rows of condition A
estimates[["A"]][1:10, ]
```

To identify groups of metabolites with similar dynamics, we can now cluster these 
metabolite-specific dynamics vectors (estimates[,"mu_mean"]).

# Cluster dynamics
MetaboDynamics has a built-in wrapper function for clustering with the "hybrid"
method of the package [dynamicTreeCut](https://cran.r-project.org/web/packages/dynamicTreeCut/index.html).
An alternative would, for example, be to use a soft-clustering algorithm such
as implemented in the Bioconductor package [Mfuzz](https://bioconductor.org/packages/release/bioc/html/Mfuzz.html).
Some clustering methods also allow using the overall variation (estimates[,"lambda_mean"]). 
Clustering precision could be assessed by drawing samples from the model posterior 
(i.e. specifying "samples" in the estimates_dynamics() function) and comparing the 
clustering solution of the mean abundances with clustering solutions obtained from 
the posterior samples (estimates[,"mu_sample_n"]).

The code below shows how to use the cluster wrapper function cluster_dynamics()
on a SummarizedExperiment object where the dynamics estimates are stored in
metadata(data)[["estimates_dynamics"]]. The function cluster_dynamics() also accepts lists
of data frames with calculated mean abundances of metabolites if fewer than
three replicates (the minimum requirement for fit_dynamics_model()) are available. 

```{r,fig.wide=TRUE}
data <- cluster_dynamics(data,
                         deepSplit = 0) # low clustering sensitivity
```

For the visualization of the clustering results we can use the function
plot_cluster(), which offers three different views: 1) a dendrogram with 
color coding of the different clusters. 

```{r}
plot <- plot_cluster(data)
```
2) A Principal Component Analysis (PCA) of the clustering result 

```{r}
plot[["A"]][["PCA_plot"]]
```
and 3) the visualization of the single metabolite dynamics within the different
conditions. For the sake of demonstration we use from here on the clustering result 
(metadata(longitudinalMetabolomics)[["cluster"]]) obtained on the full simulated data set (data("longitudinalMetabolomics")). 

```{r}
plot <- plot_cluster(longitudinalMetabolomics)
plot[["lineplot"]]
```

We can identify different patterns (i.e. clusters) of metabolite dynamics and 
these patterns differ between different experimental conditions. Can we quantify 
the biological function of these dynamics clusters?

# Over-representation analysis of functional modules in dynamics clusters
To quantify the possible biological function of these dynamics clusters we 
retrieved from the KEGG-database the following information:
1. to which functional modules our experimental metabolites are annotated and
2. which metabolites are annotated to functional modules in general.

The functional modules of the KEGG-database are organised in three hierarchies:
upper, middle and lower. The middle hierarchy holds modules such as "Amino acid
metabolism", the lower hierarchy, for example, "branched-chain amino acid metabolism". To 
facilitate analysis, two data frames were prepared: "metabolite_modules", which 
holds the annotation information for the metabolites of the simulated data set, and 
"modules_compounds", which holds the information about which metabolites are 
annotated to functional modules in general. 

To construct the needed annotation data frame for your own experimental data set,
simply filter the "modules_compounds" data set for your experimental metabolites'
KEGG IDs. 
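A minimal sketch of this filtering step (the name of the KEGG ID column, here "KEGG", and the vector of experimental IDs, here "my_kegg_ids", are placeholders; check colnames(modules_compounds) for the actual column name):

```{r, eval=FALSE}
# sketch: restrict the general annotation table to the experimentally
# observed metabolites ("KEGG" and my_kegg_ids are placeholders)
my_annotations <- modules_compounds %>%
  filter(KEGG %in% my_kegg_ids)
```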

We load both data sets and can inspect the
documentation.

```{r}
data("metabolite_modules")
help("metabolite_modules")
head(metabolite_modules)
data("modules_compounds")
help("modules_compounds")
head(modules_compounds)
```

It is important to note that not all KEGG modules are suitable for testing on 
every observed organism and experimental condition. For example, the modules 
"Xenobiotics biodegradation", "Biosynthesis of other secondary metabolites", 
and "Biosynthesis of terpenoids and polyketides" should not be analyzed in a 
human lung cancer cell line.
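Such categories can simply be removed from the background before the analysis; a minimal sketch, assuming they appear in the "middle_hierarchy" column that is also used below:

```{r, eval=FALSE}
# sketch: drop module categories that are not meaningful for the
# studied organism/condition before the over-representation analysis
not_applicable <- c(
  "Xenobiotics biodegradation",
  "Biosynthesis of other secondary metabolites",
  "Biosynthesis of terpenoids and polyketides"
)
background_filtered <- modules_compounds %>%
  filter(!middle_hierarchy %in% not_applicable)
```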

For the functional analysis we employ a hypergeometric model. We consider a 
functional module as over-represented in a cluster if the 95% inter-quantile
range (ICR) of the log-transformed probabilities of OvEs (observed vs. expected)
lies above zero. OvE refers to the ratio of the number of observed metabolites in 
a cluster that map to a functional module over the number of metabolites expected 
to map to that module under the assumption of a hypergeometric distribution 
(i.e. drawing without replacement). If data is a SummarizedExperiment
object where the clustering solution is stored in metadata(data)[["cluster"]],
the ORA results are stored in metadata(data)[["ORA_tested_column"]].
We apply the functional analysis to the middle hierarchy of functional
modules. 

```{r,fig.wide=TRUE}
data <- ORA_hypergeometric(
  background = modules_compounds,
  annotations = metabolite_modules,
  data = longitudinalMetabolomics, tested_column = "middle_hierarchy"
)

plot_ORA(data, tested_column = "middle_hierarchy")
```

Great, we can now see which functional modules are over-represented 
(green points and error bars) or under-represented (none in this example) in 
which dynamics cluster! For instance, in cluster 2 of condition A the module
"Nucleotide metabolism" is over-represented. 

# Comparison of clusters of different experimental conditions

## Dynamics 
We can not only conduct over-representation analysis of KEGG functional modules but
also compare dynamics clusters across different experimental conditions. For 
this we employ a Bayesian model that estimates the mean difference as well as 
the standard deviation of differences between dynamics clusters. If data is a 
SummarizedExperiment object, the results of compare_dynamics() are stored 
in metadata(data)[["comparison_dynamics"]].

The model is defined with dist = vector of pairwise Euclidean distances between every 
dynamics vector of cluster a and every dynamics vector of cluster b, and ID = cluster pair ID:
\begin{align*}
dist_{ID}&\sim {\sf normal}(\mu_{ID},\sigma_{ID}) \\
\mu_{ID}&\sim {\sf normal^+}(0,2) \\
\sigma_{ID}&\sim {\sf exponential}(1) 
\end{align*}

```{r,fig.align='center',fig.dpi=150}
data("longitudinalMetabolomics")
longitudinalMetabolomics <- compare_dynamics(
  data = longitudinalMetabolomics,
  cores = 1 # just set to 1 for vignette, set to 4 if possible
)
```

We can visualize the posterior estimates of the model with:

```{r,fig.align='center',fig.dpi=150}
heatmap_dynamics(data = longitudinalMetabolomics)
```
This results in a bubble heat map where each cluster of every condition (e.g.
A_2 = cluster 2 of condition A) is compared with all clusters of every other condition.
The comparison of clusters is reciprocal and therefore each comparison pair is 
only visualized once. 
The bigger and brighter a point, the smaller are the mean distance between the
dynamics clusters and the standard deviation. That means that big bright points 
indicate high dynamics similarity with small spread. Here C_10 and
A_10 have high similarity in dynamics. 
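The numerical estimates behind this plot are stored in the metadata and can, for example, be sorted to list the cluster pairs with the smallest mean distance:

```{r}
# cluster pairs with the most similar dynamics (smallest mean distance)
metadata(longitudinalMetabolomics)[["comparison_dynamics"]][["estimates"]] %>%
  arrange(mu_mean) %>%
  head()
```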

## Metabolites
We can also compare the metabolite composition of clusters. For this we employ
the Jaccard index, a measure of the similarity of two sets of metabolites. Values near
0 indicate low similarity, values near 1 high similarity. If data is a 
SummarizedExperiment object, the results of compare_metabolites() are stored 
in metadata(data)[["comparison_metabolites"]].

```{r,fig.align='center',fig.dpi=150}
longitudinalMetabolomics <- compare_metabolites(
  data = longitudinalMetabolomics
)
heatmap_metabolites(data = longitudinalMetabolomics)
```
The brighter a tile, the higher the similarity of two clusters in regard
to their metabolite composition. Comparisons of clusters within the same condition
always have a Jaccard index of zero, as every metabolite is assigned to only one 
cluster per condition; hence there is no overlap between clusters of one 
experimental condition. 
We have two clusters that are similar in their metabolite composition:
C_4 and A_3. If we compare that to the dynamics profiles and the ORA analysis,
we see that similar functional modules are over-represented, as expected, BUT
the dynamics differ between the two experimental conditions. 
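Analogously, the cluster pairs with the largest overlap in metabolites can be listed from the stored Jaccard indices:

```{r}
# cluster pairs with the most similar metabolite composition
metadata(longitudinalMetabolomics)[["comparison_metabolites"]] %>%
  arrange(desc(Jaccard)) %>%
  head()
```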

Can we combine both measures in a single visualization?

## Combine both
```{r,fig.align='center',fig.dpi=150}
dynamics <- metadata(longitudinalMetabolomics)[["comparison_dynamics"]][["estimates"]]
metabolites <- metadata(longitudinalMetabolomics)[["comparison_metabolites"]]
temp <- left_join(dynamics, metabolites, join_by("cluster_a", "cluster_b"))
x <- unique(c(temp[, "cluster_a"], temp[, "cluster_b"]))
temp <- temp %>% mutate(scale_Jaccard = scale(Jaccard))

ggplot(temp, aes(x = cluster_b, y = cluster_a)) +
  geom_point(aes(size = Jaccard, col = mu_mean)) +
  theme_bw() +
  scale_color_viridis_c(option = "magma") +
  scale_x_discrete(limits = x) +
  xlab("") +
  ylab("") +
  scale_y_discrete(limits = x) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(col = "dynamics distance", size = "metabolite similarity") +
  ggtitle("comparison of clusters", "label = condition + cluster ID")
```

With this form of visualization we can easily spot clusters in which the
experimental condition (e.g. kind or dose of treatment) has the
biggest effect. These clusters are similar in regard to their composing metabolites
(i.e. big points) but dissimilar in regard to their dynamics (i.e. bright color). 
Here this is, for example, the cluster pair B_10 and A_5. Their ORA profiles are 
also quite similar, as expected from the similar metabolite compositions. 
Clusters of metabolites that are not much affected by the experimental condition
can be spotted as big (i.e. high metabolite similarity), dark (i.e. low dissimilarity
in dynamics) points, for example the cluster pair C_4 and A_3.

```{r}
sessionInfo()
```