---
title: "funOmics"
author: "Elisa Gómez de Lope"
date: "2024-02-16"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{funOmics}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

The `funOmics` R package is a collection of functions designed to aggregate omics data into higher-level functional representations such as pathways, protein complexes, and cellular locations. This vignette provides a detailed guide on how to use the package.

Omics data analysis is a critical component of modern biomedical research. The `funOmics` package provides a tool for aggregating omics data from high-throughput experiments (e.g. transcriptomics, metabolomics, proteomics) into higher-level functional activity scores that can then be used for further analysis and modeling. This capability provides a more global view of the biological systems, reduces the dimensionality, and facilitates biological interpretation of results.

The package provides different pooling operators, such as aggregation statistics (mean, median, standard deviation, min, max), dimension-reduction derived scores (PCA, NMF, MDS, _pathifier_ deregulation scores from the `pathifier` package), or test statistics (t-test, Wilcoxon test, Kolmogorov–Smirnov test) with options for adjusting parameters and settings to suit specific research questions and data types. The package is also well-documented, with detailed descriptions of each function and several examples of usage. 

`funOmics` distinguishes itself from existing Bioconductor packages dedicated to pathway or gene set analysis such as GSEA and ORA (`clusterProfiler`, `fgsea`, `GSEAset`), or `GSVA`, by offering a comprehensive tool for directly aggregating diverse omics data types into higher-level functional representations, allowing the analysis of such functional representations as functional activity scores that can be modeled as input features for identifying candidate biomarkers, or in clustering strategies for patient identification. Unlike GSEA and ORA, which primarily focus on gene expression and predefined gene sets, `funOmics` accommodates various omics modalities (e.g., metabolomics, transcriptomics, proteomics), and allows users to define custom molecular sets for aggregation. Additionally, `funOmics` goes beyond `GSVA` by providing flexibility in the choice of aggregation operators, enabling users to derive interpretable functional activity scores tailored to their specific research questions. By offering a flexible and user-friendly, alternative tool for functional analysis, `funOmics` aims to contribute to the diverse array of Bioconductor packages and enhance the capabilities of the community.

# Installation

Install `funOmics` from [Bioconductor](https://www.bioconductor.org/) (release 3.19 onwards) via:

``` {r,eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("funOmics")
```

or the pre-release and latest development version from [GitHub](https://github.com/elisagdelope/funOmics):

```{r,eval=FALSE}
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("elisagdelope/funOmics") 
```


# Usage

## Loading the Package

To use the `funOmics` R package, load it with the following command:

```{r}
library(funOmics)
```

## Functions `get_kegg_sets` and `short_sets_detail`

The function `get_kegg_sets` retrieves KEGG pathway gene sets for a specified organism.  It fetches all pathways available for the specified organism from the KEGG database and maps the genes involved in each pathway. Currently, the function only supports choice of gene identifiers (entrez IDs, gene symbols and Ensembl IDs) for Homo sapiens (organism = "hsa") using the `org.Hs.eg.db` package.

`get_kegg_sets` has two parameters: `organism` and `geneid_type`. The parameter `organism` provides the organism abbreviation for which KEGG pathway gene sets are to be retrieved (e.g., "ecj" for E. coli). Default is "hsa" (Homo sapiens). `geneid_type` provides the type of gene IDs to provide and is only used when the organism is "hsa" (Homo sapiens). The default is "entrez"; options are "entrez", "symbol", or "ensembl". 

The function `get_kegg_sets()` returns a list where each element represents a KEGG pathway gene set (i.e., a list of lists). The names of the inner lists correspond to the pathway names. For further details, the function documentation can be accessed by running the following command:

```{r,eval=FALSE}
?funOmics::get_kegg_sets
```

The function `short_sets_detail` identifies molecular sets with sizes less than a specified threshold, and returns information about these short sets. It has two parameters: `sets` and `minsize`. The parameter `sets` is a list of molecular sets, like that obtained from the function `get_kegg_sets`. The parameter `minsize` provides the minimum size threshold for sets to be categorized as "short" and subsequently processed to extract information. Function documentation can be accessed by running:

```{r,eval=FALSE}
?funOmics::short_sets_detail
```


### Examples usage

Let's retrieve KEGG pathway gene sets for Homo sapiens with entrez IDs (default):
```{r}
hsa_kegg_sets_entrez <- get_kegg_sets()
head(hsa_kegg_sets_entrez)
```

The KEGG molecular sets can also be retrieved for gene symbols with the `geneid_type = "symbol"` flag:
```{r}
hsa_kegg_sets_symbol <- get_kegg_sets(geneid_type = "symbol")
hsa_kegg_sets_symbol[1]
```

And similarly for Ensembl IDs with the `geneid_type = "ensembl"` flag:
```{r}
hsa_kegg_sets_ensembl <- get_kegg_sets(geneid_type = "ensembl")
hsa_kegg_sets_ensembl[1]
```

`get_kegg_sets` can also be used to retrieve KEGG pathway gene sets for another organism (e.g., Escherichia coli). Note that the choice of gene identifier is currently not supported for organisms other than Homo sapiens, hence the gene type is that stored by the KEGG database.
```{r}
ecoli_kegg_sets <- get_kegg_sets(organism = "ecj")
head(ecoli_kegg_sets)
```

## Main Function: `summarize_pathway_level`

You can then access the main function provided by the package, *summarize_pathway_level* with the type of pooling operator desired to be applied for each molecular set. This function has several options for adjusting parameters and settings to suit specific research questions and data types. The available aggregation operators and other parameters options are described in detail in the package documentation. You can also see the documentation for this function with the command:

```{r,eval=FALSE}
?funOmics::summarize_pathway_level
```

Find below some examples of usage with transcriptomics data from the `airway` dataset.

## Summarizing omics data in `SummarizedExperiment` format from `airway` into KEGG pathway level functional activity scores 

Here we illustrate through some examples how to apply the function `summarize_pathway_level` to aggregate data in `SummarizedExperiment` format into KEGG pathway-level functional activity scores. Note that the function can also be used to summarize other types of omics data into any higher-level functional representations beyond pathways, such as protein complexes or cellular locations. 

Let's first get an example dataset stored as a `SummarizedExperiment` from the `airway` package. This data represents an actual RNA sequencing experiment on four human airway smooth muscle cell lines.

```{r}
library(SummarizedExperiment)
library(airway)
data(airway)
airway
```

The measurement data can be accessed by assay and assays. Note that `SummarizedExperiment` object can contain multiple measurement matrices (all of the same dimension), but in this case `airway` contains only one matrix of RNA sequencing data named `counts`:

```{r}
assayNames(airway)
```

```{r}
head(assay(airway, "counts"))
dim(assay(airway, "counts"))
```

The data matrix contains 63677 genes (or transcripts) and 8 samples. The features names are Ensembl identifiers, let's get a list of KEGG gene sets with Ensembl IDs through the function `get_kegg_sets` provided by `funOmics` package. Note that `get_kegg_sets` can be used to retrieve a list of KEGG gene sets from any organism available, given its abbreviation (e.g., "hsa" for Homo sapiens or "ecj" for Escherichia coli). 

Since `airway` data corresponds to human samples, the parameter `geneid_type` in `get_kegg_sets` can be used to retrieve the molecular sets with Ensembl IDs, and the organism is set to default ("hsa"):


```{r}
kegg_sets <- get_kegg_sets(geneid_type = "ensembl")
head(kegg_sets)
```

#### Example usage 1: Summarize `airway` omics data into dimension-reduction derived activity scores at KEGG pathway level. 

The dimension-reduction operators implemented in `funOmics` include PCA (Principal Component Analysis), NMF (Non-Negative Matrix Factorization), MDS (Multidimensional Scaling), and _pathifier_ deregulation scores from the `pathifier` package derived from principal curves. 

Now, let's summarize the counts data using PCA. The PCA-aggregated activity scores values represent the projection of the overall expression of all genes in each pathway onto the first principal component. For this example, let's use the default minimum size of sets (10). Note that when default value for `minsize` parameter is used, it is not necessary to assign a value for this parameter in the function call:

```{r}
pathway_activity_pca <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "pca")
```

From the original `airway` data containing 63677 genes (transcripts) and 8 samples, the function `summarize_pathway_level` has generated a pathway-level activity score for each of the 8 samples, for 343 KEGG pathways containing more than 10 genes (16 failed functional aggregations under `minsize`). Let's see how this matrix looks like:

```{r}
print(head(pathway_activity_pca))
```

The resulting matrix of higher-level functional representations looks very similar to the original one, except that the original had many more features (63677 instead of 343). This reduction in dimensionality can facilitate the interpretation of the data and the identification of patterns in the samples.

In this illustrative example, the RNA sequencing data in `airway` package has been directly summarized or aggregated by the `summarize_pathway_level` function of the `funOmics` package without intermediate processing. Depending on the type of omics data, you may want to apply corresponding processing steps to the omics abundance matrix prior to the aggregation into higher level functional features. For instance, it is common practice to filter out rows (genes or features) with low counts when analyzing transcriptomics data. This filtering step is often performed to remove genes that are expressed at very low levels and may not be biologically relevant or reliable.

Some aggregation methods, may also have specific assumptions or requirements on the input data. Let's see another example where some filtering is indeed necessary.

`funOmics` allows to generate aggregated representations using dimension-reduction derived scores from the NMF (Non-Negative Matrix Factorization) method. The NMF-aggregated activity scores values represent the weight (or contribution) of a single underlying basis component or latent factor contributing to the pathway activity (or higher level functional structure in use) for each sample in your data set. Rank=1 is used for the basis matrix in the internal NMF dimension-reduction.

Notably, the NMF method does not allow for negative values or null rows in the input matrix. Transcriptomics data in the `airway` dataset are measured as counts, hence the matrix presumably does not contain negative values, but it may contain null rows. To avoid errors, we can filter out rows with less than 10 counts across all samples before applying the NMF method:

```{r}
print(any(assay(airway, "counts")[rowSums(assay(airway, "counts")) <0, ]))
X <- assay(airway, "counts")[rowSums(assay(airway, "counts")) >= 10, ]
```

Let's summarize the filtered counts data using NMF method:

```{r}
pathway_activity_nmf <- summarize_pathway_level(X, kegg_sets, type = "nmf") # note that the NMF operation can take some minutes
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity_nmf), ",", ncol(pathway_activity_nmf)))
```

Let's see how this matrix looks like:

```{r}
head(pathway_activity_nmf)
```


In this example, `summarize_pathway_level` has generated a pathway-level activity score for each of the 8 samples, for 340 KEGG pathways containing more than 10 genes (here 19 failed functional aggregations under `minsize`, slightly more than for the PCA since some genes were removed in the filtering). Note that the resulting matrix looks similar to that of the previous example in terms of shape and format, but the values are derived from the NMF dimension-reduction method instead of the PCA method. Same analogies apply for other types of aggregation operator; only the interpretation of the resulting functional activity scores will change. 

Note that beyond pre-processing, you can also post-process the resulting summarized matrix as you see appropriate for your analyses and workflows.


#### Integrating the pathway-level activity scores with the `airway` `SummarizedExperiment` object in a `MultiAssayExperiment` object

The resulting matrix of pathway-level activity scores can be further analyzed as an independent dataset, or can also be integrated with the `airway` `SummarizedExperiment` object in a `MultiAssayExperiment` structure (note that `SummarizedExperiment` can simultaneously manage several experimental assays only if they have the same dimensions, which is not the case here, hence the need for a `MultiAssayExperiment` object). The MultiAssayExperiment library has to be loaded, and a MultiAssayExperiment (`airwayMultiAssay`) can be created and filled with a list of assays-like matrices that may have different dimensions. Here, `airwayMultiAssay` contains the `counts` and the recently generated KEGG pathway activity scores by NMF and PCA pooling. 

```{r}
library(MultiAssayExperiment)
assays_list <- list( counts = assay(airway, "counts"), kegg_nmf_agg = pathway_activity_nmf, kegg_pca_agg = pathway_activity_pca)
airwayMultiAssay <- MultiAssayExperiment(experiments=assays_list)
colData(airwayMultiAssay) <- colData(airway)
airwayMultiAssay
```


#### Example usage 2: Summarize `airway` omics data with summary statistics and a minimum size of the KEGG gene sets

Here we will apply the function `summarize_pathway_level` to summarize pathway activity using the mean pooling aggregation for those sets containing at least 12 genes. Remember that you can adjust the parameters `minsize` and `type` of aggregation as desired.

```{r}
min <- 12
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "mean", minsize = min)
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
```

Now from the original `airway` data of dimensions 63677 genes (transcripts) x 8 samples, `summarize_pathway_level` has generated through mean-pooling a pathway-level activity score matrix of 341 pathways x 8 samples, for gene sets containing more than 12 genes. 

```{r}
print(head(pathway_activity))
```

In this example, 18 of the gene sets in `kegg_sets` have size \< 12. You can then use the function `short_sets_detail` to get information about which pathways have been left out, how many genes they had, and which genes are involved in these shorter sets:

```{r}
short_sets <- short_sets_detail(kegg_sets, min)
print(short_sets$short_sets)
```

```{r}
print(short_sets$short_sets_lengths)
```

```{r}
print(short_sets$short_sets_molecules)
```


Other summary statistics can be used for the aggregation, such as median, standard deviation, min, or max. See below some more examples with varying number of genes in the gene sets:

```{r}
min <- 15
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "sd", minsize = min)
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
head(pathway_activity)
```

```{r}
min <- 7
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "median", minsize = min)
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
head(pathway_activity)
```


#### Example usage 3: Summarize `airway` omics data with test statistics

Using the same `airway` data and gene sets `kegg_sets` from previous examples, let's generate aggregated representations using test statistics. These operators allow to compare the measurements for each sample between the molecules in each functional set and the molecules not in the given functional set. They may help identify functionally related genes/molecules that exhibit coordinated or significant deviations in their expression patterns across samples. Currently, the implemented available statistical tests in `funOmics` are the t-test, Wilcoxon test, and Kolmogorov–Smirnov test. 


When using test statistics, one has to be mindful about the assumptions of the test and the distribution of the data. For instance, the t-test assumes that the data is normally distributed and compares the means of two groups; the Wilcoxon test assumes that the data is continuous and symmetric, and compares the medians of two groups; the Kolmogorov–Smirnov test is a non-parametric test that does not assume any distribution and compares the entire cumulative distribution functions (i.e, in this context, the overall shapes of the distributions of values) of two groups. Here we provide an example using the t-test statistic:


```{r}
pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "ttest", minsize = 15)
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
```

In this case, `summarize_pathway_level` has generated a pathway-level activity score for each of the 8 samples, for 339 pathways containing more than 15 genes. The resulting test statistic for each sample represents the difference between the two groups (i.e., the functional set and rest of molecules) for each gene or molecule. These statistics can be then be used as features for further analysis or modeling:

```{r}
print(head(pathway_activity))
```

Importantly, this is just an illustrative example. In real-world experiments, you may have to consider that genes or molecules in a certain set might be longer or naturally more abundant and therefore might have higher overall read counts (abundance) compared to genes or molecules outside the set, even if their expression levels (as a proportion of transcripts) are similar. Normalizing the expression counts for all genes (including those within and outside the sets) using a suitable method (e.g., library size normalization, transcript length normalization) can help make the expression levels more comparable before performing the aggregation and set comparisons.


## Molecular sets beyond KEGG and omics matrices beyond `SummarizedExperiment`

The package `funOmics` interoperates with KEGGREST to retrieve molecular sets from the KEGG through the function `get_kegg_sets` (see description and example above). Other real-world molecular sets can be downloaded from several sources. In terms of gene sets, the Gene Ontology is a versatile resource that covers three domains: cellular components, biological processes and molecular functions. Reactome pathways can also be used to generate higher-level functional representations from omics data. Explore the different releases and download the corresponding gene sets for the different types of GO terms, and reactome pathways [here](https://data.broadinstitute.org/gsea-msigdb/msigdb/release/). You can also aggregate genes into protein complexes, which you can find in the [CORUM database](https://mips.helmholtz-muenchen.de/corum/#download/).

Regarding other omics types, such as metabolomics, the function `summarize_pathway_level` can be applied in a similar manner to a metabolomics matrix `X` and KEGG metabolic pathways. Metabolite sets from KEGG pathways can also be downloaded with the [KEGG API](https://www.kegg.jp/kegg/rest/keggapi.html).

After obtaining the molecular sets information, this data has to be formatted as a list of lists (similar to what is obtained from the `get_kegg_sets` function). In other words, you need a structure where you have a list of multiple molecular sets names, and each of these sets is represented as a list of molecule identifiers, such as entrez IDs, PubChem CIDs, Uniprot IDs, etc. 
For instance, let's retrieve gene sets from GO terms for cellular compartments. The information can be downloaded from the [msigdb link](https://data.broadinstitute.org/gsea-msigdb/msigdb/release/) or accessed programmatically as follows:

```{r}
goccdb <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c5.go.cc.v7.5.entrez.gmt"
downdb <- sapply(readLines(goccdb), function(x) strsplit(x, "\t")[[1]])
gocc_sets <- sapply(as.matrix(downdb), function(x) x[3:length(x)])
names(gocc_sets) = sapply(as.matrix(downdb), function(x) x[1])
gocc_sets[1:3]
```

As you can see, the resulting `gocc_sets` object is a list of lists where each element represents a GO cellular compartment gene set. Here you can also use the function `short_sets_detail` to get information about which and how many pathways contain less than a given number of molecules:

```{r}
min <- 8
short_sets <- short_sets_detail(gocc_sets, min)
print(head(short_sets$short_sets_molecules))
```

Notably, omics data does not always come from a `SummarizedExperiment` object. Some times, it is imported from a csv file, generated through other pre-processing steps and packages, or even generated from a simulation. In these cases, the data has to be formatted as a matrix of dimensions `g*s` (`g` molecules and `s` samples).

Let's see an example usage of the GO cellular compartments gene sets `gocc_sets` where omics data is of type matrix. For this purpose, we will create an expression matrix where the expression values are random positive values sampled from a standard normal distribution. Please, note that `funOmics` can be used to aggregate other types of omics data and molecular sets, such as metabolomics or proteomics that may have a similar range of values.

Let's simulate a gene expression matrix `X_expr`, where gene IDs are codes between 1:10000 (to match entrez IDs), and `summarize_pathway_level` can be applied:

```{r}
# Example usage:
set.seed(1)
g <- 10000
s <- 20
X_expr <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(1:g, paste0("s", 1:s)))
print(paste("Dimensions of omics matrix X:", dim(X_expr)[1], "*", dim(X_expr)[2]))
head(X_expr)
```

Now, let's summarize the expression data using standard deviation pooling for the GO cellular compartments gene sets. We won't specify a minimum size of sets in this case, so the default `minsize` of 10 is used:

```{r}
sd_gocc_expr <- summarize_pathway_level(X_expr, gocc_sets, type="sd", minsize=8)
head(sd_gocc_expr)
```

GO cellular compartments level expression signatures have been generated via standard deviation aggregation. You can apply similar procedures for other types of molecular sets, aggregation functions and omics types. The package `funOmics` is conceived to be flexible across omics types and types of molecular sets, so you can also tailor or directly create your own list of molecular sets based on specific criteria of your experiments (e.g., include only protein complexes involved in ubiquitination, or define _ad hoc_ metabolic routes involving specific metabolites).


## Packages & Session information

The `funOmics` package was developed for R version \>= 4.0.3. However, [BioConductor](https://www.bioconductor.org/) release 3.19 runs on R-4.4. See session information and loaded packages below:

```{r}
sI <- sessionInfo()
print(sI, locale = FALSE)
```


# Contact Information

Feedback is very welcome! If you have any questions, issues, or suggestions for improving the `funOmics` package, please use the GitHub issues page or contact [elisa.gomezdelope\@uni.lu](mailto:elisa.gomezdelope@uni.lu){.email}.

# License

The `funOmics` package is released under the terms of the MIT License. See the [LICENSE](https://github.com/elisagdelope/funomics?tab=MIT-1-ov-file) file for more details.