---
title: >
 The `GeDi` User's Guide
author:
- name: Annekathrin Silvia Nedwed
  affiliation: 
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  email: anneludt@uni-mainz.de
- name: Federico Marini
  affiliation: 
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), Mainz
  - Center for Thrombosis and Hemostasis (CTH), Mainz
  email: marinif@uni-mainz.de
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('GeDi')`"
output: 
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{The GeDi User's Guide}
  %\VignetteEncoding{UTF-8}  
  %\VignettePackage{GeDi}
  %\VignetteKeywords{FunctionalAnnotation, Enrichment Analysis, 
  Distance measurements, Exploration, Visualization, GUI}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
bibliography: GeDi.bib
---

<style type="text/css">
.smaller {
  font-size: 10px
}
</style>

**Compiled date**: `r Sys.Date()`

**Last edited**: 2024-02-29

**License**: `r packageDescription("GeDi")[["License"]]`

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = FALSE,
  warning = FALSE,
  eval = TRUE,
  message = FALSE,
  fig.width = 8
)
options(width = 100)
```

<hr>

# Introduction {#introduction}

This vignette introduces the usage of the `r BiocStyle::Biocpkg("GeDi")` package
for exploring the results of functional annotation and enrichment analyses.

`r BiocStyle::Biocpkg("GeDi")` is a versatile package designed to simplify the 
exploration and comprehension of functional annotation and enrichment analysis 
results. It offers a `r BiocStyle::CRANpkg("shiny")` application that combines 
interactivity, visualization, and reproducibility to consolidate comprehensive 
outcomes.

To incorporate `r BiocStyle::Biocpkg("GeDi")` into your workflow, you'll need 
the results of a functional annotation or enrichment analysis. This vignette 
demonstrates the core functionalities of `r BiocStyle::Biocpkg("GeDi")` using a
publicly available dataset from Alasoo et al., as described in their paper 
"Shared genetic effects on chromatin and gene expression indicate a role for 
enhancer priming in immune response" [@Alasoo2018].

Accessible through the `r BiocStyle::Biocpkg("macrophage")` Bioconductor package,
this dataset comprises files generated from Salmon quantification (version 
0.12.0, with Gencode v29 reference) and gene-level summarized values.

Within the `r BiocStyle::Biocpkg("macrophage")` experimental setup, samples 
derive from six different donors under four distinct conditions: naive, treated 
with Interferon gamma, with SL1344, or with a combination of Interferon gamma 
and SL1344. For illustration, we will focus on comparing Interferon
gamma-treated samples with naive samples.

# Getting started {#gettingstarted}

Before you can start using GeDi, the package needs to be installed on your 
machine. To install the package, begin by opening R and executing the following
command:

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("GeDi")
```

Once installed, the package can be loaded and attached to your current workspace
as follows:

```{r loadlib}
library("GeDi")
```

With the package attached, you can start the application by running 
`GeDi()`. 

```{r launchapp, eval=FALSE}
GeDi()
```

This action will open the application, directing you to the **Welcome** page. 
From there, you can easily provide your data using the **Data Input** panel on 
the left side menu, ensuring it's in the correct format for analysis.

Alternatively, you can initiate the application by executing:

```{r launchappwithData, eval=FALSE}
GeDi(
  genesets = geneset_df,
  ppi = ppi_df,
  distance_scores = distance_scores_df
)
```

where

- `geneset_df` represents your input data in the form of a `data.frame`, which 
should include at least one column named "Genesets" containing geneset 
identifiers and one column named "Genes" containing a comma-separated list of 
genes belonging to each respective geneset.
- `ppi_df` is another `data.frame` containing protein-protein interaction scores,
with columns named "from", "to", and "combined_score".
- `distance_scores_df` is a sparse `Matrix` containing the distance scores of 
the genesets in your data.

All of these parameters are optional, as you can alternatively upload, download,
and compute them directly within the application. However, some of these 
processes may require a significant amount of time, especially with larger 
datasets. Therefore, it may be advantageous to save the intermediate results, 
such as the downloaded PPI and computed distance scores, for later use within 
the application.
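
As a minimal sketch of the expected `genesets` format, the chunk below builds a 
toy `data.frame` with the two required columns and stores it with `saveRDS()` 
so that it could be reused in a later session. All object names, geneset 
identifiers, and gene lists are made up for illustration; only the column names 
"Genesets" and "Genes" are taken from the description above.

```{r sketch-input-format}
# Toy input in the described format (identifiers and genes are made up)
toy_genesets <- data.frame(
  Genesets = c("geneset_A", "geneset_B", "geneset_C"),
  Genes = c("IRF1,STAT1,GBP1", "STAT1,GBP1,CXCL10,CXCL9", "TNF,IL6"),
  stringsAsFactors = FALSE
)

# Check that the two required columns are present
all(c("Genesets", "Genes") %in% colnames(toy_genesets))

# Store the object in a temporary file and read it back, mimicking the reuse
# of intermediate results across sessions
toy_file <- tempfile(fileext = ".rds")
saveRDS(toy_genesets, toy_file)
toy_reloaded <- readRDS(toy_file)
identical(toy_genesets, toy_reloaded)
```

An object restored this way could then be supplied at launch, for example via 
`GeDi(genesets = toy_reloaded)`.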

In this vignette, we demonstrate the functionality of 
`r BiocStyle::Biocpkg("GeDi")` using enrichment analysis results from 
the `r BiocStyle::Biocpkg("macrophage")` dataset. To immediately start exploring
the application, you can simply execute:

```{r examplerun, eval=FALSE}
GeDi()
```

and load the example data with the `Load example data` button in the 
**Data Input** panel. 

Alternatively, you can proceed by following the subsequent code chunks to create 
the necessary input objects, step by step. This can serve as a reference guide 
for the steps ideally executed prior to analyzing the data with 
`r BiocStyle::Biocpkg("GeDi")`.

To utilize `r BiocStyle::Biocpkg("GeDi")`, you'll require results from a 
functional annotation analysis. In this vignette, we'll demonstrate how to 
conduct an enrichment analysis on differentially expressed (DE) genes from the 
`r BiocStyle::Biocpkg("macrophage")` dataset.

First, we'll load the macrophage data and create a `DESeqDataSet`, as the 
subsequent differential expression analysis will be performed using 
`r BiocStyle::Biocpkg("DESeq2")` [@Love2014].

```{r create_dds}
# Load required libraries
library("macrophage")
library("DESeq2")

# Load the example dataset "gse" from the "macrophage" package
data("gse", package = "macrophage")

# Create a DESeqDataSet object using the "gse" dataset and define the 
# experimental design.
# We use the condition as part of the experimental design, because we are 
# interested in the differentially expressed genes between treatments. We also 
# add the line to the design to account for the inherent differences between 
# the donors.
dds_macrophage <- DESeqDataSet(gse, design = ~ line + condition)

# Change the row names of the DESeqDataSet object to Ensembl IDs
rownames(dds_macrophage) <- gsub("\\..*", "", rownames(dds_macrophage))

# Have a look at the resulting DESeqDataSet object
dds_macrophage
```

Now that we've obtained our `DESeqDataSet`, we can conduct the differential 
expression (DE) analysis. In this vignette, we'll utilize the results from 
comparing two distinct conditions of the dataset, specifically `IFNg` and 
`naive`, while accounting for the cell line of origin.

Before executing the DE analysis, we'll filter out lowly expressed features 
from the dataset. In this instance, we'll keep only the genes with at least 10 
counts in at least 6 samples, where 6 corresponds to the smallest group size in 
the dataset.

Subsequently, we'll conduct the DE analysis and test against a null hypothesis
of an absolute log2FoldChange of at most 1 (instead of the default of 0) to 
ensure that we identify genes with consistent and robust changes in expression.


Finally, we'll append the gene symbols to the resultant `DataFrame`, which will 
later serve as our "Genes" column in the input data for 
`r BiocStyle::Biocpkg("GeDi")`.

```{r create_resde1}
# Filter genes based on read counts
# Flag the genes with at least 10 counts in at least 6 samples
keep <- rowSums(counts(dds_macrophage) >= 10) >= 6

# Subset the DESeqDataSet object to keep only the selected genes
dds_macrophage <- dds_macrophage[keep, ]

# Have a look at the resulting DESeqDataSet object
dds_macrophage
```

```{r create_resde2}
# Perform differential expression analysis using DESeq2
dds_macrophage <- DESeq(dds_macrophage)

# Extract differentially expressed genes
# Perform contrast analysis comparing "IFNg" condition to "naive" condition
# Set a log2 fold change threshold of 1 and a significance level (alpha) of 0.05
res_macrophage_IFNg_vs_naive <- results(dds_macrophage,
  contrast = c("condition", "IFNg", "naive"),
  lfcThreshold = 1, alpha = 0.05
)

# Add gene symbols to the results in a column "SYMBOL"
res_macrophage_IFNg_vs_naive$SYMBOL <- rowData(dds_macrophage)$SYMBOL
```

After completing the differential expression analysis, we move on to conduct 
the functional annotation analysis. To begin, we extract the differentially 
expressed (DE) genes from the previously generated results and identify the 
background genes to be utilized for functional enrichment.

For the enrichment analysis, we use the overrepresentation analysis method 
provided by the `r BiocStyle::Biocpkg("topGO")` package. To streamline the 
integration of these results into `r BiocStyle::Biocpkg("GeDi")`, we utilize the 
`topGOtable` function from the `r BiocStyle::Biocpkg("pcaExplorer")` package. 
By default, this function employs the `BP` ontology and the `elim` method, which
helps decorrelate the Gene Ontology (GO) graph structure, resulting in less 
redundant functional categories. The output is a `data.frame` object that 
seamlessly integrates with `r BiocStyle::Biocpkg("GeDi")`.

However, as `r BiocStyle::Biocpkg("GeDi")` has only minimal requirements for the
input, enrichment results generated using `r BiocStyle::Biocpkg("clusterProfiler")`
can also be utilized. While we primarily tested results from the `enrichGO` 
method during `r BiocStyle::Biocpkg("GeDi")` development, those from the 
`enrichKEGG` and `enrichPathway` methods are also compatible.


```{r create_resenrich1, eval=TRUE}
# Load required packages for analysis
library("pcaExplorer")
library("GeneTonic")
library("AnnotationDbi")

# Extract gene symbols from the DESeq2 results object where FDR is below 0.05
# The function deseqresult2df is used to convert the DESeq2 results to a 
# dataframe format
# FDR is set to 0.05 to filter significant results
de_symbols_IFNg_vs_naive <- deseqresult2df(res_macrophage_IFNg_vs_naive,
                                           FDR = 0.05)$SYMBOL

# Extract gene symbols for background using the DESeq2 results object
# Filter genes that have nonzero counts
bg_ids <- rowData(dds_macrophage)$SYMBOL[rowSums(counts(dds_macrophage)) > 0]
```

```{r create_resenrich2, eval=TRUE}
# Load required package for analysis
library("topGO")
library("org.Hs.eg.db")

# Perform Gene Ontology enrichment analysis using the topGOtable function from 
# the "pcaExplorer" package
macrophage_topGO_example <-
  pcaExplorer::topGOtable(de_symbols_IFNg_vs_naive,
    bg_ids,
    ontology = "BP",
    mapping = "org.Hs.eg.db",
    geneID = "symbol",
    topTablerows = 500
  )
```

As mentioned earlier, `r BiocStyle::Biocpkg("GeDi")` expects the input to 
contain at least two columns: one named "Genesets" and one named "Genes". While 
this is not strictly mandatory when providing your data interactively during an 
application session, it becomes necessary if you intend to initiate the 
application with your input as parameters (e.g., 
`GeDi(genesets = my_genesets_df)`). In such cases, the "Genesets" column should 
contain identifiers for each geneset in the input, while the "Genes" column 
should consist of comma-separated lists of genes associated with each geneset.

Therefore, we will adjust the column names of the resulting `data.frame` from 
the enrichment analysis to adhere to the required format.

```{r renamecolumns, eval=TRUE}
# Rename columns in the macrophage_topGO_example dataframe
# Change the column name "GO.ID" to "Genesets"
names(macrophage_topGO_example)[names(macrophage_topGO_example) == "GO.ID"] <- "Genesets"

# Change the column name "genes" to "Genes"
names(macrophage_topGO_example)[names(macrophage_topGO_example) == "genes"] <- "Genes"
```
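
For `r BiocStyle::Biocpkg("clusterProfiler")`-style results, a similar 
adjustment could look like the sketch below. It assumes a table with an `ID` 
column and a `geneID` column in which the genes are separated by `/`, as 
typically returned by `enrichGO`. The table itself is a mock with made-up 
content, and `r BiocStyle::Biocpkg("GeDi")` may already handle this format 
internally, so treat this as an illustration of the target format rather than 
as a required step.

```{r sketch-clusterprofiler-format}
# Mock table mimicking the relevant columns of an enrichGO() result
# (all values are made up; real results contain additional columns)
toy_enrich <- data.frame(
  ID = c("term_1", "term_2"),
  Description = c("toy term 1", "toy term 2"),
  geneID = c("IRF1/STAT1/GBP1", "TNF/IL6/CXCL10"),
  stringsAsFactors = FALSE
)

# Build the two columns described above: identifiers go to "Genesets", and the
# "/"-separated gene lists are converted to comma-separated lists in "Genes"
toy_input <- data.frame(
  Genesets = toy_enrich$ID,
  Genes = gsub("/", ",", toy_enrich$geneID, fixed = TRUE),
  stringsAsFactors = FALSE
)

toy_input
```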
 
 

## All set!

Now that we've obtained functional annotation results from the 
`r BiocStyle::Biocpkg("macrophage")` dataset, we can begin exploring the data 
using `r BiocStyle::Biocpkg("GeDi")`. You have two options: you can either launch
the application and supply the generated data using the `GeDi()` command, or if
you've followed this vignette, you can initiate the application directly with
the loaded data by executing `GeDi(genesets = macrophage_topGO_example)`.

```{r dryrun, eval=FALSE}
GeDi()

GeDi(genesets = macrophage_topGO_example)
```

The code shown above will open the application, directing you to the **Welcome**
page. The **Welcome** page of `r BiocStyle::Biocpkg("GeDi")` serves as the entry
point to the application, providing users with an overview of its features and 
functionalities. Upon launching the application, users are greeted with a 
user-friendly interface designed to facilitate the exploration and interpretation
of functional annotation and enrichment analysis results. The **Welcome** page 
offers guidance on how to navigate the application and highlights key components
such as data input options, visualization tools, and interactive features. 
Whether users are new to GeDi or returning to explore additional datasets, 
the **Welcome** page serves as a central hub for accessing resources and getting 
started with their analysis journey.

# Description of the `GeDi` user interface {#userinterface}

The `r BiocStyle::Biocpkg("GeDi")` application, developed with the 
`r BiocStyle::CRANpkg("shiny")` framework, incorporates the modern design 
elements of the `r BiocStyle::CRANpkg("bs4Dash")` package, which is built upon 
Bootstrap 4. This combination of technologies ensures a sleek and visually 
appealing user interface for navigating and interacting with the functionality 
offered by `r BiocStyle::Biocpkg("GeDi")`. By leveraging the features of
`r BiocStyle::CRANpkg("shiny")` and `r BiocStyle::CRANpkg("bs4Dash")`, 
`r BiocStyle::Biocpkg("GeDi")` provides users with an intuitive and 
aesthetically pleasing environment for conducting functional annotation and 
enrichment analyses on their datasets.

## Header (navbar)

The dashboard navbar in `r BiocStyle::Biocpkg("GeDi")`, referred to as such in 
the `r BiocStyle::CRANpkg("bs4Dash")` framework, features a dropdown menu 
accessible by clicking on the respective "info" icon. The menu offers additional
functionality through various buttons:

- The open book icon - This option allows users to explore the 
`r BiocStyle::Biocpkg("GeDi")` vignette, either the version bundled with the 
package or the online version, providing detailed documentation and usage 
guidelines.
- The information i circle - Selecting this option displays information
about the current session, presenting details such as the R environment and 
loaded packages, helpful for troubleshooting and debugging purposes.
- The heart button - This button offers general information about 
`r BiocStyle::Biocpkg("GeDi")`, including links to its development version for 
contribution and guidelines on citing the tool in research publications.

Besides the dropdown menu, users can also find the `Bookmark` button in the
navbar. The `Bookmark` button in the `r BiocStyle::Biocpkg("GeDi")` navbar serves
as a convenient tool for users to save and bookmark genes and genesets of 
interest for later reference. To use this feature, users must first select or 
click on a gene or geneset that they wish to bookmark. Once the desired gene or 
geneset is selected, users can then click on the `Bookmark` button to add it to 
a list of bookmarked items within the `r BiocStyle::Biocpkg("GeDi")` application.
This functionality enables users to organize and revisit specific genes or 
genesets that they find relevant or intriguing during their exploration of 
functional annotation and enrichment analysis results. The bookmarked genes and 
genesets can later be found in the **Report** panel.

## Sidebar

By clicking the menu bar icon on the left side of the app (or simply by moving 
the mouse over to the left side if viewing the app in full screen mode), users 
can activate the sidebar menu. This sidebar menu serves as the primary means of 
accessing the various panels of the `r BiocStyle::Biocpkg("GeDi")` application, 
providing navigation to different functionalities. More detailed explanations of
each panel will be provided in the next section.


## Body

The structure of `r BiocStyle::Biocpkg("GeDi")` is designed around different 
panels, each of which becomes active upon clicking the corresponding icons or 
text in the sidebar.

While the Welcome panel is relatively self-explanatory, additional information 
and explanations are provided for the functionality of the remaining panels. For
new users seeking guidance, there's a question circle button available to 
initiate an interactive tour of `r BiocStyle::Biocpkg("GeDi")`. This tour allows 
users to learn the basic usage mechanisms by actively engaging with the
interface. During the tour, specific elements are highlighted in response to 
user actions, while the rest of the UI remains shaded to maintain focus. Users 
can interrupt the tour at any time by clicking outside the highlighted window,
and navigation between steps is facilitated by arrow buttons (left, right). The 
tour functionality is implemented using the `r BiocStyle::CRANpkg("rintrojs")`
package.

# The `GeDi` functionality {#functionality}

The `r BiocStyle::Biocpkg("GeDi")` `r BiocStyle::CRANpkg("shiny")` application is
organized into distinct panels, each serving a specific purpose, which will be 
thoroughly explored in the following sections.

## The Welcome panel

This panel serves as a guide for utilizing `r BiocStyle::Biocpkg("GeDi")` 
effectively. It offers detailed instructions on generating input data for the 
application, elucidating the expected input format and outlining the various 
interactive elements present in the app's other panels.

```{r welcome-page2, fig.align = "center", fig.cap = "The Welcome panel of GeDi", echo = FALSE}
knitr::include_graphics("Welcome_page.png")
```


## The Data Input panel

This panel serves as a hub for managing data input if it's not provided within 
the function call. It's divided into distinct boxes, each representing a step of
the data input process, which sequentially appear as you successfully complete 
each preceding step.


**Step 1**: Provide your Genesets as input data

In the initial **Step 1** box, you can provide your data by utilizing the
**Browse** button. This action opens a modal window enabling you to select the 
relevant file from your computer storage. After successfully loading the data, a
preview is displayed in the **Genesets preview** box on the right. During this 
step, the application checks if your input contains the "Genesets" and "Genes" 
columns. If these columns are missing, a small error message appears in the lower
right corner. Additionally, two drop-down menus allow you to select the correct 
columns from your data and update the input accordingly.

You also have the option to start using `r BiocStyle::Biocpkg("GeDi")` with 
preprocessed example data based on the `r BiocStyle::Biocpkg("macrophage")` 
dataset. Simply click the **Load demo data** button to load the example data's 
enrichment results. You can explore these results in the **Genesets preview** box.

However, instead of loading demo data and observing the expected data structure 
through the **Genesets preview** box, you can also use the 
**Have a look at the data structure** button. By clicking this button, a modal 
window with a visual representation of the expected input data structure will 
open. This screenshot serves as a helpful guide, providing you with a clear 
understanding of how your data should be formatted for optimal compatibility 
with `r BiocStyle::Biocpkg("GeDi")`.

Once you have successfully loaded some data, the data input process will proceed
and two additional boxes will be displayed in the panel. 

```{r data-input-step1, fig.align = "center", fig.cap = "The Data input panel - Step 1", echo = FALSE}
knitr::include_graphics("Data_Input_panel_Step1.png")
```


**Optional Filtering Step**: Filter generic genesets

The first new box, the **Optional Filtering Step**, lets you refine your geneset 
selection. This step is not required for data exploration, but it can notably 
reduce the runtime of downstream processing. Here, you can exclude large and 
generic genesets from your dataset, which makes the results easier to interpret. 
Genesets can be removed either individually or based on their size.

The box features a histogram illustrating geneset sizes, providing visual context
for the filtering process. Within the interface, two input fields are available 
for customization. The left input field facilitates the selection of individual 
genesets by their identifiers in the "Genesets" column of your dataset.
Meanwhile, the right input field empowers you to establish a threshold "x" for 
filtering genesets with a size greater than or equal to "x." This interactive 
approach ensures tailored filtering suited to your specific analysis requirements.

Once you've chosen the genesets you wish to exclude from your dataset, you can 
initiate the filtering process by clicking the "Remove the selected Genesets" 
button. This action will remove all selected genesets from the dataset. 
Additionally, you have the option to save the filtered data using the "Download 
the filtered data" button. Clicking this button will save the filtered data to 
your local machine. This feature can be particularly beneficial for users who 
intend to revisit their data in a new instance of GeDi and want to ensure that 
previously identified uninsightful genesets have already been filtered out.
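
The size-based filter described above can be sketched in a few lines of base R. 
This is only an illustration under stated assumptions: the toy table, the object 
names, and the threshold value are made up, and the chunk mimics the described 
behavior (removing genesets with a size greater than or equal to the threshold) 
rather than reproducing the code used inside the application.

```{r sketch-size-filter}
# Toy genesets of different sizes (made-up content)
toy_sets <- data.frame(
  Genesets = c("set_small", "set_medium", "set_large"),
  Genes = c("A,B", "A,B,C,D", "A,B,C,D,E,F,G,H"),
  stringsAsFactors = FALSE
)

# Geneset size = number of entries in the comma-separated "Genes" column
toy_sizes <- lengths(strsplit(toy_sets$Genes, ",", fixed = TRUE))
toy_sizes

# Remove all genesets with a size greater than or equal to the threshold
toy_threshold <- 8
toy_filtered <- toy_sets[toy_sizes < toy_threshold, ]
toy_filtered$Genesets
```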



```{r optional-filtering, fig.align = "center", fig.cap = "Optional Filtering Step", echo = FALSE}
knitr::include_graphics("Optional_Filtering.png")
```


**Step 2**: Species Selection

Upon advancing to the second box labeled **Step 2**, you'll encounter the crucial
task of selecting the species associated with your dataset. This step holds 
significant importance for the computation of the **pMM score** within 
`r BiocStyle::Biocpkg("GeDi")`, which heavily relies on a 
**Protein-Protein Interaction (PPI)** matrix. This matrix plays a pivotal role in
capturing protein interaction strength, thereby enriching distance scores with 
valuable biological context. To access and utilize this essential information, 
specifying the species linked to your dataset is mandatory. By clicking the
input field, you'll prompt a dropdown menu showcasing preselected species options.
If your species is included, simply make your selection. Alternatively, if your 
species is not listed, you have the option to manually input it. In cases of 
uncertainty, a convenient link provided on the right directs you to the STRING 
database, enabling verification of species details and PPI availability for 
informed decision-making.

```{r species-selection, fig.align = "center", fig.cap = "Species Selection", echo = FALSE}
knitr::include_graphics("Species_Selection.png")
```

**Step 3**: PPI Matrix Download

Following species selection, a third box named **Step 3** will emerge. In this 
phase, you have the opportunity to download the Protein-Protein Interaction (PPI)
matrix. This process may necessitate some time, with a progress bar positioned 
in the lower right corner providing real-time updates on the download status. 
Once the download is complete, you can conveniently preview the PPI matrix
within the **PPI Preview** box situated on the right-hand side of the interface. 
This will show that the PPI consists of three columns: **Gene1** and **Gene2**, 
housing the gene symbols corresponding to the interacting proteins, and a column 
labeled **combined_score**, denoting the confidence level of each interaction. 
The assigned score is derived from the number of known interactions between two
proteins, normalized to the [0, 1] interval using the formula:

$$
\begin{aligned} 
\text{combinedScore} = \frac{\#\text{interactions} - \min}{\max - \min} 
\end{aligned}
$$ 

where **min** and **max** represent the minimum and maximum number of
interactions, respectively.
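
As a small worked example of this min-max normalization, consider the following 
sketch. The interaction counts and object names are made up for illustration and 
do not stem from an actual PPI download.

```{r sketch-minmax-normalization}
# Made-up numbers of known interactions for four protein pairs
toy_interactions <- c(150, 400, 900, 650)

# Min-max normalization as in the formula above
toy_combined_score <- (toy_interactions - min(toy_interactions)) /
  (max(toy_interactions) - min(toy_interactions))

round(toy_combined_score, 3)
```

The pair with the fewest interactions is mapped to 0 and the pair with the most 
interactions to 1, while all other pairs fall in between.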

In addition to downloading a PPI matrix during the current session, users can 
also upload a previously saved matrix for analysis using the **Browse** button. 
This functionality allows users to work with their own customized datasets or 
previously analyzed PPI matrices. Furthermore, saving the downloaded PPI matrix 
locally enables users to store the data on their machine for future use. By 
saving the matrix locally via the **Save PPI matrix** button, users can access 
the data quickly in subsequent sessions without having to wait for the download 
process again. This capability significantly enhances workflow efficiency and 
allows for seamless continuation of analysis across different sessions.

```{r download-ppi, fig.align = "center", fig.cap = "Downloading the PPI", echo = FALSE}
knitr::include_graphics("Downloading_PPI.png")
```

Note that the final two steps are optional: the PPI matrix is only required for 
a single score, the **pMM score**. Therefore, you can commence data exploration 
without necessarily completing these additional steps.

Upon concluding the essential tasks outlined in this panel, you are ready to 
progress to the **Distance Scores** panel.

## The Distance Scores panel

This panel focuses mainly on computing distance scores for the provided input
data. Like the preceding panel, it is segmented into two distinct sections, 
each serving a specific function.


```{r distance-score, fig.align = "center", fig.cap = "The Distance Score panel", echo = FALSE}
knitr::include_graphics("Distance_Score_panel.png")
```


**Calculating Distance Scores**

In the upper box, titled **Calculate distance scores for your Genesets**, you 
have the flexibility to select from various distance scores for computation. 
This feature provides users with a range of options to tailor the analysis 
according to their specific requirements and preferences. The available scores 
are: 

* **pMM Score**: This score integrates protein-protein interaction (PPI) data 
into the Meet-Min distance. The PPI-weighted Meet-Min (**pMM**) score is defined 
as 

$$
\begin{aligned}
pMM = \min(pMM(A \to B), pMM(B \to A))
\end{aligned}
$$
where 

$$
\begin{aligned}
pMM(A \to B) = 1 - \frac{|A \cap B|}{\min(|A|, |B|)} - \frac{\alpha}{\min(|A|, |B|)} \sum_{a \in A - B} \frac{w \sum_{b \in A \cap B} P(a, b) + \sum_{b \in B - A} P(a, b)}{\max(P) \, (w \, |A \cap B| + |B - A|)}  
\end{aligned}
$$
and 

$$
\begin{aligned}
w = \frac{\min(|A|, |B|)}{|A| + |B|}
\end{aligned}
$$
$P$ is the PPI matrix and $\alpha$ is a scaling factor between 0 and 1. The PPI 
matrix can be downloaded from the **Data Input** panel. More details can be 
found in the paper by Yoon et al. [@Yoon2019].

* **Kappa Score**: The **Kappa** distance is a set-based metric based on observed
and expected agreement rates between two genesets. It is defined as 

$$
\begin{aligned}
Kappa = 1 - \frac{O - E}{1 - E}
\end{aligned}
$$

where 

$$
\begin{aligned}
O = \frac{|A \cap B| + |(A \cup B)^c|}{|U|} \\
E = \frac{|A| |B| + |A^c| |B^c|}{|U|^2}
\end{aligned}
$$

$U$ is the set of all unique genes in the data, and $X^c$ denotes the complement 
of a set $X$ within $U$. In this application the Kappa distance is additionally 
normalized to the [0, 1] interval to make it comparable to the remaining 
distance metrics.

* **Jaccard Score**: The **Jaccard** distance uses the Jaccard coefficient, which
is transformed into a distance metric by subtracting it from 1. It is defined as 

$$
\begin{aligned}
Jaccard = 1 - \frac{|A \cap B|}{|A \cup B|}
\end{aligned}
$$


* **Meet-Min Score**: The **Meet-Min** (MM) distance transforms the overlap 
coefficient into a distance measure by subtracting it from 1. The overlap 
coefficient is a similarity measure which is defined as 

$$
\begin{aligned}
OC = \frac{|A \cap B|}{\min(|A|, |B|)}
\end{aligned}
$$

In order to transform this measure of similarity into a measure of distance, 
the overlap coefficient is subtracted from 1, resulting in the calculation of 
the Meet-Min (MM) distance as 

$$
\begin{aligned}
MM = 1 - \frac{|A \cap B|}{\min(|A|, |B|)}
\end{aligned}
\end{aligned}
$$

As a solely set based measurement, the Meet-Min distance only takes the 
composition of the genesets into account but not the underlying biological 
information inherent in the genesets.

* **Sorensen-Dice**: The **Sorensen-Dice** distance uses the Sorensen-Dice
coefficient, which is transformed into a distance metric by subtracting it 
from 1. It is defined as

$$
\begin{aligned}
\text{Sorensen-Dice}(A, B) = 1 - \frac{2 \, |A \cap B|}{|A| + |B|}
\end{aligned}
$$
As a solely set based measurement, the Sorensen-Dice distance only takes the 
composition of the genesets into account but not the underlying biological 
information inherent in the genesets.

* **GO distance**: The **GO distance** score measures the relationship between 
gene sets that are represented by GO terms. It builds on the semantic similarity 
measures implemented in the `r BiocStyle::Biocpkg("GOSemSim")` R package, which 
fall into two main types: information content (IC)-based methods (e.g., Resnik, 
Lin, Schlicker, and Jiang) and graph-based methods (e.g., Wang). These methods 
compute similarity scores 
based on shared characteristics, such as the most informative common ancestor 
in IC-based methods or the hierarchical structure of the GO database in 
graph-based methods. To integrate these scores into distance-based analyses, 
the similarity scores are converted into distance scores by subtracting the 
similarity score from 1. This transformation ensures compatibility with other 
distance metrics used in `r BiocStyle::Biocpkg("GeDi")`. While applicable only 
to GO terms, this approach is particularly useful in gene function analyses.
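
To make the set-based definitions above more concrete, the following sketch 
computes the Jaccard, Meet-Min, Sorensen-Dice, and Kappa distances for two toy 
genesets in base R. The helper functions, the genesets, and the gene universe 
are hypothetical and written only for this illustration; they are not the 
functions exported by `r BiocStyle::Biocpkg("GeDi")`. The sketch assumes that 
both genesets contain no duplicates and are subsets of the universe, and it 
omits the additional normalization of the Kappa distance mentioned above.

```{r sketch-set-distances}
# Hypothetical helper functions following the formulas above
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}
meet_min_dist <- function(a, b) {
  1 - length(intersect(a, b)) / min(length(a), length(b))
}
sorensen_dice_dist <- function(a, b) {
  1 - 2 * length(intersect(a, b)) / (length(a) + length(b))
}
kappa_dist <- function(a, b, universe) {
  n <- length(universe)
  # Observed agreement: genes in both sets plus genes in neither set
  observed <- (length(intersect(a, b)) +
    length(setdiff(universe, union(a, b)))) / n
  # Expected agreement based on the set sizes and their complements
  expected <- (length(a) * length(b) +
    (n - length(a)) * (n - length(b))) / n^2
  1 - (observed - expected) / (1 - expected)
}

# Two toy genesets sharing two genes, within a universe of ten genes
toy_a <- c("a", "b", "c", "d")
toy_b <- c("c", "d", "e")
toy_universe <- letters[1:10]

toy_scores <- c(
  jaccard = jaccard_dist(toy_a, toy_b),
  meet_min = meet_min_dist(toy_a, toy_b),
  sorensen_dice = sorensen_dice_dist(toy_a, toy_b),
  kappa = kappa_dist(toy_a, toy_b, toy_universe)
)
round(toy_scores, 3)
```

For the same pair of genesets, the Meet-Min distance is the smallest of the 
four because it relates the overlap to the size of the smaller geneset only.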

Each scoring method possesses its own set of advantages and drawbacks, 
underscoring the importance of selecting one that suits your dataset 
characteristics and analysis goals. Upon choosing a score, the 
**Compute the distances between genesets** button appears on the right 
side. Clicking this button initiates the scoring procedure, which may require 
some time to execute, particularly for larger datasets. To monitor the progress 
of this operation, refer to the progress bar located in the lower right corner 
of the panel. Once the scoring process concludes, you can delve into the 
**Geneset Distance Scores** box to explore a variety of visual representations 
of your data.


**Distance Scores Visualizations**

* **Distance Scores Heatmap**: The initial visualization offered is a heatmap 
illustrating the distribution of distance scores. The heatmap is generated by 
clicking the **Calculate Distance Score Heatmap** button. Following computation, 
users can interact with the heatmap by hovering 
over it, revealing the involved genesets and their corresponding scores. 
Additionally, users can zoom in on specific areas of interest. To reset the 
zoomed view, a simple click outside the heatmap area suffices.

* **Distance Scores Dendrogram**: The second visualization provided is a 
dendrogram showcasing individual distance scores. Hierarchical clustering is 
employed to generate the dendrogram, which effectively groups genesets exhibiting 
the highest similarity. To enhance the dendrogram's presentation, users can 
select different combination methods using the drop-down menu located on the 
left side.

* **Distance Scores Graph**: The final visualization available is the network 
representation of distance scores. In this representation, nodes/genesets with 
scores below a predefined threshold are connected by edges. By default, the 
threshold is set to 0.3, but users can adjust it via the slider located on the
left. This interactive graph allows users to hover over or click on nodes to 
highlight connected nodes and obtain additional information about genesets upon
selection. Furthermore, users can search for specific genesets using the input 
field on the left, with the selected geneset being subsequently highlighted in 
the graph. The **Graph metrics** table at the bottom of this box contains various
metrics pertaining to the graph, such as degree, betweenness, harmonic centrality,
clustering coefficient, and input data. This tabulated information serves to 
provide users with valuable insights into the underlying data and distance scores.
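
The ideas behind the dendrogram and the graph can be illustrated with a small 
base R sketch. It is not the code used inside the app: the distance scores are 
invented, the linkage method `"average"` is just one possible choice, and the 
object names are hypothetical.

```{r distance-vis-sketch, eval = FALSE}
# Toy distance scores between four genesets (values invented for illustration)
ids <- c("gs1", "gs2", "gs3", "gs4")
scores <- matrix(
  c(0.0, 0.1, 0.2, 0.9,
    0.1, 0.0, 0.6, 0.8,
    0.2, 0.6, 0.0, 0.7,
    0.9, 0.8, 0.7, 0.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(ids, ids)
)

# Dendrogram: hierarchical clustering on the distance scores
hc <- hclust(as.dist(scores), method = "average")
plot(hc, main = "Toy distance dendrogram")

# Graph: connect genesets whose distance score is below the threshold
threshold <- 0.3
adjacency <- scores < threshold
diag(adjacency) <- FALSE # no self-loops

# Degree: number of neighbours of each geneset in the thresholded graph
degree <- rowSums(adjacency)
degree
```

In this toy example only the pairs `gs1`/`gs2` and `gs1`/`gs3` fall below the 
threshold of 0.3, so `gs4` remains unconnected. Raising the threshold adds 
edges, lowering it removes them.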

**Bookmarking from this panel:** 

As users navigate through the distance scores of genesets in this section, they
may encounter genesets and interactions that capture their interest and merit 
further investigation. To aid in preserving these noteworthy genesets for later
exploration, you can utilize the **Bookmark** button situated in the Navbar. 
Upon clicking this button, the selected geneset will be added to the list of 
bookmarked genesets within the **Report** panel. Additionally, informative 
messages displayed in the lower right corner will guide users through the 
bookmarking process.

Once you've finished exploring the distance scores, you can proceed to the 
**Clustering Graph** panel.

## The Clustering Graph panel

This panel is dedicated to the computation of clusters among genesets based on 
their similarity, which is derived from the previously calculated distance 
scores. Similar to the preceding panel, it comprises two distinct boxes. Within 
these boxes, users can access functionalities to determine and visualize 
clusters of genesets that exhibit comparable characteristics or functions.

The computation of clusters involves grouping genesets that display similar 
patterns of distance scores, thereby indicating shared biological characteristics
or functional relationships. This clustering process enables users to identify 
cohesive groups of genesets with related functionalities or involvement in 
similar biological processes.

```{r clustering-graph, fig.align = "center", fig.cap = "The Clustering Graph panel", echo = FALSE}
knitr::include_graphics("Clustering_Graph_Panel.png")
```


**Choosing a Clustering algorithm**

The upper box, labeled **Select the clustering method**, provides a selection of
distinct clustering algorithms. Users can explore various options to find the 
most suitable algorithm for their analysis:

* **Louvain**: The Louvain algorithm, a prevalent tool in biological network 
analysis, seeks to divide graph nodes into clusters to optimize the modularity 
metric. This metric gauges the strength of connections within clusters relative 
to those between clusters. Consequently, nodes within the same cluster exhibit 
greater similarity to one another than to nodes outside the cluster. This 
clustering approach aims to enhance data interpretation by grouping similar 
genesets together. Users can adjust a slider in the bottom left corner of the
box to set a similarity threshold, determining when genesets are considered 
similar based on distance scores.  

* **Markov**: The Markov algorithm, commonly employed in biological network 
analysis, is designed to pinpoint densely interconnected regions within graphs.
These regions frequently align with communities or clusters in the graph 
structure. Users can utilize a slider located in the bottom left corner of the 
box to specify a similarity threshold, determining when genesets are deemed 
similar based on distance scores. 

* **Fuzzy clustering**: The Fuzzy Clustering algorithm is a computational 
technique used to partition data points into clusters based on their similarity,
while allowing for data points to belong to multiple clusters with varying 
degrees of membership. It operates through distinct steps and requires the 
specification of different thresholds. Firstly, the **Similarity threshold** is 
set to determine if two genesets exhibit sufficient similarity to be potentially 
clustered together. Secondly, the **Membership threshold** dictates how many 
members of a potential cluster must possess a close relationship, defined by a 
distance score less than or equal to the similarity threshold, for the cluster 
to persist. Lastly, the **Clustering threshold** determines whether two clusters
will be merged. Clusters are merged if their percentage of overlap meets or 
exceeds the clustering threshold. Users can adjust all thresholds using sliders
provided in the interface.

* **PAM**: The PAM (Partitioning Around Medoids) clustering algorithm 
partitions nodes into k distinct clusters, where **k** is a user-defined 
parameter. The algorithm iteratively assigns each node to the nearest medoid, 
i.e. the geneset acting as the representative center of a cluster, based on the 
calculated distance scores, and then updates the medoids to minimize the total 
distance between the nodes and the medoid of their cluster. Users can 
specify the number of clusters, **k**, using a slider in the interface, 
allowing them to tailor the clustering process to the needs of their analysis.
Adjusting the value of k enables the exploration of different clustering 
granularities, providing flexibility in interpreting the data and identifying 
meaningful patterns.
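
To give an impression of how the choice of **k** affects a PAM clustering, the 
following sketch runs `pam()` from the `r BiocStyle::CRANpkg("cluster")` 
package on an invented distance matrix. This is an illustration under stated 
assumptions and not necessarily how the clustering is implemented inside the 
app; all object names and values are hypothetical.

```{r pam-sketch, eval = FALSE}
# "cluster" is a recommended package that ships with standard R installations
library("cluster")

# Toy distance scores: two groups of three genesets each (values invented)
ids <- paste0("gs", 1:6)
scores <- matrix(0.9, nrow = 6, ncol = 6, dimnames = list(ids, ids))
scores[1:3, 1:3] <- 0.1 # genesets 1 to 3 are close to each other
scores[4:6, 4:6] <- 0.2 # genesets 4 to 6 are close to each other
diag(scores) <- 0

# Partition the genesets into k = 2 clusters around medoids
pam_res <- pam(as.dist(scores), k = 2, diss = TRUE)

# Cluster assignment of each geneset and the medoid chosen for each cluster
pam_res$clustering
pam_res$medoids
```

With `k = 2` the two groups built into the toy data are recovered. Larger 
values of `k` split these groups further, which corresponds to a finer 
clustering granularity.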

Once you choose a method, you can start the cluster calculation via the 
**Cluster the Genesets** button on the right. Keep in mind that this step might 
take some time, especially for larger datasets. Look for the progress bar in the
lower right corner for updates on the clustering status.

Once the clusters are calculated, you can explore various visualizations of your
data in the **Geneset Cluster Graphs** box.

**Cluster Visualizations**

* **Geneset Graph**: In the **Geneset Graph**, clusters are visualized as a graph,
with individual genesets serving as nodes and edges connecting genesets within 
the same cluster. To highlight specific nodes, utilize the **Select by id** 
feature on the left, or choose to highlight entire clusters by selecting the 
respective option under **Select by cluster**. Please note that only genesets 
belonging to at least one cluster will be displayed in this graph. For additional
insights, nodes can be colored based on specific parameters from your input data,
accessible through the **Color the graph by** drop-down menu. Depending on the 
information provided with your data, various options will be available. While 
interacting with the network, nodes can be moved by clicking and dragging them 
to desired locations, offering flexibility in managing node placement which is 
particularly useful in complex or densely populated graphs.

* **Cluster-Geneset Bipartite Graph**: The **Cluster-Geneset Bipartite Graph** 
presents a bipartite representation of the clusters. In this visualization, nodes
represent both clusters and genesets, with edges connecting cluster nodes to 
their corresponding geneset members. Hovering over nodes provides additional data
insights. Cluster nodes display the members within each cluster, while geneset 
nodes showcase the genes associated with each geneset.

* **Cluster Enrichment Terms Word Cloud**: The 
**Cluster Enrichment Terms Word Cloud** displays the most frequently occurring 
terms for each cluster. This visualization proves particularly useful when your
data includes brief descriptions of the genesets, in addition to the mandatory 
input data. By utilizing the **Select a cluster** drop-down menu, you can 
designate the cluster of interest. Furthermore, hovering over the word cloud 
enables you to select individual terms and view the frequency with which each 
term appears in the descriptions of the genesets within that cluster.

* **Clustering graph summaries**: The cluster information is also summarized in 
a table-like format in the **Clustering graph summaries** box. This table 
displays each geneset alongside the cluster to which it belongs. Additionally, 
the table features a search function, facilitating the quick retrieval of a 
geneset of interest.
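
The term frequencies behind a word cloud can be approximated with a few lines 
of base R. The sketch below is a simplified illustration and not the app's 
implementation: the geneset descriptions and the short stop word list are 
invented for this example.

```{r wordcloud-terms-sketch, eval = FALSE}
# Toy descriptions of the genesets in one cluster (invented for illustration)
descriptions <- c(
  "interferon gamma response",
  "response to interferon gamma",
  "cellular response to virus"
)

# Split the descriptions into lower-case words
words <- unlist(strsplit(tolower(descriptions), "[^a-z]+"))
words <- words[nzchar(words)]

# Drop a few uninformative words (hypothetical stop word list)
stop_words <- c("to", "of", "the", "and")
words <- words[!words %in% stop_words]

# Count how often each term occurs, most frequent terms first
term_freq <- sort(table(words), decreasing = TRUE)
term_freq
```

Terms that occur in many descriptions of a cluster, here "response", obtain the 
highest counts and would be displayed most prominently.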

**Bookmarking from this panel:**

While exploring the **Clustering Graph panel**, users may encounter genesets 
and clusters that intrigue them and warrant further investigation. To facilitate
the preservation of these notable genesets and clusters for future exploration,
users can utilize the **Bookmark** button located in the Navbar. Clicking this 
button will add the selected geneset or cluster to the list of bookmarked items 
within the Report panel. Helpful messages displayed in the lower right corner 
will assist users throughout the bookmarking process.

In order to bookmark interesting genesets and clusters, users simply select a 
geneset or cluster from the Geneset Graph or Cluster-Geneset Bipartite Graph 
and use the Bookmark button to add the respective information to the set of 
bookmarked features. 

After exploring the results in the **Clustering Graph** panel, users can proceed 
to the **Report** panel to have a look at the bookmarked genesets and clusters 
or iterate through the individual panels of the app for a more in-depth 
exploration of the data.

## The Report panel

In this panel of the application, users can obtain a comprehensive overview of 
the items they have bookmarked for further exploration. On the left side of the 
interface, bookmarked genesets are listed, while bookmarked clusters are 
displayed on the right side.

During an interactive exploration session, recalling specific details about each 
bookmarked item can sometimes be challenging. Therefore, users are provided with
convenient options to manage their bookmarked data.

Below the interactive tables displaying bookmarked genesets and clusters, users 
can find buttons allowing them to download the content of each table individually.
Additionally, the **Start the generation of the report** button is provided to 
generate a detailed report encompassing all selected elements of interest.

The report generation process utilizes a predefined template report included 
within the `r BiocStyle::Biocpkg("GeDi")` package. This template leverages the 
input elements and reactive values associated with the bookmarks, ensuring that 
the generated report contains comprehensive and relevant information.

The resulting report serves as a valuable tool for creating a permanent and 
reproducible analysis output. Users can easily store or share this report for 
future reference or collaboration purposes.

```{r report-panel, fig.align = "center", fig.cap = "The Report panel", echo = FALSE}
knitr::include_graphics("Report_panel.png")
```


# Additional Information {#additionalinfo}

If you have questions about the package or the available functionality, please 
submit them on the Bioconductor [support site](https://support.bioconductor.org/)
using the tag 'GeDi'.

Bug reports can be opened as issues in the `r BiocStyle::Biocpkg("GeDi")` 
[GitHub repository](https://github.com/AnnekathrinSilvia/GeDi/issues).
Please note that the GitHub repository also hosts the development version of the 
package, where new functionality is continuously added - be cautious, as you may 
be working with cutting-edge versions!

The authors welcome thoughtful suggestions for enhancements or new features, and 
even better, pull requests.

# Additional example data

In this section, we present additional examples showing how 
`r BiocStyle::Biocpkg("GeDi")` can be used with functional enrichment results 
based on two widely used pathway databases, KEGG and Reactome. The step-by-step 
demonstrations cover both the generation of such results and their use as input 
for the app, whether you are investigating specific pathways or broader 
biological processes.

Results containing identifiers from databases like KEGG [@Kanehisa2023] or 
Reactome [@Gillespie2022], e.g. generated using the `enrichKEGG` function from 
the `r BiocStyle::Biocpkg("clusterProfiler")` package or the `enrichPathway` 
function from the `r BiocStyle::Biocpkg("ReactomePA")` package, can be utilized 
as input for `r BiocStyle::Biocpkg("GeDi")`. We will again use the data of the 
`r BiocStyle::Biocpkg("macrophage")` package, specifically the differentially
expressed genes we have identified before. With this data, we demonstrate how to
generate the results and prepare them for their use in 
`r BiocStyle::Biocpkg("GeDi")`.

However, before we can use the `enrichKEGG` function from the 
`r BiocStyle::Biocpkg("clusterProfiler")` package, we have to map the ENSEMBL ids
of the data to Entrez ids. For this, we will first use the 
`r BiocStyle::Biocpkg("biomaRt")` package to generate a mapping of ENSEMBL to 
Entrez.

```{r withbiomart, eval = FALSE}
# Load the "biomaRt" package to access the BioMart database
library("biomaRt")

# Set up a connection to the ENSEMBL BioMart database for human genes
mart <-
  useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# Retrieve gene annotations using the BioMart database
anns <- getBM(
  attributes = c(
    "ensembl_gene_id",
    "external_gene_name",
    "entrezgene_id",
    "description"
  ),
  filters = "ensembl_gene_id",
  values = rownames(dds_macrophage),
  mart = mart
)

# Match the retrieved annotations to the genes in dds_macrophage
anns <- anns[match(rownames(dds_macrophage), anns[, 1]), ]
```

Next, we map the differentially expressed genes to get the right identifiers and
run the `enrichKEGG` function. We set the organism to human and the p-value 
cutoff to 5%.

```{r enrichKegg, eval = FALSE}
# Load the "clusterProfiler" package for functional enrichment analysis
library("clusterProfiler")

# Retrieve Entrez gene IDs from the annotations data frame based on matching 
# Ensembl gene IDs from the DE results
genes <- anns$entrezgene_id[match(rownames(res_macrophage_IFNg_vs_naive),
                                  anns$ensembl_gene_id)]

# Perform KEGG pathway enrichment analysis using the retrieved gene IDs
res_enrich <- enrichKEGG(genes,
  organism = "hsa",
  pvalueCutoff = 0.05
)
```
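
The mapping step can produce missing values, since not every ENSEMBL id has a 
corresponding Entrez id. The following base R sketch illustrates how `match()` 
behaves in this situation and how unmapped entries could be removed. All 
identifiers and object names (`toy_anns`, `toy_de_ids`, `toy_genes`) are 
invented for illustration and are not part of the workflow above.

```{r id-mapping-sketch, eval = FALSE}
# Toy annotation table standing in for the biomaRt result (values invented)
toy_anns <- data.frame(
  ensembl_gene_id = c("ENSG01", "ENSG02", "ENSG03"),
  entrezgene_id = c(101L, NA, 303L)
)

# Toy set of gene ids; "ENSG99" is not part of the annotation table
toy_de_ids <- c("ENSG03", "ENSG99", "ENSG01", "ENSG02")

# match() returns NA for ids missing from the table, and the annotation
# itself can contain NA for genes without an Entrez id
toy_entrez <- toy_anns$entrezgene_id[match(toy_de_ids, toy_anns$ensembl_gene_id)]

# Remove unmapped entries and convert the ids to unique character values
toy_genes <- unique(as.character(toy_entrez[!is.na(toy_entrez)]))
toy_genes
```

Checking how many identifiers are lost in this step is a quick way to judge 
whether the mapping worked as intended.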

We can now use the results of the enrichment in `r BiocStyle::Biocpkg("GeDi")`.
For this, we directly start the app with the loaded data. If you have not computed 
the data following this workflow, you can load it beforehand from the example 
data available in this package.  

```{r GeDi_Kegg, eval = FALSE}
# Load the "macrophage_KEGG_example" dataset from the "GeDi" package
data("macrophage_KEGG_example", package = "GeDi")

# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_KEGG_example" 
# dataset
GeDi(genesets = macrophage_KEGG_example)
```

In a similar manner, we can use the Reactome database for the functional 
annotation. Here, we use the `r BiocStyle::Biocpkg("ReactomePA")` package and 
the differentially expressed genes. 

```{r enrichReactome, eval = FALSE}
# Load the "ReactomePA" package for pathway enrichment analysis
library("ReactomePA")

# Perform pathway enrichment analysis using the "enrichPathway" function
reactome <- enrichPathway(genes,
  organism = "human",
  pvalueCutoff = 0.05,
  readable = TRUE
)
```

Now we can use the results in the same manner as for the KEGG pathway analysis. 

```{r GeDi_Reactome, eval = FALSE}
# Load the "macrophage_Reactome_example" dataset from the "GeDi" package
data("macrophage_Reactome_example", package = "GeDi")

# Start the GeDi app with the loaded data
# The "genesets" parameter is set to the loaded "macrophage_Reactome_example" 
# dataset
GeDi(genesets = macrophage_Reactome_example)
```



# FAQs {#faqs}

**Q: My configuration on two machines is somewhat different, so I am having difficulty in finding out what packages are different. Is there something to help on this?**

A: Yes, you can check out `r BiocStyle::Githubpkg("federicomarini/sessionDiffo")`,
a small utility to compare two different `sessionInfo` outputs.
This can help you pinpoint what packages might be causing the issue.
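
If you prefer a quick manual check, the package versions recorded by 
`sessionInfo()` can also be compared with base R. The sketch below is a 
minimal illustration and unrelated to the utility mentioned above: the helper 
`pkg_versions()` is hypothetical, and the two version vectors are invented 
stand-ins for the information collected on each machine.

```{r session-compare-sketch, eval = FALSE}
# Hypothetical helper: named vector of versions for attached and loaded packages
pkg_versions <- function(si = utils::sessionInfo()) {
  pkgs <- c(si$otherPkgs, si$loadedOnly)
  vapply(pkgs, function(p) p$Version, character(1))
}

# Invented example vectors, e.g. saved from pkg_versions() on each machine
machine_a <- c(GeDi = "1.0.0", shiny = "1.8.0", igraph = "2.0.1")
machine_b <- c(GeDi = "1.0.0", shiny = "1.7.5")

# Packages that are only present on the first machine
only_a <- setdiff(names(machine_a), names(machine_b))

# Packages present on both machines but with differing versions
shared <- intersect(names(machine_a), names(machine_b))
differing <- shared[machine_a[shared] != machine_b[shared]]

only_a
differing
```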

**Q: I am using a different service/software for generating the results of functional enrichment analysis. How do I plug this into `GeDi`?**

A: You can use nearly any result of a functional enrichment analysis in 
`r BiocStyle::Biocpkg("GeDi")` as long as the results are transformed in a way 
that they fit the input requirements. Please check out the **Welcome** page to 
see the specification of the input requirements. 
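
As a rough sketch of such a transformation, the example below reshapes a toy 
result table into a data frame with one identifier column and one gene list 
column. The target column names `Genesets` and `Genes` are an assumption made 
for this illustration, as is the toy table itself. Please verify the required 
column names and the accepted gene separators on the **Welcome** page before 
using this pattern.

```{r custom-input-sketch, eval = FALSE}
# Toy result table with one row per geneset (all values invented)
toy_enrich <- data.frame(
  ID = c("path:1", "path:2"),
  Description = c("Toy pathway one", "Toy pathway two"),
  geneID = c("101/303", "303/404/505")
)

# Assumed target layout: identifiers in "Genesets", gene lists in "Genes"
toy_input <- data.frame(
  Genesets = toy_enrich$ID,
  Genes = toy_enrich$geneID,
  Description = toy_enrich$Description
)

toy_input
```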


# Session Info {- .smaller}

```{r sessioninfo}
utils::sessionInfo()
```

# References {-}