--- title: "Gene Set Enrichment Analysis with ggpicrust2" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Gene Set Enrichment Analysis with ggpicrust2} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction This vignette demonstrates how to perform Gene Set Enrichment Analysis (GSEA) on PICRUSt2 predicted functional data using the ggpicrust2 package. GSEA is a powerful method for interpreting gene expression data by focusing on gene sets (pathways) rather than individual genes. In the context of microbiome functional prediction, GSEA can help identify pathways that are enriched in different conditions. ## Installation First, make sure you have the necessary packages installed: ```{r setup, eval=FALSE} # Install ggpicrust2 if (!requireNamespace("ggpicrust2", quietly = TRUE)) { devtools::install_github("cafferychen777/ggpicrust2") } # Install required Bioconductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install(c("limma", "fgsea", "clusterProfiler", "enrichplot", "DOSE", "pathview")) # Load the package library(ggpicrust2) library(dplyr) library(ggplot2) ``` ## Method Selection Guide The `pathway_gsea()` function supports multiple methods for gene set enrichment analysis: | Method | Type | Covariate Support | Description | |--------|------|-------------------|-------------| | `camera` (default) | Competitive | ✅ Yes | **Recommended.** Accounts for inter-gene correlations, providing more reliable p-values | | `fry` | Self-contained | ✅ Yes | Fast rotation test, efficient for large gene set collections | | `fgsea` | Preranked | ❌ No | Fast preranked GSEA, but p-values may be unreliable | | `clusterProfiler` | Preranked | ❌ No | Traditional GSEA implementation | **Why use camera/fry?** Wu et al. (2012) demonstrated that preranked GSEA methods can produce "spectacularly wrong p-values" due to not accounting for inter-gene correlations. The `camera` and `fry` methods from limma address this limitation. ## Basic GSEA Analysis (Recommended: camera method) Let's start with a basic GSEA analysis using the recommended `camera` method: ```{r basic-gsea, eval=FALSE} # Load example data data(ko_abundance) data(metadata) # Prepare abundance data abundance_data <- as.data.frame(ko_abundance) rownames(abundance_data) <- abundance_data[, "#NAME"] abundance_data <- abundance_data[, -1] # Run GSEA analysis with camera (recommended) gsea_results <- pathway_gsea( abundance = abundance_data, metadata = metadata, group = "Environment", pathway_type = "KEGG", method = "camera", min_size = 5, max_size = 500, p.adjust = "BH" ) # View the top results head(gsea_results) ``` ## Covariate Adjustment One of the most powerful features of the `camera` and `fry` methods is the ability to adjust for confounding variables. This is particularly important in microbiome studies where factors like age, sex, BMI, and batch effects can influence results. ```{r covariate-gsea, eval=FALSE} # Example with covariate adjustment # Assuming metadata has columns: group, age, sex, BMI gsea_results_adjusted <- pathway_gsea( abundance = abundance_data, metadata = metadata, group = "Disease", covariates = c("age", "sex", "BMI"), # Adjust for these confounders pathway_type = "KEGG", method = "camera" ) # The results now reflect the group effect after adjusting for confounders head(gsea_results_adjusted) ``` ## Fast Analysis with fry For large datasets or when computational efficiency is important, use the `fry` method: ```{r fry-gsea, eval=FALSE} # Fast rotation gene set test gsea_results_fry <- pathway_gsea( abundance = abundance_data, metadata = metadata, group = "Environment", pathway_type = "KEGG", method = "fry", min_size = 5, max_size = 500 ) head(gsea_results_fry) ``` ## Legacy: Preranked GSEA (fgsea) For compatibility with existing workflows, preranked GSEA methods are still available: ```{r fgsea, eval=FALSE} # Note: Preranked methods don't account for inter-gene correlations # P-values may be less reliable (Wu et al., 2012) gsea_results_fgsea <- pathway_gsea( abundance = abundance_data, metadata = metadata, group = "Environment", pathway_type = "KEGG", method = "fgsea", rank_method = "signal2noise", nperm = 1000, min_size = 10, max_size = 500, p.adjust = "BH", seed = 42 ) # View the top results head(gsea_results_fgsea) ``` ## Annotating GSEA Results To make the results more interpretable, we can annotate them with pathway names and descriptions: ```{r annotate-gsea, eval=FALSE} # Annotate GSEA results annotated_results <- gsea_pathway_annotation( gsea_results = gsea_results, pathway_type = "KEGG" ) # View the annotated results head(annotated_results) ``` ## Visualizing GSEA Results The ggpicrust2 package provides several visualization options for GSEA results. The `visualize_gsea()` function automatically detects whether pathway names are available (from `gsea_pathway_annotation()`) and uses them for better readability, falling back to pathway IDs if names are not available. ### Pathway Label Options The `visualize_gsea()` function offers flexible pathway labeling: ```{r pathway-labels, eval=FALSE} # Option 1: Use raw GSEA results (shows pathway IDs) plot_with_ids <- visualize_gsea( gsea_results = gsea_results, plot_type = "barplot", n_pathways = 10 ) # Option 2: Use annotated results (automatically shows pathway names) plot_with_names <- visualize_gsea( gsea_results = annotated_results, plot_type = "barplot", n_pathways = 10 ) # Option 3: Explicitly specify which column to use for labels plot_custom_labels <- visualize_gsea( gsea_results = annotated_results, plot_type = "barplot", pathway_label_column = "pathway_name", n_pathways = 10 ) # Compare the plots plot_with_ids plot_with_names plot_custom_labels ``` ### Barplot ```{r barplot, eval=FALSE} # Create a barplot of the top enriched pathways barplot <- visualize_gsea( gsea_results = annotated_results, plot_type = "barplot", n_pathways = 20, sort_by = "p.adjust" ) # Display the plot barplot ``` ### Dotplot ```{r dotplot, eval=FALSE} # Create a dotplot of the top enriched pathways dotplot <- visualize_gsea( gsea_results = annotated_results, plot_type = "dotplot", n_pathways = 20, sort_by = "p.adjust" ) # Display the plot dotplot ``` ### Enrichment Plot ```{r enrichment-plot, eval=FALSE} # Create an enrichment plot for a specific pathway enrichment_plot <- visualize_gsea( gsea_results = annotated_results, plot_type = "enrichment_plot", n_pathways = 10, sort_by = "NES" ) # Display the plot enrichment_plot ``` ### Ridge Plot Ridge plots (also known as joy plots) provide a powerful way to visualize the distribution of gene abundances or fold changes within enriched pathways. This helps understand whether pathways are predominantly up- or down-regulated. ```{r ridge-plot, eval=FALSE} # Create a ridge plot for GSEA results # Note: Requires ggridges package to be installed ridge_plot <- pathway_ridgeplot( gsea_results = gsea_results, abundance = abundance_data, metadata = metadata, group = "Environment", pathway_type = "KEGG", n_pathways = 10, sort_by = "p.adjust", show_direction = TRUE, colors = c("Down" = "#3182bd", "Up" = "#de2d26") ) # Display the plot ridge_plot ``` The ridge plot shows: - Each pathway as a density ridge - Color indicates enrichment direction (Up = red, Down = blue) - The distribution of log2 fold changes for genes within each pathway - A vertical dashed line at 0 for reference ## Comparing GSEA and DAA Results It can be informative to compare the results from GSEA with those from Differential Abundance Analysis (DAA): ```{r compare-gsea-daa, eval=FALSE} # Run DAA analysis daa_results <- pathway_daa( abundance = abundance_data, metadata = metadata, group = "Environment", daa_method = "ALDEx2" ) # Annotate DAA results annotated_daa_results <- pathway_annotation( pathway = "KO", daa_results_df = daa_results, ko_to_kegg = TRUE ) # Compare GSEA and DAA results comparison <- compare_gsea_daa( gsea_results = annotated_results, daa_results = annotated_daa_results, plot_type = "venn", p_threshold = 0.05 ) # Display the comparison plot comparison$plot # View the comparison results comparison$results ``` ## Conclusion Gene Set Enrichment Analysis provides a complementary approach to Differential Abundance Analysis for interpreting PICRUSt2 predicted functional data. By focusing on pathways rather than individual features, GSEA can help identify biologically meaningful patterns that might be missed by traditional methods. ### Key Recommendations 1. **Use `camera` method (default)** for most analyses - it provides more reliable p-values by accounting for inter-gene correlations 2. **Adjust for covariates** when you have confounding factors like age, sex, or BMI in your study design 3. **Use `fry` method** when you need faster computation for large gene set collections 4. **Use preranked methods (`fgsea`)** only when you need compatibility with existing workflows, and interpret p-values with caution ### References - Wu, D., & Smyth, G. K. (2012). Camera: a competitive gene set test accounting for inter-gene correlation. *Nucleic Acids Research*, 40(17), e133. - Wu, D., et al. (2010). ROAST: rotation gene set tests for complex microarray experiments. *Bioinformatics*, 26(17), 2176-2182. The ggpicrust2 package now offers a comprehensive suite of tools for both DAA and GSEA analyses, making it easier to gain insights from microbiome functional prediction data.