funOmics

Elisa Gómez de Lope

2024-02-16

Introduction

The funOmics R package is a collection of functions designed to aggregate omics data into higher-level functional representations such as pathways, protein complexes, and cellular locations. This vignette provides a detailed guide on how to use the package.

Omics data analysis is a critical component of modern biomedical research. The funOmics package provides a tool for aggregating omics data from high-throughput experiments (e.g. transcriptomics, metabolomics, proteomics) into higher-level functional activity scores that can then be used for further analysis and modeling. This capability provides a more global view of the biological systems, reduces the dimensionality, and facilitates biological interpretation of results.

The package provides different pooling operators, such as aggregation statistics (mean, median, standard deviation, min, max), dimension-reduction derived scores (PCA, NMF, MDS, pathifier deregulation scores from the pathifier package), or test statistics (t-test, Wilcoxon test, Kolmogorov–Smirnov test) with options for adjusting parameters and settings to suit specific research questions and data types. The package is also well-documented, with detailed descriptions of each function and several examples of usage.

funOmics distinguishes itself from existing Bioconductor packages dedicated to pathway or gene set analysis such as GSEA and ORA (clusterProfiler, fgsea, GSEAset), or GSVA, by offering a comprehensive tool for directly aggregating diverse omics data types into higher-level functional representations, allowing the analysis of such functional representations as functional activity scores that can be modeled as input features for identifying candidate biomarkers, or in clustering strategies for patient identification. Unlike GSEA and ORA, which primarily focus on gene expression and predefined gene sets, funOmics accommodates various omics modalities (e.g., metabolomics, transcriptomics, proteomics), and allows users to define custom molecular sets for aggregation. Additionally, funOmics goes beyond GSVA by providing flexibility in the choice of aggregation operators, enabling users to derive interpretable functional activity scores tailored to their specific research questions. By offering a flexible and user-friendly, alternative tool for functional analysis, funOmics aims to contribute to the diverse array of Bioconductor packages and enhance the capabilities of the community.

Installation

Install funOmics from Bioconductor (release 3.19 onwards) via:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("funOmics")

or the pre-release and latest development version from GitHub:

if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("elisagdelope/funOmics") 

Usage

Loading the Package

To use the funOmics R package, load it with the following command:

library(funOmics)
## 

Main Function: summarize_pathway_level

You can then access the main function provided by the package, summarize_pathway_level with the type of pooling operator desired to be applied for each molecular set. This function has several options for adjusting parameters and settings to suit specific research questions and data types. The available aggregation operators and other parameters options are described in detail in the package documentation. You can also see the documentation for this function with the command:

?funOmics::summarize_pathway_level

Let’s see two examples of usage, with a simulated gene expression matrix X of dimensions g*s (g genes and s samples), and a list of 100 gene sets pathways. The expression values are random positive values sampled from a standard normal distribution. These examples mimic gene expression data and pathway gene sets, but funOmics can be used to aggregate other types of omics data and molecular sets, such as metabolomics or proteomics. Go to section Real data: Where to find molecular sets?, to see where to find real molecular sets for different types of omics. Note that if your actual omics data is in SummarizedExperiment format, the assay matrix can easily be retrieved as a matrix format (assay(SEobject), see Example 3).

Generate simulation data

# Example usage:
set.seed(1)
g <- 10000
s <- 20
X <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(paste0("g", 1:g), paste0("s", 1:s)))
pathways <- as.list(sample(10:100, size = 100, replace = TRUE))
pathways <- lapply(pathways, function(s, g) paste0("g", sample(1:g, size = s, replace = FALSE)), g)
names(pathways) <- paste0("pathway", seq_along(pathways))
print(paste("Dimensions of omics matrix X:", dim(X)[1], "*", dim(X)[2], "; Length of molecular sets list:", length(pathways)))
## [1] "Dimensions of omics matrix X: 10000 * 20 ; Length of molecular sets list: 100"
print(head(X))
##           s1        s2        s3        s4        s5        s6        s7
## g1 0.6264538 0.8043316 0.2353485 0.6179223 0.2212571 0.5258908 0.3413341
## g2 0.1836433 1.0565257 0.2448250 0.8935057 0.3517935 0.4875444 0.4136665
## g3 0.8356286 1.0353958 0.6421869 0.4277562 0.1606019 1.1382508 0.1220357
## g4 1.5952808 1.1855604 1.9348085 0.2999012 0.1240523 1.2151344 1.5893806
## g5 0.3295078 0.5004395 1.0386957 0.5319833 0.6598739 0.4248307 0.7874385
## g6 0.8204684 0.5249887 0.2835501 1.7059816 0.5038493 1.4508403 1.5920640
##            s8         s9        s10       s11          s12       s13       s14
## g1 1.00203611 1.55915937 0.09504307 0.7914415 0.0145724495 0.8663644 0.1604426
## g2 0.02590761 0.20166217 0.38805939 0.3921679 1.7854043337 0.9476952 0.9241849
## g3 0.44814178 1.04017610 2.13657003 0.4726670 0.0002997544 0.4522428 1.5561751
## g4 0.84323332 0.07195772 0.55661945 0.4579517 0.4356948690 0.2782408 0.8812202
## g5 0.21846310 0.01526544 0.59094164 0.1681319 1.4076452475 1.4175945 0.5263595
## g6 0.47678629 0.33938598 1.52014345 0.5856737 0.6929698698 0.6329981 0.4627372
##            s15       s16       s17       s18        s19        s20
## g1 0.249371112 1.2520152 2.3150930 0.4414410 0.82485558 1.37086468
## g2 0.335346796 0.3351313 1.0603800 0.4130862 0.74402087 0.54569610
## g3 0.004405287 0.1080678 0.3970672 0.8660777 0.69009734 1.62446330
## g4 0.986768348 0.4717051 0.4840034 2.2615708 1.76900681 0.06247283
## g5 0.543575705 2.5070607 1.3584146 0.1787018 0.55215640 0.57021255
## g6 0.626142823 1.2451344 0.6370574 0.4713582 0.03257056 0.31350574
print(head(pathways))
## $pathway1
##  [1] "g6248" "g3566" "g2064" "g8249" "g8532" "g5047" "g4018" "g7193" "g2574"
## [10] "g3728" "g4493" "g4394" "g9017" "g2334" "g7652" "g9604" "g9035" "g4100"
## [19] "g408"  "g9042" "g1229" "g2484" "g7442" "g7239" "g3059" "g811"  "g509" 
## [28] "g2182" "g8420" "g5947" "g1425" "g7218" "g7758" "g1236" "g6966" "g9457"
## [37] "g2668" "g9397" "g7394" "g2975" "g4304" "g9533" "g5287" "g5723" "g6432"
## [46] "g3826" "g3774" "g5906" "g6948" "g295"  "g6675" "g2471" "g4504" "g5162"
## [55] "g3110" "g1901" "g7837" "g5189" "g4422" "g2063" "g7950" "g2301" "g1107"
## [64] "g856"  "g6523" "g6524" "g1880" "g1675" "g7082" "g9914" "g2523" "g6892"
## [73] "g3631" "g8880" "g5560" "g6779" "g5713" "g5485" "g6659" "g6694" "g9592"
## [82] "g2434" "g3335" "g7725" "g3508" "g9399" "g7404" "g7160" "g9628" "g9345"
## 
## $pathway2
##  [1] "g1046" "g9504" "g6237" "g4026" "g2799" "g1764" "g8273" "g9165" "g6322"
## [10] "g9264" "g6996" "g3352" "g6635" "g2801" "g4064" "g2936" "g1823" "g2693"
## [19] "g3927" "g9267" "g3880" "g4245" "g2254" "g1570" "g6507" "g7158" "g7553"
## [28] "g292"  "g7602" "g7571" "g2353" "g3741" "g5694" "g706"  "g9877" "g6889"
## [37] "g4524" "g9776" "g4015" "g3747" "g6639" "g2807" "g2101" "g6546" "g8113"
## [46] "g847"  "g9948" "g6219" "g4743" "g7975" "g466"  "g869"  "g5913" "g8362"
## [55] "g7665" "g838"  "g8119" "g167"  "g2999" "g2484" "g8541" "g7628" "g8437"
## [64] "g6932" "g3017" "g6458" "g8875" "g7429" "g6948" "g8641" "g2004" "g338" 
## [73] "g9654" "g1781" "g5126" "g9549" "g469"  "g5229" "g115"  "g7608" "g2606"
## [82] "g3479" "g5674"
## 
## $pathway3
##  [1] "g9331" "g5832" "g5311" "g6300" "g4939" "g3028" "g8170" "g277"  "g2162"
## [10] "g1887" "g1011" "g9797" "g6396" "g7073" "g3353" "g6757" "g3167" "g3991"
## [19] "g3457" "g7550" "g5707" "g5096" "g7510" "g1334" "g6667" "g7433" "g7005"
## [28] "g3474" "g6809" "g9081" "g6952" "g3525" "g6561" "g8581" "g4255" "g1702"
## [37] "g9013" "g8298" "g2499" "g2690" "g6702" "g6605" "g7020" "g1743" "g8261"
## [46] "g9191" "g6258" "g7717" "g4973" "g3082" "g9346" "g2522" "g6260" "g9108"
## [55] "g7869" "g6858" "g7192" "g9378"
## 
## $pathway4
##  [1] "g7398" "g6306" "g4012" "g6673" "g7713" "g2518" "g9017" "g3360" "g4784"
## [10] "g6259" "g9059" "g1799" "g9813" "g5004" "g4810" "g4174" "g2439" "g8029"
## [19] "g5730" "g7516" "g154"  "g7419" "g19"   "g4068" "g4232" "g9767" "g9163"
## [28] "g311"  "g1916" "g3805" "g8445" "g3603" "g6501" "g153"  "g2922" "g8097"
## [37] "g9985" "g9719" "g7827" "g1428" "g2116" "g2994" "g2918" "g8054" "g2103"
## [46] "g3184" "g5540" "g8584" "g8271" "g9014" "g960"  "g1951" "g2607" "g8926"
## [55] "g7844" "g1999" "g1731" "g7758" "g5109" "g9245" "g6640" "g7569" "g8124"
## [64] "g394"  "g7910" "g8973" "g366"  "g2027" "g1070" "g8787" "g5606" "g5249"
## [73] "g7203" "g6650" "g1617"
## 
## $pathway5
##  [1] "g6454" "g2022" "g9627" "g8471" "g9677" "g3516" "g3262" "g4189" "g4912"
## [10] "g3286" "g2037" "g821"  "g3957" "g2482" "g8656" "g9962" "g3550" "g5620"
## [19] "g2042" "g1207" "g8654" "g9313" "g8064" "g2393" "g757"  "g1574" "g9199"
## [28] "g7132" "g9324" "g1657" "g8535" "g5200" "g4639" "g8986" "g6650" "g488" 
## [37] "g6241" "g5407" "g1038" "g198"  "g8992" "g9396" "g9200" "g4567" "g6191"
## [46] "g9970" "g4035" "g1762" "g815"  "g1205" "g5063" "g4883" "g2993" "g3763"
## [55] "g1555" "g5455" "g8534" "g1242" "g3209" "g2687" "g3786" "g9385" "g6187"
## [64] "g3120" "g1013" "g7091" "g6423" "g1548" "g8316" "g4777" "g8392" "g2101"
## [73] "g9991" "g8915" "g3369" "g1075" "g3770" "g9501" "g7476" "g5178" "g9552"
## [82] "g1877" "g6608" "g1399" "g5689" "g4526" "g1043" "g5942" "g2524" "g5958"
## [91] "g8162"
## 
## $pathway6
##  [1] "g748"  "g2887" "g2237" "g6300" "g562"  "g1823" "g1476" "g3523" "g5496"
## [10] "g4240" "g7780" "g2982" "g3004" "g6110" "g2512" "g6411" "g7219" "g8294"

Example usage 1: apply summarize_pathway_level to summarize data with mean pooling and a minimum size of sets

Now we can apply the function summarize_pathway_level. In this example, pathway activity is summarized using the mean pooling aggregation for those sets containing at least 12 genes. Note that you can adjust the minsize and type of aggregation as desired.

min <- 12
pathway_activity <- summarize_pathway_level(X, pathways, type = "mean", minsize = min)
## 
## 100 functional sets read.
## iteration 100
## 95 successful functional aggregations over minsize
## 5 failed functional aggregations under minsize
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 95 , 20"

From the original matrix X with dimensions [g,s] (in this example, 1000 genes and 20 samples), summarize_pathway_level has generated a pathway-level activity score for each of the 20 samples, for 95 pathways having containing more than 12 genes. Let’s see how this matrix looks like:

print(head(pathway_activity))
##                 s1        s2        s3        s4        s5        s6        s7
## pathway1 0.8856707 0.7523236 0.8549943 0.7388130 0.7925286 0.8334914 0.7992642
## pathway2 0.8195863 0.8626590 0.8864375 0.7387156 0.7385007 0.7833466 0.7602959
## pathway3 0.7850385 0.7178379 0.8601687 0.7368610 0.7706303 0.7247713 0.8604497
## pathway4 0.9223525 0.6942613 0.8204792 0.8153523 0.7955479 0.9152442 0.7992134
## pathway5 0.9114610 0.7112581 0.8054646 0.8684732 0.8134853 0.7694264 0.6587641
## pathway6 0.6227958 0.8245890 0.8223787 0.7626611 0.8641242 0.9292614 0.6487087
##                 s8        s9       s10       s11       s12       s13       s14
## pathway1 0.8360021 0.6835664 0.7858110 0.7760340 0.7035882 0.7978041 0.7516856
## pathway2 0.8074307 0.6878013 0.7672425 0.7680346 0.8845340 0.7153691 0.8658347
## pathway3 0.7472542 0.7420240 0.7557788 0.8714399 0.6596007 0.7748371 0.7146510
## pathway4 0.8153612 0.8172311 0.7705315 0.8365182 0.7158393 0.7362310 0.9941277
## pathway5 0.8208152 0.8176990 0.7511963 0.8832247 0.7476202 0.7715769 0.8993376
## pathway6 0.9917968 0.6278624 0.8573334 0.6735154 0.4910625 0.7790914 0.8918141
##                s15       s16       s17       s18       s19       s20
## pathway1 0.7707884 0.8048002 0.8139325 0.7562304 0.8257812 0.7988866
## pathway2 0.7450440 0.8073472 0.8396782 0.7724498 0.7732119 0.6771755
## pathway3 0.7427666 0.9419171 0.9125146 0.7504836 0.7127936 0.7270405
## pathway4 0.7121663 0.7625575 0.7675355 0.8329739 0.8193932 0.7584810
## pathway5 0.8388787 0.8025472 0.8337763 0.7940322 0.6808545 0.7773564
## pathway6 0.7292842 0.7034269 0.6653677 0.7941846 0.7447829 0.8574432

The resulting matrix of higher-level functional representations looks very similar to the original one, except that the original had many more features (1000 instead of 95).

In this example, 5 of the gene sets in pathways has size < 12. You can check which pathways have been left out and how many genes they had by running the following command:

length_sets <- sapply(pathways, function(p) length(p))
short_sets <- names(length_sets[length_sets < min])
print(length_sets[short_sets])
##  pathway20  pathway29  pathway56  pathway60 pathway100 
##         10         11         11         11         11

You can also check which genes were found in these sets:

print(pathways[short_sets])
## $pathway20
##  [1] "g7619" "g7228" "g4784" "g7604" "g2909" "g550"  "g3498" "g9766" "g1379"
## [10] "g9024"
## 
## $pathway29
##  [1] "g9593" "g7338" "g9592" "g1214" "g5061" "g2451" "g2666" "g2299" "g836" 
## [10] "g3948" "g4898"
## 
## $pathway56
##  [1] "g1402" "g2296" "g2731" "g5182" "g3993" "g7093" "g6375" "g3336" "g1762"
## [10] "g2398" "g1667"
## 
## $pathway60
##  [1] "g3429" "g5579" "g6744" "g4102" "g7061" "g5217" "g1025" "g2838" "g1540"
## [10] "g8468" "g6575"
## 
## $pathway100
##  [1] "g3438" "g443"  "g7243" "g8206" "g2375" "g9002" "g625"  "g5807" "g6973"
## [10] "g8004" "g1757"

Example usage 2: apply summarize_pathway_level to summarize data with PCA pooling and no minimum size of sets

Using the same matrix X and gene sets pathways, let’s generate aggregated representations using dimension-reduction derived scores from the PCA. The pca-aggregated activity scores values represent the projection of the overall expression of all genes in each pathway onto the first principal component.

pathway_activity <- summarize_pathway_level(X, pathways, type = "pca", minsize = 0)
## 
## 100 functional sets read.
## iteration 100
## 100 successful functional aggregations over minsize
## 0 failed functional aggregations under minsize
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 100 , 20"

Now from the original matrix X with dimensions [g,s] (1000 genes and 20 samples), summarize_pathway_level has generated a pathway-level activity score for each of the 20 samples for all 100 pathways, since the size of the molecular sets was not restricted. Let’s see how this matrix looks like:

print(head(pathway_activity))
##                   s1         s2         s3         s4         s5          s6
## pathway1 -0.51596316  4.4081016 -3.6172441  0.9078345 -4.3632274  4.47148780
## pathway2 -6.35344708  4.4721898 -0.1072179 -5.6146386 -0.4361479  3.43595874
## pathway3 -3.69833166  1.8035359  0.9002908 -0.7440726  3.7495110  1.25453327
## pathway4 -5.71654009  1.1753201  2.3176206  1.9626837 -2.9970946  0.08046059
## pathway5 -2.45806955 -0.9199807 -6.6012194  3.9444841 -0.5009367  2.02176596
## pathway6  0.02525345  0.3790888  1.4559051  1.2151299 -4.6422525 -2.02724802
##                  s7          s8         s9        s10        s11        s12
## pathway1  0.1786907  2.68548927  0.6952866  0.2297664  4.0740206  1.1389743
## pathway2  2.0274377 -0.05016894  3.8767232  1.0640257 -1.7338074  2.4601009
## pathway3  2.1352637 -1.38294820  1.0078658  4.1406329  0.2845910 -1.6128825
## pathway4  2.8826719 -2.24804495 -2.5774802  5.7096576  2.4964751  2.6572455
## pathway5 -1.0535810 -1.78555013 -0.1863068  1.6203119  4.3788157 -0.1565894
## pathway6 -2.0024241  2.17669923 -0.3979514 -1.7318485  0.6773351 -0.8882183
##                 s13       s14        s15        s16        s17        s18
## pathway1 -0.2447407 -4.932566  2.6488582  0.5928979 -2.2855632  1.0699588
## pathway2  1.4911608 -1.204245  1.1324795 -3.3654006 -0.6572542 -1.8564556
## pathway3 -0.8852759 -4.948525 -1.5682527  4.3381664  1.0671926 -0.6944561
## pathway4  0.2437641 -2.979152  0.8372801 -1.5266307  2.8745670 -0.5905516
## pathway5 -2.0571340  6.301765  2.0743195 -3.5761567  2.6095770  0.6137905
## pathway6 -2.4032898  2.959492  2.1802264 -0.1962708  0.8000977  2.0549126
##                 s19        s20
## pathway1 -5.2955798 -1.8464827
## pathway2 -0.2952237  1.7139303
## pathway3 -2.6939543 -2.4528843
## pathway4 -2.7427071 -1.8595451
## pathway5 -1.9449648 -2.3243405
## pathway6 -0.2149931  0.5803559

Function get_kegg_sets

The function get_kegg_sets retrieves KEGG pathway gene sets for a specified organism. It fetches all pathways available for the specified organism from the KEGG database and maps the genes involved in each pathway. Currently, the function only supports choice of gene identifiers (entrez IDs, gene symbols and Ensembl IDs) for Homo sapiens (organism = “hsa”) using the org.Hs.eg.db package.

get_kegg_sets has two parameters: organism and geneid_type. The parameter organism provides the organism abbreviation for which KEGG pathway gene sets are to be retrieved (e.g., “ecj” for E. coli). Default is “hsa” (Homo sapiens). geneid_type provides the type of gene IDs to provide and is only used when the organism is “hsa” (Homo sapiens). The default is “entrez”; options are “entrez”, “symbol”, or “ensembl”.

The function get_kegg_sets() returns a list where each element represents a KEGG pathway gene set (i.e., a list of lists). The names of the inner lists correspond to the pathway names.

Examples usage

Let’s retrieve KEGG pathway gene sets for Homo sapiens with entrez IDs (default):

hsa_kegg_sets_entrez <- get_kegg_sets()
head(hsa_kegg_sets_entrez)
## $`2-Oxocarboxylic acid metabolism`
##  [1] "100526760" "137362"    "1431"      "162417"    "1629"      "1737"     
##  [7] "1738"      "1743"      "189"       "2805"      "2806"      "2875"     
## [13] "3417"      "3418"      "3419"      "3420"      "3421"      "48"       
## [19] "4967"      "50"        "51166"     "5160"      "5161"      "5162"     
## [25] "55526"     "55753"     "586"       "587"       "593"       "594"      
## [31] "64902"     "84706"     "95"       
## 
## $`ABC transporters`
##  [1] "10057"  "10058"  "10060"  "10257"  "10347"  "10349"  "10350"  "10351" 
##  [9] "1080"   "11194"  "1244"   "154664" "1672"   "19"     "20"     "21"    
## [17] "215"    "22"     "225"    "23456"  "23457"  "23460"  "23461"  "24"    
## [25] "26154"  "340273" "368"    "4363"   "5243"   "5244"   "5825"   "5826"  
## [33] "64137"  "64240"  "64241"  "6833"   "6890"   "6891"   "85320"  "8647"  
## [41] "8714"   "89845"  "94160"  "9429"   "9619"  
## 
## $`AGE-RAGE signaling pathway in diabetic complications`
##   [1] "10000"  "1019"   "1027"   "113026" "1277"   "1278"   "1281"   "1282"  
##   [9] "1284"   "1285"   "1286"   "1287"   "1288"   "1432"   "1536"   "1729"  
##  [17] "177"    "183"    "185"    "1906"   "1958"   "207"    "208"    "2152"  
##  [25] "2277"   "2308"   "23236"  "2335"   "27035"  "3265"   "3383"   "3552"  
##  [33] "3553"   "3569"   "3576"   "3717"   "3725"   "3845"   "4087"   "4088"  
##  [41] "4089"   "4313"   "4772"   "4790"   "4846"   "4893"   "50507"  "5054"  
##  [49] "51196"  "5290"   "5291"   "5292"   "5293"   "5295"   "5296"   "5330"  
##  [57] "5331"   "5332"   "5333"   "5335"   "5336"   "5578"   "5579"   "5580"  
##  [65] "5581"   "5590"   "5594"   "5595"   "5599"   "5600"   "5601"   "5602"  
##  [73] "5603"   "581"    "5879"   "595"    "596"    "5970"   "6300"   "6347"  
##  [81] "6401"   "6772"   "6774"   "6776"   "6777"   "7040"   "7042"   "7043"  
##  [89] "7046"   "7048"   "7056"   "7124"   "7412"   "7422"   "7423"   "7424"  
##  [97] "836"    "84812"  "8503"   "998"   
## 
## $`AMPK signaling pathway`
##   [1] "10000"  "10488"  "10645"  "1080"   "10890"  "10891"  "1149"   "126129"
##   [9] "1374"   "1375"   "1385"   "148"    "148327" "1938"   "1978"   "1994"  
##  [17] "200186" "207"    "208"    "2194"   "2203"   "2308"   "2309"   "23216" 
##  [25] "23411"  "23417"  "2475"   "2538"   "28227"  "29904"  "2997"   "2998"  
##  [33] "31"     "3156"   "3172"   "32"     "3479"   "3480"   "3630"   "3643"  
##  [41] "3667"   "3952"   "3953"   "3991"   "4218"   "5105"   "5106"   "51094" 
##  [49] "51422"  "51552"  "5170"   "51719"  "5207"   "5208"   "5209"   "5210"  
##  [57] "5211"   "5213"   "5214"   "5290"   "5291"   "5293"   "5295"   "5296"  
##  [65] "53632"  "5468"   "55012"  "5515"   "5516"   "5518"   "5519"   "5520"  
##  [73] "5521"   "5522"   "5523"   "5525"   "5526"   "5527"   "5528"   "5529"  
##  [81] "55437"  "5562"   "5563"   "5564"   "5565"   "5571"   "55844"  "57521" 
##  [89] "57818"  "5862"   "595"    "6009"   "6198"   "6199"   "6319"   "64764" 
##  [97] "6517"   "6720"   "6794"   "6885"   "7248"   "7249"   "79602"  "79966" 
## [105] "81617"  "8408"   "84335"  "84699"  "8471"   "8503"   "8660"   "8789"  
## [113] "890"    "8900"   "90993"  "9230"   "92335"  "92579"  "9370"   "948"   
## [121] "9586"  
## 
## $`ATP-dependent chromatin remodeling`
##   [1] "10263"     "10445"     "10467"     "10524"     "10847"     "10856"    
##   [7] "10902"     "10933"     "1107"      "1108"      "11176"     "11177"    
##  [13] "115482686" "124944"    "125476"    "196528"    "2186"      "221613"   
##  [19] "23506"     "26039"     "26122"     "27443"     "283899"    "29117"    
##  [25] "29844"     "29994"     "29998"     "3012"      "3013"      "3014"     
##  [31] "3015"      "3065"      "3066"      "317772"    "404281"    "474381"   
##  [37] "474382"    "4798"      "51377"     "51412"     "51773"     "53615"    
##  [43] "54107"     "54108"     "54556"     "54617"     "54815"     "54891"    
##  [49] "55193"     "55257"     "55274"     "55506"     "55766"     "55929"    
##  [55] "57459"     "57492"     "57504"     "57634"     "5928"      "5931"     
##  [61] "5977"      "60"        "605"       "64431"     "64769"     "6594"     
##  [67] "6595"      "6597"      "6598"      "65980"     "6599"      "6601"     
##  [73] "6602"      "6603"      "6604"      "6605"      "6760"      "6944"     
##  [79] "71"        "723790"    "7528"      "79913"     "80314"     "8089"     
##  [85] "8099"      "8110"      "8193"      "8289"      "8295"      "8329"     
##  [91] "8330"      "8331"      "8332"      "8334"      "8335"      "8336"     
##  [97] "8337"      "8338"      "83444"     "83740"     "84295"     "8467"     
## [103] "85235"     "86"        "8607"      "8932"      "8969"      "9031"     
## [109] "9112"      "9219"      "9274"      "9275"      "92815"     "93973"    
## [115] "94239"     "9555"      "9643"     
## 
## $`Acute myeloid leukemia`
##  [1] "10000" "1050"  "1053"  "11040" "1147"  "1436"  "1437"  "1848"  "1978" 
## [10] "207"   "208"   "2209"  "2322"  "2475"  "2885"  "3265"  "3551"  "3562" 
## [19] "3684"  "369"   "3728"  "3815"  "3845"  "4353"  "4609"  "4790"  "4893" 
## [28] "51176" "5290"  "5291"  "5292"  "5293"  "5295"  "5296"  "5371"  "5467" 
## [37] "5594"  "5595"  "5604"  "5605"  "572"   "5894"  "5914"  "595"   "597"  
## [46] "5970"  "6198"  "6199"  "6654"  "6655"  "6688"  "673"   "6774"  "6776" 
## [55] "6777"  "6932"  "6934"  "7704"  "83439" "8503"  "8517"  "861"   "862"  
## [64] "8864"  "890"   "8900"  "929"

The KEGG molecular sets can also be retrieved for gene symbols with the geneid_type = "symbol" flag:

hsa_kegg_sets_symbol <- get_kegg_sets(geneid_type = "symbol")
hsa_kegg_sets_symbol[1]
## $`2-Oxocarboxylic acid metabolism`
##  [1] "ABHD14A-ACY1" "GOT1L1"       "CS"           "NAGS"         "DBT"         
##  [6] "DLAT"         "DLD"          "DLST"         "AGXT"         "GOT1"        
## [11] "GOT2"         "GPT"          "IDH1"         "IDH2"         "IDH3A"       
## [16] "IDH3B"        "IDH3G"        "ACO1"         "OGDH"         "ACO2"        
## [21] "AADAT"        "PDHA1"        "PDHA2"        "PDHB"         "DHTKD1"      
## [26] "OGDHL"        "BCAT1"        "BCAT2"        "BCKDHA"       "BCKDHB"      
## [31] "AGXT2"        "GPT2"         "ACY1"

And similarly for Ensembl IDs with the geneid_type = "ensembl" flag:

hsa_kegg_sets_ensembl <- get_kegg_sets(geneid_type = "ensembl")
hsa_kegg_sets_ensembl[1]
## $`2-Oxocarboxylic acid metabolism`
##  [1] "ENSG00000114786" "ENSG00000169154" "ENSG00000062485" "ENSG00000161653"
##  [5] "ENSG00000137992" "ENSG00000150768" "ENSG00000091140" "ENSG00000119689"
##  [9] "ENSG00000172482" "ENSG00000120053" "ENSG00000125166" "ENSG00000167701"
## [13] "ENSG00000138413" "ENSG00000182054" "ENSG00000166411" "ENSG00000101365"
## [17] "ENSG00000067829" "ENSG00000122729" "ENSG00000105953" "ENSG00000100412"
## [21] "ENSG00000109576" "ENSG00000131828" "ENSG00000163114" "ENSG00000168291"
## [25] "ENSG00000181192" "ENSG00000197444" "ENSG00000060982" "ENSG00000105552"
## [29] "ENSG00000248098" "ENSG00000083123" "ENSG00000113492" "ENSG00000166123"
## [33] "ENSG00000243989"

get_kegg_sets can also be used to retrieve KEGG pathway gene sets for another organism (e.g., Escherichia coli). Note that the choice of gene identifier is currently not supported for organisms other than Homo sapiens, hence the gene type is that stored by the KEGG database.

ecoli_kegg_sets <- get_kegg_sets(organism = "ecj")
head(ecoli_kegg_sets)
## $`2-Oxocarboxylic acid metabolism`
##  [1] "JW0070" "JW0071" "JW0073" "JW0076" "JW0077" "JW0110" "JW0111" "JW0112"
##  [9] "JW0114" "JW0710" "JW0715" "JW0716" "JW0911" "JW1122" "JW1268" "JW1372"
## [17] "JW2287" "JW2786" "JW3322" "JW3396" "JW3645" "JW3646" "JW3742" "JW3747"
## [25] "JW3929" "JW3930" "JW3984" "JW5103" "JW5553" "JW5605" "JW5606" "JW5807"
## 
## $`ABC transporters`
##   [1] "JW0065" "JW0066" "JW0067" "JW0147" "JW0148" "JW0149" "JW0154" "JW0193"
##   [9] "JW0194" "JW0195" "JW0254" "JW0255" "JW0357" "JW0358" "JW0359" "JW0438"
##  [17] "JW0580" "JW0581" "JW0582" "JW0584" "JW0647" "JW0648" "JW0649" "JW0743"
##  [25] "JW0746" "JW0747" "JW0748" "JW0794" "JW0795" "JW0796" "JW0815" "JW0816"
##  [33] "JW0838" "JW0840" "JW0841" "JW0844" "JW0845" "JW0846" "JW0847" "JW0848"
##  [41] "JW0863" "JW0869" "JW0870" "JW0897" "JW0916" "JW0919" "JW1104" "JW1109"
##  [49] "JW1110" "JW1111" "JW1112" "JW1235" "JW1236" "JW1237" "JW1238" "JW1239"
##  [57] "JW1283" "JW1284" "JW1285" "JW1286" "JW1287" "JW1311" "JW1322" "JW1506"
##  [65] "JW1507" "JW1508" "JW1509" "JW1699" "JW1701" "JW1847" "JW1848" "JW1887"
##  [73] "JW1888" "JW1889" "JW1902" "JW1903" "JW1905" "JW2116" "JW2117" "JW2118"
##  [81] "JW2119" "JW2135" "JW2136" "JW2137" "JW2165" "JW2166" "JW2167" "JW2168"
##  [89] "JW2186" "JW2187" "JW2188" "JW2199" "JW2303" "JW2304" "JW2305" "JW2306"
##  [97] "JW2307" "JW2415" "JW2416" "JW2417" "JW2418" "JW2530" "JW2531" "JW2532"
## [105] "JW2652" "JW2653" "JW2654" "JW3159" "JW3160" "JW3161" "JW3162" "JW3168"
## [113] "JW3236" "JW3239" "JW3415" "JW3416" "JW3417" "JW3418" "JW3419" "JW3420"
## [121] "JW3421" "JW3422" "JW3423" "JW3425" "JW3427" "JW3428" "JW3441" "JW3442"
## [129] "JW3443" "JW3444" "JW3445" "JW3509" "JW3510" "JW3511" "JW3512" "JW3513"
## [137] "JW3538" "JW3539" "JW3540" "JW3635" "JW3703" "JW3704" "JW3705" "JW3706"
## [145] "JW3728" "JW3729" "JW3730" "JW3888" "JW3992" "JW3993" "JW3994" "JW3995"
## [153] "JW4047" "JW4048" "JW4049" "JW4066" "JW4067" "JW4186" "JW4218" "JW4247"
## [161] "JW4248" "JW4249" "JW4250" "JW5061" "JW5092" "JW5111" "JW5121" "JW5161"
## [169] "JW5162" "JW5242" "JW5366" "JW5535" "JW5544" "JW5545" "JW5752" "JW5753"
## [177] "JW5754" "JW5760" "JW5818" "JW5831" "JW5857" "JW5897"
## 
## $`Acarbose and validamycin biosynthesis`
## [1] "JW2024" "JW2026" "JW3763" "JW5598"
## 
## $`Alanine, aspartate and glutamate metabolism`
##  [1] "JW0030" "JW0031" "JW0474" "JW0660" "JW0812" "JW0911" "JW0999" "JW1117"
##  [9] "JW1295" "JW1488" "JW1517" "JW1750" "JW1756" "JW2287" "JW2309" "JW2558"
## [17] "JW2636" "JW2637" "JW2924" "JW3140" "JW3179" "JW3180" "JW3485" "JW3707"
## [25] "JW3722" "JW3841" "JW3932" "JW4099" "JW4135" "JW4203" "JW4204" "JW5019"
## [33] "JW5247"
## 
## $`Amino sugar and nucleotide sugar metabolism`
##  [1] "JW0264" "JW0385" "JW0663" "JW0664" "JW0665" "JW0675" "JW0740" "JW0741"
##  [9] "JW0742" "JW1087" "JW1093" "JW1105" "JW1224" "JW1605" "JW1613" "JW1632"
## [17] "JW1806" "JW1807" "JW1808" "JW2010" "JW2021" "JW2027" "JW2033" "JW2034"
## [25] "JW2037" "JW2038" "JW2248" "JW2249" "JW2250" "JW2385" "JW2410" "JW2421"
## [33] "JW2422" "JW3143" "JW3156" "JW3192" "JW3194" "JW3300" "JW3393" "JW3707"
## [41] "JW3708" "JW3940" "JW3985" "JW5372" "JW5538" "JW5599" "JW5600"
## 
## $`Aminoacyl-tRNA biosynthesis`
##   [1] "JW0024"  "JW0190"  "JW0515"  "JW0637"  "JW0666"  "JW0876"  "JW0913" 
##   [8] "JW1629"  "JW1703"  "JW1709"  "JW1855"  "JW1865"  "JW2101"  "JW2395" 
##  [15] "JW2498"  "JW2667"  "JW2858"  "JW3249"  "JW3347"  "JW3530"  "JW3531" 
##  [22] "JW3564"  "JW4090"  "JW4215"  "JW5277"  "JWR0002" "JWR0003" "JWR0006"
##  [29] "JWR0007" "JWR0008" "JWR0010" "JWR0011" "JWR0012" "JWR0013" "JWR0014"
##  [36] "JWR0015" "JWR0016" "JWR0017" "JWR0018" "JWR0019" "JWR0020" "JWR0021"
##  [43] "JWR0022" "JWR0023" "JWR0024" "JWR0025" "JWR0026" "JWR0027" "JWR0028"
##  [50] "JWR0029" "JWR0031" "JWR0032" "JWR0033" "JWR0034" "JWR0035" "JWR0037"
##  [57] "JWR0038" "JWR0039" "JWR0040" "JWR0041" "JWR0042" "JWR0044" "JWR0045"
##  [64] "JWR0046" "JWR0047" "JWR0048" "JWR0049" "JWR0050" "JWR0053" "JWR0056"
##  [71] "JWR0057" "JWR0058" "JWR0059" "JWR0060" "JWR0061" "JWR0066" "JWR0068"
##  [78] "JWR0069" "JWR0071" "JWR0072" "JWR0074" "JWR0077" "JWR0078" "JWR0080"
##  [85] "JWR0081" "JWR0083" "JWR0086" "JWR0087" "JWR0088" "JWR0089" "JWR0090"
##  [92] "JWR0091" "JWR0093" "JWR0094" "JWR0100" "JWR0103" "JWR0105" "JWR0106"
##  [99] "JWR0108" "JWR0111" "JWR0112" "JWR0113" "JWR0114" "JWR0115" "JWR0206"
## [106] "JWR0207" "JWR0208" "JWR0213" "JWR0221" "JWR0222" "JWR0223"

Packages & Session information

The funOmics package was developed for R version >= 4.0.3. However, BioConductor release 3.19 runs on R-4.4. See session information and loaded packages below:

sI <- sessionInfo()
print(sI, locale = FALSE)
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] funOmics_0.99.6     bigmemory_4.6.4     Biobase_2.64.0     
## [4] BiocGenerics_0.50.0
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.44.0         gtable_0.3.5            xfun_0.43              
##  [4] bslib_0.7.0             ggplot2_3.5.1           vctrs_0.6.5            
##  [7] tools_4.4.0             generics_0.1.3          curl_5.2.1             
## [10] stats4_4.4.0            parallel_4.4.0          tibble_3.2.1           
## [13] fansi_1.0.6             AnnotationDbi_1.66.0    RSQLite_2.3.6          
## [16] cluster_2.1.6           blob_1.2.4              R.oo_1.26.0            
## [19] pkgconfig_2.0.3         RColorBrewer_1.1-3      rngtools_1.5.2         
## [22] S4Vectors_0.42.0        uuid_1.2-0              lifecycle_1.0.4        
## [25] GenomeInfoDbData_1.2.12 stringr_1.5.1           compiler_4.4.0         
## [28] Biostrings_2.72.0       munsell_0.5.1           codetools_0.2-20       
## [31] GenomeInfoDb_1.40.0     htmltools_0.5.8.1       sass_0.4.9             
## [34] yaml_2.3.8              pillar_1.9.0            crayon_1.5.2           
## [37] jquerylib_0.1.4         cachem_1.0.8            org.Hs.eg.db_3.19.1    
## [40] iterators_1.0.14        foreach_1.5.2           princurve_2.1.6        
## [43] tidyselect_1.2.1        digest_0.6.35           stringi_1.8.4          
## [46] reshape2_1.4.4          dplyr_1.1.4             fastmap_1.1.1          
## [49] grid_4.4.0              colorspace_2.1-0        cli_3.6.2              
## [52] magrittr_2.0.3          utf8_1.2.4              bigmemory.sri_0.1.8    
## [55] withr_3.0.0             scales_1.3.0            UCSC.utils_1.0.0       
## [58] bit64_4.0.5             registry_0.5-1          rmarkdown_2.26         
## [61] XVector_0.44.0          httr_1.4.7              bit_4.0.5              
## [64] R.methodsS3_1.8.2       png_0.1-8               memoise_2.0.1          
## [67] evaluate_0.23           knitr_1.46              IRanges_2.38.0         
## [70] doParallel_1.0.17       NMF_0.27                rlang_1.1.3            
## [73] Rcpp_1.0.12             gridBase_0.4-7          glue_1.7.0             
## [76] DBI_1.2.2               pathifier_1.42.0        jsonlite_1.8.8         
## [79] plyr_1.8.9              R6_2.5.1                zlibbioc_1.50.0

Real-world data: Where to find molecular sets beyond KEGG?

The package funOmics interoperates with KEGGREST to retrieve molecular sets from the KEGG through the function get_kegg_sets (see description and example above). Other real-world molecular sets can be downloaded from several sources. In terms of gene sets, the Gene Ontology is a versatile resource that covers three domains: cellular components, biological processes and molecular functions. Reactome pathways can also be used to generate higher-level functional representations from omics data. Explore the different releases and download the corresponding gene sets for the different types of GO terms, and reactome pathways here. You can also aggregate genes into protein complexes, which you can find in the CORUM database.

Regarding other omics types, such as metabolomics, the function summarize_pathway_level can be applied in a similar manner to a metabolomics matrix X and KEGG metabolic pathways. Metabolite sets from KEGG pathways can also be downloaded with the KEGG API.

After obtaining the molecular sets information, this data has to be formatted as a list of lists (similar to what is obtained from the get_kegg_sets function). In other words, you need a structure where you have a list of multiple molecular sets names, and each of these sets is represented as a list of molecule identifiers, such as entrez IDs, PubChem CIDs, Uniprot IDs, etc. For instance, let’s retrieve gene sets from GO terms for cellular compartments. The information can be downloaded from the msigdb link or accessed programmatically as follows:

goccdb <- "https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.5/c5.go.cc.v7.5.entrez.gmt"
downdb <- sapply(readLines(goccdb), function(x) strsplit(x, "\t")[[1]])
gocc <- sapply(as.matrix(downdb), function(x) x[3:length(x)])
names(gocc) = sapply(as.matrix(downdb), function(x) x[1])
gocc[1:3]
## $GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX
##  [1] "1069" "1161" "2067" "2071" "2072" "2073" "5424" "5887" "7507" "7508"
## [11] "7515"
## 
## $GOCC_HISTONE_DEACETYLASE_COMPLEX
##  [1] "10013"  "10014"  "10284"  "10428"  "10467"  "10524"  "10629"  "10847" 
##  [9] "10856"  "10902"  "10933"  "1107"   "1108"   "116092" "1457"   "221037"
## [17] "23186"  "23309"  "23314"  "23468"  "25855"  "25942"  "26038"  "283248"
## [25] "3065"   "3066"   "3094"   "3622"   "473"    "51317"  "51564"  "51780" 
## [33] "53615"  "54556"  "54815"  "55758"  "55806"  "55809"  "55818"  "55869" 
## [41] "55929"  "57459"  "57504"  "57634"  "57649"  "58516"  "5928"   "5931"  
## [49] "64426"  "64431"  "6907"   "7764"   "79595"  "79685"  "79718"  "79885" 
## [57] "81611"  "8204"   "8295"   "83933"  "84215"  "84312"  "8607"   "8819"  
## [65] "8841"   "90665"  "9112"   "91748"  "9219"   "9611"   "9734"   "9759"  
## 
## $GOCC_SAGA_COMPLEX
##  [1] "100130302" "10474"     "10629"     "112869"    "117143"    "170067"   
##  [7] "23326"     "2648"      "27097"     "55578"     "56943"     "56970"    
## [13] "6871"      "6878"      "6880"      "6881"      "6883"      "8295"     
## [19] "8464"      "8850"      "9913"

Now let’s simulate a new gene expression matrix X, where gene IDs are codes between 1:10000 (to match entrez IDs), and summarize_pathway_level can be applied:

X_expr <- matrix(abs(rnorm(g * s)), nrow = g, dimnames = list(1:g, paste0("s", 1:s)))
sd_gocc_expr <- summarize_pathway_level(X_expr, gocc, type="mds")
## 
## 1006 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## iteration 400
## iteration 500
## iteration 600
## iteration 700
## iteration 800
## iteration 900
## iteration 1000
## 484 successful functional aggregations over minsize
## 522 failed functional aggregations under minsize
head(sd_gocc_expr)
##                                                 s1         s2         s3
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX  0.0462303 -0.6140595  0.2891945
## GOCC_HISTONE_DEACETYLASE_COMPLEX        -1.6204184 -0.8853372 -0.8441893
## GOCC_SAGA_COMPLEX                        0.2054268  0.6999517  1.0480322
## GOCC_GOLGI_MEMBRANE                      1.9815854 -0.5086268  1.6709114
## GOCC_UBIQUITIN_LIGASE_COMPLEX            1.8694237 -0.2358459  0.6526294
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   -0.7112492 -0.1497512 -1.5497636
##                                                 s4         s5          s6
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX -0.8756909  0.4240243 -0.94606422
## GOCC_HISTONE_DEACETYLASE_COMPLEX         0.6049920  0.2855740 -0.01321807
## GOCC_SAGA_COMPLEX                       -1.7824131 -1.1354220  0.82292275
## GOCC_GOLGI_MEMBRANE                      2.4324265 -1.7061610 -0.73507657
## GOCC_UBIQUITIN_LIGASE_COMPLEX           -0.2676045  2.6318200  1.68601711
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   -0.1830470 -1.3865100 -0.44164205
##                                                 s7         s8          s9
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX -2.4570975  0.5727501 -0.07099186
## GOCC_HISTONE_DEACETYLASE_COMPLEX         1.0430082  0.2646195 -1.10033657
## GOCC_SAGA_COMPLEX                        0.2526233  0.9617341  0.75911755
## GOCC_GOLGI_MEMBRANE                      1.3060649 -1.5630098 -0.04296835
## GOCC_UBIQUITIN_LIGASE_COMPLEX           -1.3591951 -1.5387335 -1.23310884
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX    0.2096453  1.1757629  0.68596523
##                                                s10        s11         s12
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX -0.7165690 -0.2332708 -0.06910369
## GOCC_HISTONE_DEACETYLASE_COMPLEX         0.4220277  0.1309214  1.30808763
## GOCC_SAGA_COMPLEX                       -0.7395053  0.6318837  0.21086370
## GOCC_GOLGI_MEMBRANE                     -2.8041600 -2.1877036 -0.64993252
## GOCC_UBIQUITIN_LIGASE_COMPLEX            2.8932146 -0.3182926 -1.24465226
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   -0.9007953  0.3328987  0.87804574
##                                                s13        s14         s15
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX  1.2496402  1.4496191  0.05127479
## GOCC_HISTONE_DEACETYLASE_COMPLEX        -1.5681261  2.1238127 -0.89555776
## GOCC_SAGA_COMPLEX                        0.1166933 -1.6341811 -0.66533764
## GOCC_GOLGI_MEMBRANE                      1.1191507  2.5492052  0.03721921
## GOCC_UBIQUITIN_LIGASE_COMPLEX           -0.3082762 -0.8082058  0.40497635
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX    0.7070152  0.1573891 -1.66223909
##                                                s16        s17        s18
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX  1.1222140 -0.0507457 -0.2654060
## GOCC_HISTONE_DEACETYLASE_COMPLEX        -1.3111879 -0.2407197  1.6220072
## GOCC_SAGA_COMPLEX                       -0.7823400  0.8853010 -1.2406977
## GOCC_GOLGI_MEMBRANE                      7.9189684 -2.9951731 -1.0774628
## GOCC_UBIQUITIN_LIGASE_COMPLEX            1.0485494  1.5836014 -0.3060758
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX   -0.4851415 -1.2569918  1.1589200
##                                                s19        s20
## GOCC_NUCLEOTIDE_EXCISION_REPAIR_COMPLEX  0.6041332  0.4899186
## GOCC_HISTONE_DEACETYLASE_COMPLEX        -0.1013482  0.7753888
## GOCC_SAGA_COMPLEX                        0.1340518  1.2512948
## GOCC_GOLGI_MEMBRANE                      0.4618573 -5.2071145
## GOCC_UBIQUITIN_LIGASE_COMPLEX           -0.6981583 -4.4520832
## GOCC_NUCLEAR_UBIQUITIN_LIGASE_COMPLEX    0.5884481  2.8330405

GO cellular compartments level expression signatures have been generated via dimension-reduction multi-dimensional scaling (mds) in this case. Note that these are just some examples, and you can apply similar procedures for other types of molecular sets. The package funOmics is conceived to be flexible across omics types and types of molecular sets, so you can also tailor or directly create your own list of molecular sets based on specific criteria of your experiments (e.g., include only protein complexes involved in ubiquitination, or define ad hoc metabolic routes involving specific metabolites).

A real-world example with real-world data: Summarizing omics data in SummarizedExperiment format into KEGG pathway level functional activity scores with NMF dimension-reduction aggregation

The examples in the previous sections run on simulated gene expression data. Now that you know where and how to obtain real-world molecular sets, let’s see how to use the funOmics package on a real-world omics dataset. This is an illustrated example of applying the function summarize_pathway_level to aggregate data in SummarizedExperiment format (instead of dataframe or matrix format as shown in previous examples 1 and 2) into NMF (Non-Negative Matrix Factorization) dimension-reduction derived activity scores at KEGG pathway level.

The NMF-aggregated activity scores values represent the weight (or contribution) of a single underlying basis component or latent factor contributing to the pathway activity (or higher level functional structure in use) for each sample in your data set. Note that rank=1 is used for the basis matrix in the internal NMF dimension-reduction.

Let’s first get an example dataset stored as a SummarizedExperiment from the airway package. This data represents an actual RNA sequencing experiment on four human airway smooth muscle cell lines.

library(SummarizedExperiment)
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
## 
##     rowMedians
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
library(airway)
data(airway)
airway
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

The measurement data can be accessed by assay and assays. Note that SummarizedExperiment object can contain multiple measurement matrices (all of the same dimension), but in this case airway contains only one matrix of RNA sequencing data named counts:

assayNames(airway)
## [1] "counts"
head(assay(airway, "counts"))
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0
dim(assay(airway, "counts"))
## [1] 63677     8

The data matrix contains 63677 genes (or transcripts) and 8 samples. The features names are Ensembl identifiers, let’s get a list of KEGG gene sets with Ensembl IDs through the function get_kegg_sets provided by funOmics package. Note that get_kegg_sets can be used to retrieve a list of KEGG gene sets from any organism available, given its abbreviation (e.g., “hsa” for Homo sapiens or “ecj” for Escherichia coli).

Since airway data corresponds to human samples, the parameter geneid_type in get_kegg_sets can be used to retrieve the molecular sets with Ensembl IDs, and the organism is set to default (“hsa”):

kegg_sets <- get_kegg_sets(geneid_type = "ensembl")
head(kegg_sets)
## $`2-Oxocarboxylic acid metabolism`
##  [1] "ENSG00000114786" "ENSG00000169154" "ENSG00000062485" "ENSG00000161653"
##  [5] "ENSG00000137992" "ENSG00000150768" "ENSG00000091140" "ENSG00000119689"
##  [9] "ENSG00000172482" "ENSG00000120053" "ENSG00000125166" "ENSG00000167701"
## [13] "ENSG00000138413" "ENSG00000182054" "ENSG00000166411" "ENSG00000101365"
## [17] "ENSG00000067829" "ENSG00000122729" "ENSG00000105953" "ENSG00000100412"
## [21] "ENSG00000109576" "ENSG00000131828" "ENSG00000163114" "ENSG00000168291"
## [25] "ENSG00000181192" "ENSG00000197444" "ENSG00000060982" "ENSG00000105552"
## [29] "ENSG00000248098" "ENSG00000083123" "ENSG00000113492" "ENSG00000166123"
## [33] "ENSG00000243989"
## 
## $`ABC transporters`
##  [1] "ENSG00000114770" "ENSG00000115657" "ENSG00000069431" "ENSG00000125257"
##  [5] "ENSG00000064687" "ENSG00000154263" "ENSG00000154258" "ENSG00000141338"
##  [9] "ENSG00000001626" "ENSG00000197150" "ENSG00000023839" "ENSG00000179869"
## [13] "ENSG00000164825" "ENSG00000165029" "ENSG00000107331" "ENSG00000167972"
## [17] "ENSG00000101986" "ENSG00000131269" "ENSG00000173208" "ENSG00000135776"
## [21] "ENSG00000150967" "ENSG00000154262" "ENSG00000154265" "ENSG00000198691"
## [25] "ENSG00000144452" "ENSG00000004846" "ENSG00000091262" "ENSG00000103222"
## [29] "ENSG00000085563" "ENSG00000005471" "ENSG00000117528" "ENSG00000119688"
## [33] "ENSG00000172350" "ENSG00000138075" "ENSG00000143921" "ENSG00000006071"
## [37] "ENSG00000168394" "ENSG00000204267" "ENSG00000121270" "ENSG00000073734"
## [41] "ENSG00000108846" "ENSG00000124574" "ENSG00000140798" "ENSG00000118777"
## [45] "ENSG00000160179"
## 
## $`AGE-RAGE signaling pathway in diabetic complications`
##   [1] "ENSG00000117020" "ENSG00000135446" "ENSG00000111276" "ENSG00000161714"
##   [5] "ENSG00000108821" "ENSG00000164692" "ENSG00000168542" "ENSG00000187498"
##   [9] "ENSG00000134871" "ENSG00000169031" "ENSG00000081052" "ENSG00000188153"
##  [13] "ENSG00000197565" "ENSG00000112062" "ENSG00000165168" "ENSG00000131504"
##  [17] "ENSG00000204305" "ENSG00000135744" "ENSG00000144891" "ENSG00000078401"
##  [21] "ENSG00000120738" "ENSG00000142208" "ENSG00000105221" "ENSG00000117525"
##  [25] "ENSG00000165197" "ENSG00000150907" "ENSG00000182621" "ENSG00000115414"
##  [29] "ENSG00000007952" "ENSG00000174775" "ENSG00000090339" "ENSG00000115008"
##  [33] "ENSG00000125538" "ENSG00000136244" "ENSG00000169429" "ENSG00000096968"
##  [37] "ENSG00000177606" "ENSG00000133703" "ENSG00000175387" "ENSG00000166949"
##  [41] "ENSG00000141646" "ENSG00000087245" "ENSG00000131196" "ENSG00000109320"
##  [45] "ENSG00000164867" "ENSG00000213281" "ENSG00000086991" "ENSG00000106366"
##  [49] "ENSG00000138193" "ENSG00000121879" "ENSG00000051382" "ENSG00000137193"
##  [53] "ENSG00000171608" "ENSG00000145675" "ENSG00000105647" "ENSG00000137841"
##  [57] "ENSG00000149782" "ENSG00000101333" "ENSG00000187091" "ENSG00000124181"
##  [61] "ENSG00000197943" "ENSG00000154229" "ENSG00000166501" "ENSG00000163932"
##  [65] "ENSG00000171132" "ENSG00000067606" "ENSG00000100030" "ENSG00000102882"
##  [69] "ENSG00000107643" "ENSG00000185386" "ENSG00000050748" "ENSG00000109339"
##  [73] "ENSG00000156711" "ENSG00000087088" "ENSG00000136238" "ENSG00000110092"
##  [77] "ENSG00000171791" "ENSG00000173039" "ENSG00000188130" "ENSG00000108691"
##  [81] "ENSG00000007908" "ENSG00000115415" "ENSG00000168610" "ENSG00000126561"
##  [85] "ENSG00000173757" "ENSG00000105329" "ENSG00000092969" "ENSG00000119699"
##  [89] "ENSG00000106799" "ENSG00000163513" "ENSG00000178726" "ENSG00000232810"
##  [93] "ENSG00000162692" "ENSG00000112715" "ENSG00000173511" "ENSG00000150630"
##  [97] "ENSG00000164305" "ENSG00000115556" "ENSG00000117461" "ENSG00000070831"
## 
## $`AMPK signaling pathway`
##   [1] "ENSG00000117020" "ENSG00000107175" "ENSG00000110931" "ENSG00000001626"
##   [5] "ENSG00000084733" "ENSG00000109819" "ENSG00000176194" "ENSG00000169169"
##   [9] "ENSG00000110090" "ENSG00000205560" "ENSG00000118260" "ENSG00000120907"
##  [13] "ENSG00000143578" "ENSG00000167658" "ENSG00000187840" "ENSG00000066044"
##  [17] "ENSG00000160741" "ENSG00000142208" "ENSG00000105221" "ENSG00000169710"
##  [21] "ENSG00000165140" "ENSG00000150907" "ENSG00000118689" "ENSG00000065882"
##  [25] "ENSG00000096717" "ENSG00000103150" "ENSG00000198793" "ENSG00000131482"
##  [29] "ENSG00000167393" "ENSG00000103319" "ENSG00000104812" "ENSG00000111713"
##  [33] "ENSG00000278540" "ENSG00000113161" "ENSG00000101076" "ENSG00000076555"
##  [37] "ENSG00000017427" "ENSG00000140443" "ENSG00000254647" "ENSG00000171105"
##  [41] "ENSG00000169047" "ENSG00000174697" "ENSG00000116678" "ENSG00000079435"
##  [45] "ENSG00000167461" "ENSG00000124253" "ENSG00000100889" "ENSG00000159346"
##  [49] "ENSG00000106617" "ENSG00000119396" "ENSG00000140992" "ENSG00000135932"
##  [53] "ENSG00000158571" "ENSG00000123836" "ENSG00000170525" "ENSG00000114268"
##  [57] "ENSG00000141959" "ENSG00000152556" "ENSG00000067057" "ENSG00000121879"
##  [61] "ENSG00000051382" "ENSG00000171608" "ENSG00000145675" "ENSG00000105647"
##  [65] "ENSG00000115592" "ENSG00000132170" "ENSG00000092020" "ENSG00000113575"
##  [69] "ENSG00000104695" "ENSG00000105568" "ENSG00000137713" "ENSG00000221914"
##  [73] "ENSG00000156475" "ENSG00000074211" "ENSG00000073711" "ENSG00000066027"
##  [77] "ENSG00000068971" "ENSG00000078304" "ENSG00000112640" "ENSG00000154001"
##  [81] "ENSG00000082146" "ENSG00000132356" "ENSG00000162409" "ENSG00000111725"
##  [85] "ENSG00000131791" "ENSG00000181929" "ENSG00000175470" "ENSG00000141564"
##  [89] "ENSG00000152254" "ENSG00000104388" "ENSG00000110092" "ENSG00000106615"
##  [93] "ENSG00000108443" "ENSG00000175634" "ENSG00000099194" "ENSG00000182158"
##  [97] "ENSG00000181856" "ENSG00000072310" "ENSG00000118046" "ENSG00000135341"
## [101] "ENSG00000165699" "ENSG00000103197" "ENSG00000006831" "ENSG00000145284"
## [105] "ENSG00000102547" "ENSG00000177169" "ENSG00000204673" "ENSG00000060566"
## [109] "ENSG00000133124" "ENSG00000117461" "ENSG00000185950" "ENSG00000130957"
## [113] "ENSG00000145386" "ENSG00000133101" "ENSG00000157613" "ENSG00000185236"
## [117] "ENSG00000266173" "ENSG00000141349" "ENSG00000181092" "ENSG00000135218"
## [121] "ENSG00000146592"
## 
## $`ATP-dependent chromatin remodeling`
##   [1] "ENSG00000167797" "ENSG00000187778" "ENSG00000106400" "ENSG00000172977"
##   [5] "ENSG00000080603" "ENSG00000183207" "ENSG00000112983" "ENSG00000185787"
##   [9] "ENSG00000170004" "ENSG00000111642" "ENSG00000076108" "ENSG00000198604"
##  [13] "ENSG00000229674" "ENSG00000258315" "ENSG00000153391" "ENSG00000189079"
##  [17] "ENSG00000171634" "ENSG00000164508" "ENSG00000112624" "ENSG00000184402"
##  [21] "ENSG00000135999" "ENSG00000099954" "ENSG00000169592" "ENSG00000166164"
##  [25] "ENSG00000105619" "ENSG00000123636" "ENSG00000063169" "ENSG00000277075"
##  [29] "ENSG00000196866" "ENSG00000188486" "ENSG00000164032" "ENSG00000116478"
##  [33] "ENSG00000196591" "ENSG00000184270" "ENSG00000230797" "ENSG00000277858"
##  [37] "ENSG00000274183" "ENSG00000170322" "ENSG00000116750" "ENSG00000077080"
##  [41] "ENSG00000048649" "ENSG00000071655" "ENSG00000148229" "ENSG00000104472"
##  [45] "ENSG00000071243" "ENSG00000128908" "ENSG00000167491" "ENSG00000114933"
##  [49] "ENSG00000163939" "ENSG00000101189" "ENSG00000130024" "ENSG00000099284"
##  [53] "ENSG00000246705" "ENSG00000178028" "ENSG00000143614" "ENSG00000049618"
##  [57] "ENSG00000057935" "ENSG00000183495" "ENSG00000162521" "ENSG00000102054"
##  [61] "ENSG00000133884" "ENSG00000075624" "ENSG00000110987" "ENSG00000075089"
##  [65] "ENSG00000163875" "ENSG00000102038" "ENSG00000080503" "ENSG00000127616"
##  [69] "ENSG00000099956" "ENSG00000028310" "ENSG00000173473" "ENSG00000139613"
##  [73] "ENSG00000066117" "ENSG00000108604" "ENSG00000082014" "ENSG00000073584"
##  [77] "ENSG00000141380" "ENSG00000163159" "ENSG00000184009" "ENSG00000288859"
##  [81] "ENSG00000100811" "ENSG00000101442" "ENSG00000120616" "ENSG00000127337"
##  [85] "ENSG00000111328" "ENSG00000205683" "ENSG00000011332" "ENSG00000117713"
##  [89] "ENSG00000196367" "ENSG00000196747" "ENSG00000275221" "ENSG00000276368"
##  [93] "ENSG00000276903" "ENSG00000180573" "ENSG00000278463" "ENSG00000278677"
##  [97] "ENSG00000288825" "ENSG00000184260" "ENSG00000115274" "ENSG00000277745"
## [101] "ENSG00000156531" "ENSG00000153147" "ENSG00000274997" "ENSG00000136518"
## [105] "ENSG00000175792" "ENSG00000134046" "ENSG00000196787" "ENSG00000009954"
## [109] "ENSG00000182979" "ENSG00000149480" "ENSG00000099385" "ENSG00000106635"
## [113] "ENSG00000181218" "ENSG00000113812" "ENSG00000105968" "ENSG00000113648"
## [117] "ENSG00000123562"
## 
## $`Acute myeloid leukemia`
##  [1] "ENSG00000117020" "ENSG00000245848" "ENSG00000092067" "ENSG00000102096"
##  [5] "ENSG00000213341" "ENSG00000182578" "ENSG00000164400" "ENSG00000139318"
##  [9] "ENSG00000187840" "ENSG00000142208" "ENSG00000105221" "ENSG00000150337"
## [13] "ENSG00000122025" "ENSG00000198793" "ENSG00000177885" "ENSG00000174775"
## [17] "ENSG00000104365" "ENSG00000164399" "ENSG00000169896" "ENSG00000078061"
## [21] "ENSG00000173801" "ENSG00000157404" "ENSG00000133703" "ENSG00000005381"
## [25] "ENSG00000136997" "ENSG00000109320" "ENSG00000213281" "ENSG00000138795"
## [29] "ENSG00000121879" "ENSG00000051382" "ENSG00000137193" "ENSG00000171608"
## [33] "ENSG00000145675" "ENSG00000105647" "ENSG00000140464" "ENSG00000112033"
## [37] "ENSG00000100030" "ENSG00000102882" "ENSG00000169032" "ENSG00000126934"
## [41] "ENSG00000002330" "ENSG00000132155" "ENSG00000131759" "ENSG00000110092"
## [45] "ENSG00000140379" "ENSG00000173039" "ENSG00000108443" "ENSG00000175634"
## [49] "ENSG00000115904" "ENSG00000100485" "ENSG00000066336" "ENSG00000157764"
## [53] "ENSG00000168610" "ENSG00000126561" "ENSG00000173757" "ENSG00000081059"
## [57] "ENSG00000148737" "ENSG00000109906" "ENSG00000152284" "ENSG00000117461"
## [61] "ENSG00000269335" "ENSG00000159216" "ENSG00000079102" "ENSG00000132326"
## [65] "ENSG00000145386" "ENSG00000133101" "ENSG00000170458"

Now, let’s summarize the counts data using NMF data. For this example, let’s use the default minimum size of sets (10). Note that when default minsize is used, it is not necessary to use this parameter in the function call:

pathway_activity <- summarize_pathway_level(assay(airway, "counts"), kegg_sets, type = "sd") # note that the NMF operation can take some minutes
## 
## 359 functional sets read.
## iteration 100
## iteration 200
## iteration 300
## 343 successful functional aggregations over minsize
## 16 failed functional aggregations under minsize
print(paste("Pathway activity score matrix has dimensions:", nrow(pathway_activity), ",", ncol(pathway_activity)))
## [1] "Pathway activity score matrix has dimensions: 343 , 8"
head(pathway_activity)
##                                                      SRR1039508 SRR1039509
## 2-Oxocarboxylic acid metabolism                        897.8148   933.7587
## ABC transporters                                      1343.7024  1247.1723
## AGE-RAGE signaling pathway in diabetic complications 41059.3414 27919.9726
## AMPK signaling pathway                                6261.0234  5764.0069
## ATP-dependent chromatin remodeling                    6221.1335  7862.9219
## Acute myeloid leukemia                                1540.3614  1571.4687
##                                                      SRR1039512 SRR1039513
## 2-Oxocarboxylic acid metabolism                        1176.073   700.9372
## ABC transporters                                       1700.936  1034.8811
## AGE-RAGE signaling pathway in diabetic complications  61813.278 31134.0064
## AMPK signaling pathway                                 7173.520  3452.9315
## ATP-dependent chromatin remodeling                     8738.562  8199.3336
## Acute myeloid leukemia                                 2372.710  1613.5125
##                                                      SRR1039516 SRR1039517
## 2-Oxocarboxylic acid metabolism                        1235.497   1476.768
## ABC transporters                                       2075.315   2100.908
## AGE-RAGE signaling pathway in diabetic complications  44791.910  50360.551
## AMPK signaling pathway                                 6498.032   6844.707
## ATP-dependent chromatin remodeling                     6647.548  19701.660
## Acute myeloid leukemia                                 2178.260   3040.191
##                                                      SRR1039520 SRR1039521
## 2-Oxocarboxylic acid metabolism                        818.0049   958.0687
## ABC transporters                                      1482.2698  1751.8586
## AGE-RAGE signaling pathway in diabetic complications 46530.1838 42718.1856
## AMPK signaling pathway                                5062.8806  5557.3168
## ATP-dependent chromatin remodeling                    6034.5251  9437.6834
## Acute myeloid leukemia                                1703.6101  2032.9305

The resulting matrix of pathway-level activity scores can be further analyzed as an independent dataset, or can also be integrated with the airway SummarizedExperiment object in a MultiAssayExperiment structure (note that SummarizedExperiment can simultaneously manage several experimental assays only if they have the same dimensions, which is not the case here, hence the need for a MultiAssayExperiment object). The MultiAssayExperiment library has to be loaded, and a MultiAssayExperiment (airwayMultiAssay) can be created and filled with a list of assays-like matrices that may have different dimensions. Here, airwayMultiAssay contains the counts and the recently generated KEGG pathway activity scores by standard deviation pooling.

library(MultiAssayExperiment)
assays_list <- list( counts = assay(airway, "counts"), kegg_sd_agg = pathway_activity)
airwayMultiAssay <- MultiAssayExperiment(experiments=assays_list)
colData(airwayMultiAssay) <- colData(airway)
airwayMultiAssay
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] counts: matrix with 63677 rows and 8 columns
##  [2] kegg_sd_agg: matrix with 343 rows and 8 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

In this illustrative example, the RNA sequencing data in airway package has directly been summarized or aggregated by the summarize_pathway_level function of the funOmics package without intermediate processing. Depending on the type of omics data, you may want to apply corresponding processing steps to the omics abundance matrix prior to the aggregation into higher level functional features. You can also “post-process” the resulting summarized matrix as you see appropriate for your analyses and workflows.

Contact Information

Feedback is very welcome! If you have any questions, issues, or suggestions for improving the funOmics package, please use the GitHub issues page or contact .

License

The funOmics package is released under the terms of the MIT License. See the LICENSE file for more details.