1 Introduction

Disease Ontology (DO) enrichment analysis is an effective means of discovering associations between genes and diseases. However, most DO-based enrichment methods cannot resolve the over-enrichment problem caused by the “true-path” rule. To address this problem, we propose a global weighted model termed EnrichDO. Based on the latest annotation of the human genome with DO terms, EnrichDO aims to identify locally significant enriched nodes by taking the DO graph topology into account and by assigning initial and dynamic weights to annotated genes. EnrichDO offers a variety of statistical models and visualization schemes for discovering disease-gene relationships in large biological data sets. EnrichDO is available on Bioconductor, and we anticipate that the package will make such analyses more convenient and reliable.

library(EnrichDO)
#> 

2 Disease weighted over-representation analysis

EnrichDO supports enrichment analysis of the Disease Ontology (DO). The ontology structure is used to construct a directed acyclic graph (DAG), and a weight reduction algorithm iterates over the DAG layer by layer to reduce the over-enrichment caused by the “true-path” rule. A variety of statistical models for the over-representation calculation and several P-value correction methods are also provided.
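To make the “true-path” rule concrete, the following minimal sketch propagates gene annotations from child terms to all of their ancestors on a three-node chain. It is an illustration only and not EnrichDO's internal code: the parent links are taken from the parent.arr values in the example output in section 2.1.2, while the gene labels (g1 to g4) and the helper `ancestors` are hypothetical.

```r
# Toy child -> parent map (one parent per term, for simplicity).
parent <- c("DOID:10652" = "DOID:680", "DOID:680" = "DOID:1289")

# Direct annotations; the gene labels are hypothetical.
direct <- list("DOID:10652" = c("g1", "g2", "g3"),
               "DOID:680"   = c("g4"),
               "DOID:1289"  = character(0))

# Hypothetical helper: walk up from a term and collect its ancestors.
ancestors <- function(term) {
  out <- character(0)
  while (term %in% names(parent)) {
    term <- parent[[term]]
    out <- c(out, term)
  }
  out
}

# True-path rule: every gene of a term is also annotated to its ancestors.
propagated <- direct
for (term in names(direct)) {
  for (anc in ancestors(term)) {
    propagated[[anc]] <- union(propagated[[anc]], direct[[term]])
  }
}

# Number of genes per term after propagation.
lengths(propagated)
```

After propagation, each ancestor carries every gene of its descendants, so a signal that originates in one specific term tends to reappear in its more general ancestors. This is the redundancy that a weighting scheme over the DAG tries to reduce.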

2.1 doEnrich function

In EnrichDO, the doEnrich function performs ontology enrichment analysis that takes the topological properties of the ontology graph into account.

2.1.1 Parameter introduction

doEnrich has ten parameters:

  • interestGenes is the set of protein-coding genes of interest, given as Entrez IDs.

  • test sets the statistical model for the over-representation calculation, which can be “fisherTest”, “hypergeomTest”, “binomTest”, “chisqTest” or “logoddTest” (default is “hypergeomTest”).

  • method sets the P-value correction method, which can be “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr” or “none” (default is “BH”).

  • m sets the maximum number of ancestor layers considered in ontology enrichment (default is 1).

  • maxGsize and minGsize set size limits: DO terms with more annotated genes than maxGsize or fewer than minGsize are ignored, and the P-value of these terms is set to 1 (default maxGsize is 5000, minGsize is 5).

  • traditional is a logical parameter: TRUE for traditional enrichment analysis, FALSE for enrichment analysis with weights (default is FALSE).

  • delta sets the significance threshold for nodes. If the P-value of a DO term is greater than delta, the node is not significant and is not weighted (default is 0.01).

  • result_do receives the data read from a file written by the writeResult function, so that the enrichment results can be visualized without running the enrichment again (default is NULL).

  • penalize is a logical value. If TRUE, the algorithm reduces the weight again for nodes that are not significant by comparison (default is TRUE).
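As a rough guide to what the test and method parameters refer to, the sketch below computes an unweighted over-representation P-value for a single term with base R and then applies a BH correction. It is not EnrichDO's implementation. The term size (160), overlap (9) and interest set size (13) are taken from the first row of the example output in section 2.1.2; the background size `N` and the three extra raw P-values are assumptions made for illustration, so the result is not expected to match the weighted P-value reported there.

```r
N <- 15000  # number of background genes (assumed for illustration)
M <- 160    # genes annotated to the term (gene.len)
k <- 13     # genes in the interest set (ig.len)
x <- 9      # interest genes annotated to the term (cg.len)

# Hypergeometric upper tail: probability of observing x or more overlaps.
p_hyper <- phyper(x - 1, M, N - M, k, lower.tail = FALSE)

# The same question as a one-sided Fisher's exact test on the 2x2 table
# (rows: in / not in interest set, columns: in / not in term).
tab <- matrix(c(x, M - x, k - x, N - M - (k - x)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Multiple-testing correction over several terms; the last three
# raw P-values are made up for this example.
p_raw <- c(p_hyper, 0.004, 0.03, 0.20)
p_bh  <- p.adjust(p_raw, method = "BH")
```

The one-sided Fisher test and the hypergeometric tail give the same P-value for this table, and the strings accepted by method are the same method names that base R's `p.adjust` accepts.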

2.1.2 Result description

In the following example, several genes (demo.data) are randomly selected from the protein-coding genes for analysis. All doEnrich parameters are set to their default values.

demo.data <- c(1636,351,102,2932,3077,348,4137,54209,5663,5328,23621,3416,3553)
doEnrich(interestGenes = demo.data, 
          test         = "hypergeomTest", 
          method       = "BH", 
          m            = 1, 
          maxGsize     = 5000,
          minGsize     = 5,
          traditional  = FALSE,
          delta        = 0.01, 
          result_do    = NULL,
          penalize     = TRUE)

# [1] "Descending rights test"
# LEVEL: 13 1 nodes 72 genes to be scored
# LEVEL: 12 2 nodes 457 genes to be scored
# LEVEL: 11 3 nodes 907 genes to be scored
# LEVEL: 10 13 nodes    2279 genes to be scored
# LEVEL: 9  54 nodes    6504 genes to be scored
# LEVEL: 8  130 nodes   9483 genes to be scored
# LEVEL: 7  198 nodes   11209 genes to be scored
# LEVEL: 6  220 nodes   12574 genes to be scored
# LEVEL: 5  198 nodes   12936 genes to be scored
# LEVEL: 4  103 nodes   12824 genes to be scored
# LEVEL: 3  30 nodes    11683 genes to be scored
# LEVEL: 2  5 nodes 8032 genes to be scored
# LEVEL: 1  0 nodes 0 genes to be scored
# [1] "BH"
# [1] "hypergeomTest"

The output above reports the number of nodes and genes scored at each level of the DAG, followed by the P-value correction method and the statistical test that were used.

By default, the enrichment result of the doEnrich function is stored in enrich. The return value can also be assigned to a different variable so that it is not overwritten when doEnrich is run again.

weight_result<-doEnrich(interestGenes = demo.data)

The results of the weighted enrichment analysis algorithm are as follows:

head(enrich)
#>           DOID level     gene.arr   weight.arr   parent.arr parent.len
#> 1 DOID:0080832     4 23607, 5.... 1, 1, 1,....    DOID:1561          1
#> 2    DOID:1307     4 5021, 67.... 0.9, 0.9....    DOID:1561          1
#> 3   DOID:10652     7 355, 965.... 1, 0.9, ....     DOID:680          1
#> 4   DOID:14330     7 5063, 66.... 0.8, 1, .... DOID:0050890          1
#> 5     DOID:680     6 355, 965.... 0.9, 0.8....    DOID:1289          1
#> 6 DOID:0081292     6 5660, 63.... 1, 1, 1,....     DOID:936          1
#>      child.arr child.len gene.len                    DOTerm       gene.w
#> 1                      0      160 mild cognitive impairment 1, 1, 1,....
#> 2 DOID:122....         3      753                  dementia 0.9, 0.9....
#> 3                      0     1388       Alzheimer's disease 1, 0.9, ....
#> 4 DOID:0060892         1      769       Parkinson's disease 0.8, 1, ....
#> 5 DOID:008....         2     1396                 tauopathy 0.740051....
#> 6                      0      174    traumatic brain injury 1, 1, 1,....
#>              p       cg.arr cg.len ig.len     p.adjust
#> 1 9.223038e-16 5663, 35....      9     13 4.439048e-12
#> 2 1.754624e-14 351, 413....     12     13 4.222502e-11
#> 3 3.107814e-14 3416, 56....     13     13 4.985970e-11
#> 4 3.892282e-13 2932, 34....     11     13 4.683388e-10
#> 5 2.813577e-12 3416, 56....     13     13 2.708349e-09
#> 6 3.859267e-11 351, 163....      7     13 3.095775e-08

doEnrich produces two data frames, enrich and doterms, which are written to the environment. enrich has 16 columns:

  • the DOterm ID on enrichment (DOID),

  • the hierarchy of the DOterm in the DAG graph (level),

  • all genes related to the DOterm (gene.arr),

  • gene weights in each node (weight.arr),

  • the parent nodes of the DOterm (parent.arr) and their number (parent.len),

  • child nodes of the DOterm (child.arr) and its number (child.len),

  • the number of all genes related to the DOterm (gene.len),

  • the standard name of the DOterm (DOTerm),

  • the weight of annotated genes (gene.w),

  • the P-value of the DOterm (p), by which the rows of enrich are ordered, and the corrected P-value (p.adjust),

  • the genes of interest annotated to this DOterm (cg.arr) and its number (cg.len),

  • the number of genes in the interest gene set (ig.len).

The data frame doterms contains the information of the disease ontology for DAG construction. doterms has ten columns including DOID, level, gene.arr, weight.arr, parent.arr, parent.len, child.arr, child.len, gene.len, and DOTerm.

head(doterms)
#>           DOID level     gene.arr   weight.arr   parent.arr parent.len
#> 1 DOID:0001816     7 7122, 20.... 1, 1, 1,....     DOID:175          1
#> 2 DOID:0002116     7 7442, 61.... 1, 1, 1,....   DOID:10124          1
#> 3 DOID:0014667     2 8772, 71.... 0.8, 0.9....       DOID:4          1
#> 4 DOID:0040001     9 3119, 31....      1, 1, 1 DOID:0060524          1
#> 5 DOID:0040083     8   4973, 5468         1, 1     DOID:874          1
#> 6 DOID:0040085     4 7099, 44.... 1, 1, 1,....     DOID:104          1
#>      child.arr child.len gene.len                DOTerm
#> 1                      0       56          angiosarcoma
#> 2                      0      105             pterygium
#> 3 DOID:006....         3     3513 disease of metabolism
#> 4                      0        3        shrimp allergy
#> 5                      0        2   Chlamydia pneumonia
#> 6                      0        6      bacterial sepsis

2.1.3 Multiple uses of the doEnrich function

1. Weighted enrichment analysis with multiple parameters. Every parameter in the following example applies to weighted enrichment analysis.

doEnrich(interestGenes= demo.data,
         test         = "hypergeomTest",
         method       = "holm",
         m            = 1,
         minGsize     = 5,
         maxGsize     = 500,
         delta        = 0.01,
         penalize     = TRUE)

2. The parameter penalize alleviates the impact of P-values of different magnitudes; its default value is TRUE. When it is set to FALSE, the weights of non-significant nodes are reduced less, so these nodes become slightly more significant, i.e., their P-values decrease.

doEnrich(interestGenes = demo.data, penalize = FALSE)

3. The traditional enrichment analysis method does not reduce weights according to the DAG structure. The parameters test, method, m, maxGsize and minGsize can still be set as needed.

doEnrich(demo.data, traditional = TRUE)

# [1] "Traditional test"
# [1] "BH"
# [1] "hypergeomTest"

2.2 writeDoTerms function

writeDoTerms can output DOID, DOTerm, level, genes, parents, children, gene.len, parent.len and child.len in the data frame doterms as text. The default file name is “doterms.txt”.

writeDoTerms(doterms,file = "doterms.txt")

2.3 writeResult function

The writeResult function can output DOID, DOTerm, p, p.adjust, geneRatio, bgRatio and cg in the data frame enrich as text. The default file name is “result.txt”.

geneRatio represents the intersection of the doterm with the interest set divided by the interest gene set, and bgRatio represents all genes of the doterm divided by the background gene set.
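The two ratios can be worked through by hand. The short sketch below uses the counts of the first row of the example output in section 2.1.2 (cg.len 9, ig.len 13, gene.len 160); the background size `bg_size` is an assumption, since the actual number of background genes is not stated here.

```r
cg_len   <- 9      # interest genes annotated to the term (cg.len)
ig_len   <- 13     # size of the interest gene set (ig.len)
gene_len <- 160    # all genes annotated to the term (gene.len)
bg_size  <- 15000  # size of the background gene set (assumed)

gene_ratio <- cg_len / ig_len      # share of the interest set hitting the term
bg_ratio   <- gene_len / bg_size   # share of the background annotated to the term

# A term is over-represented when gene_ratio is well above bg_ratio.
fold <- gene_ratio / bg_ratio
```

A large gap between the two ratios is what the over-representation tests quantify with a P-value.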

writeResult has four parameters. enrich is the enrichment result of doEnrich, and file is the path of the output file. The parameters Q and P are cutoffs: a doterm is written only when its p.adjust is less than or equal to Q and its p value is less than or equal to P. The default values of P and Q are 1.

writeResult(enrich,file = "result.txt",Q=1,P=1)
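The effect of the P and Q cutoffs can be sketched on a small data frame. The example below is an assumption about the filtering logic based on the description above, not the writeResult source code; the rows reuse the DOTerm, p and p.adjust values from the head(enrich) output in section 2.1.2, and the name `toy` is hypothetical.

```r
# Toy copy of three columns of the example enrichment result.
toy <- data.frame(
  DOTerm   = c("mild cognitive impairment", "dementia", "Alzheimer's disease",
               "Parkinson's disease", "tauopathy", "traumatic brain injury"),
  p        = c(9.223038e-16, 1.754624e-14, 3.107814e-14,
               3.892282e-13, 2.813577e-12, 3.859267e-11),
  p.adjust = c(4.439048e-12, 4.222502e-11, 4.985970e-11,
               4.683388e-10, 2.708349e-09, 3.095775e-08)
)

P <- 1      # cutoff on the raw P-value (1 keeps everything)
Q <- 1e-10  # cutoff on the corrected P-value

# Keep only terms that pass both cutoffs.
kept <- toy[toy$p <= P & toy$p.adjust <= Q, ]
kept$DOTerm
```

With the default P = 1 and Q = 1, every term passes and the whole table is written.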

3 Visualization of enrichment results

EnrichDO provides four methods to visualize enrichment results: bar plot (drawBarGraph), bubble plot (drawPointGraph), tree plot (drawGraphViz) and heatmap (drawHeatmap). These show the results concisely and clearly. Pay attention to the threshold setting of each plot type: if the threshold is too low, few or no nodes pass it and the plot shows too little.

3.1 drawBarGraph function

drawBarGraph draws the top n nodes with the most significant p-values as a bar chart; only nodes whose p-value is less than delta are drawn (by default, n is 10 and delta is 1e-15).

drawBarGraph(enrich,n=10,delta = 0.05)

Figure 1: bar plot
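The interaction between n and delta can be sketched with plain vectors. The code below is an assumed reading of the selection rule described above (keep nodes with p below delta, then take the n most significant), not the package's plotting code; the P-values are the ones shown by head(enrich) in section 2.1.2, and the helper name `select_nodes` is hypothetical.

```r
# Raw P-values of six terms, named by DOID.
p <- c("DOID:0080832" = 9.223038e-16, "DOID:1307"    = 1.754624e-14,
       "DOID:10652"   = 3.107814e-14, "DOID:14330"   = 3.892282e-13,
       "DOID:680"     = 2.813577e-12, "DOID:0081292" = 3.859267e-11)

# Hypothetical helper: nodes below delta, most significant first, at most n.
select_nodes <- function(p, n = 10, delta = 1e-15) {
  head(sort(p[p < delta]), n)
}

length(select_nodes(p))                 # default delta: one node remains
length(select_nodes(p, delta = 1e-12))  # four nodes remain
length(select_nodes(p, delta = 0.05))   # all six nodes remain
```

For this small example the default delta leaves a single node, which is why the calls in this section pass a larger delta such as 0.05.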

3.2 drawPointGraph function

drawPointGraph draws the top n nodes with the most significant p-values as a bubble plot; only nodes whose p-value is less than delta are drawn (by default, n is 10 and delta is 1e-15).

drawPointGraph(enrich,n=10,delta = 0.05)

Figure 2: point plot

3.3 drawGraphViz function

drawGraphViz draws the DAG structure of the n most significant nodes, and labelfontsize sets the font size of the node labels (by default, n is 10 and labelfontsize is 14). The text in each node is the name of the corresponding doterm.

In addition, the drawGraphViz function can display the P-value of each node in the enrichment analysis (pview=TRUE) and the number of genes shared by each doterm and the interest set (numview=TRUE).


drawGraphViz(enrich, n=10, numview=FALSE, pview=FALSE,labelfontsize = 17)
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:6] "DOID:680" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:6] "DOID:0050890" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" ...
#>  chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:5] "DOID:936" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#>  chr [1:7] "DOID:649" "DOID:0050117" "DOID:936" "DOID:4" "DOID:331" ...
#>  chr [1:4] "DOID:0080599" "DOID:934" "DOID:0050117" "DOID:4"
#>  chr [1:4] "DOID:2468" "DOID:1561" "DOID:150" "DOID:4"
#>  chr [1:2] "DOID:0014667" "DOID:4"

Figure 3: tree plot

3.4 drawHeatmap function

The drawHeatmap function visualizes the strength of the relationship between the top DOID_n nodes of the enrichment results and the gene_n genes with the highest weight sums in these nodes. Only genes that belong to the interest gene set are displayed. readable indicates whether genes are displayed by their symbols.

drawHeatmap also accepts additional parameters of the pheatmap function, which you can set according to your needs. By default, DOID_n is 10, gene_n is 50, fontsize_row is 10 and readable is TRUE.

drawHeatmap(interestGenes = demo.data,
            enrich        = enrich,
            gene_n        = 10,
            fontsize_row  = 8,
            readable      = TRUE)
#> gene symbol conversion result: 
#> 
#> 'select()' returned 1:1 mapping between keys and columns

Figure 4: heatmap
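The idea of ranking genes by their weight sum across the top nodes can be sketched as follows. This is a simplified illustration under assumptions, not the drawHeatmap implementation: the node names and the weights are made up, and only the gene IDs come from demo.data.

```r
# Hypothetical per-node gene weights (names are Entrez IDs from demo.data).
node_weights <- list(
  nodeA = c("351" = 1.0, "5663" = 0.9, "348" = 0.8),
  nodeB = c("351" = 0.9, "5663" = 0.8),
  nodeC = c("348" = 1.0, "4137" = 0.7)
)

# Pool all weights, then sum them per gene.
w <- unlist(unname(node_weights))
gene_sum <- vapply(split(w, names(w)), sum, numeric(1))

# Keep the gene_n genes with the largest weight sums.
gene_n <- 2
top <- head(sort(gene_sum, decreasing = TRUE), gene_n)
names(top)
```

Genes that carry high weights in several of the top nodes rank first, so lowering gene_n keeps the heatmap focused on the genes that contribute most to the enriched terms.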

3.5 Convenient drawing

Plots (drawBarGraph, drawPointGraph, drawGraphViz) can be drawn directly from a writeResult output file, so you do not have to wait for the algorithm to run again.

#First, read the writeResult output file using the following two lines
#data<-read.delim(yourfile)
#doEnrich(result_do = data)

#Then, use the drawing function you need
drawGraphViz(enrich)    #Tree diagram
drawPointGraph(enrich)  #Bubble diagram
drawBarGraph(enrich)    #Bar plot