EnrichDO 0.99.2
Disease Ontology (DO) enrichment analysis is an effective means of discovering associations between genes and diseases. However, most DO-based enrichment methods cannot resolve the over-enrichment problem caused by the “true-path” rule. To address this problem, we propose a global weighted model termed EnrichDO. Based on the latest annotation of the human genome with DO terms, EnrichDO aims to identify locally significant enriched nodes by comprehensively considering the DO graph topology and assigning different initial weights and dynamic weights to annotated genes. EnrichDO encompasses a variety of statistical models and visualization schemes for discovering disease-gene relationships in large-scale biological data. The R package is currently uploaded to Bioconductor, and we anticipate that it will provide more convenient and reliable analysis outcomes.
library(EnrichDO)
#>
EnrichDO supports enrichment analysis of the Disease Ontology (DO). The ontology structure is used to construct a directed acyclic graph (DAG), and a weight reduction algorithm iterates layer by layer over the DAG to reduce the over-enrichment caused by the “true-path” rule. A variety of statistical models for over-representation calculation and P-value correction methods are also provided.
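To make the “true-path” rule concrete: a gene annotated to a term is implicitly annotated to every ancestor of that term, so broad ancestor terms accumulate genes and tend to look enriched alongside their more specific descendants. The base R sketch below illustrates this on a toy ontology. The term names A to D and the genes g1 to g4 are hypothetical, and the code is only an illustration of the rule, not EnrichDO's implementation.

```r
# Toy ontology given as child -> parent (hypothetical IDs, not real DO terms).
# A is the root; B and C are children of A; D is a child of B.
parent <- c(B = "A", C = "A", D = "B")

# Genes annotated directly to each term (hypothetical gene names).
direct <- list(A = character(0), B = "g1", C = "g2", D = c("g3", "g4"))

# Walk up the parent chain and collect all ancestors of a term.
ancestors <- function(term) {
  out <- character(0)
  while (term %in% names(parent)) {
    term <- parent[[term]]
    out <- c(out, term)
  }
  out
}

# True-path rule: every gene of a term is also counted for all its ancestors.
propagated <- direct
for (term in names(direct)) {
  for (anc in ancestors(term)) {
    propagated[[anc]] <- union(propagated[[anc]], direct[[term]])
  }
}

# Number of genes per term after propagation.
lengths(propagated)
```

The root term A ends up with every gene in the toy ontology although none was annotated to it directly, which is the kind of inflation that the layer-by-layer weight reduction is meant to counteract.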
In EnrichDO, we implemented doEnrich to realize the enrichment analysis of ontology by combining topological properties of ontology graph structure.
doEnrich has ten parameters:
interestGenes is the set of protein-coding genes of interest, given as Entrez IDs.
test sets the statistical model for the over-representation calculation, which can be “fisherTest”, “hypergeomTest”, “binomTest”, “chisqTest” or “logoddTest” (default is “hypergeomTest”).
method sets the P-value correction method, which can be “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr” or “none” (default is “BH”).
m sets the maximum number of ancestor layers for ontology enrichment (default is 1).
maxGsize and minGsize set size limits for DO terms: terms with more annotated genes than maxGsize or fewer than minGsize are ignored, and their P-value is set to 1 (default maxGsize is 5000, minGsize is 5).
traditional is a logical parameter: TRUE for traditional enrichment analysis, FALSE for enrichment analysis with weights (default is FALSE).
delta sets the significance threshold for nodes. If the P-value of a DO term is greater than delta, the node is not significant and is not weighted (default is 0.01).
result_do receives the contents of the file written by the writeResult function and is used to visualize the enrichment results without running the enrichment again (default is NULL).
penalize is a logical value. If TRUE, the algorithm reduces the weight again for nodes that are not significant by comparison (default is TRUE).
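As a rough illustration of what the test and method parameters control, the sketch below computes a textbook, unweighted hypergeometric over-representation P-value with base R and then corrects a small vector of P-values. The counts 160, 13 and 9 are taken from the gene.len, ig.len and cg.len values of one row in the example output of this vignette; the background size N and the additional raw P-values are assumptions made for illustration. Because EnrichDO applies gene weights by default, this calculation is not expected to reproduce the package's reported values.

```r
# Counts for a single DO term. N is an assumed background size.
N <- 20000  # genes in the background (assumption)
K <- 160    # background genes annotated to the DO term
n <- 13     # genes in the interest set
k <- 9      # interest genes annotated to the DO term

# Upper tail P(X >= k) of the hypergeometric distribution.
p_term <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Multiple-testing correction over several terms
# (the other raw p-values are hypothetical).
p_raw  <- c(p_term, 1e-4, 0.003, 0.02, 0.4)
p_bh   <- p.adjust(p_raw, method = "BH")
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(p_raw, p_bh, p_holm)
```

The method names accepted by doEnrich match those of stats::p.adjust, which is why p.adjust is used here; Holm is the more conservative of the two corrections shown.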
In the following example, several genes (demo.data) are randomly selected from the protein-coding genes for analysis. All doEnrich parameters are set to their defaults.
demo.data <- c(1636, 351, 102, 2932, 3077, 348, 4137, 54209, 5663, 5328, 23621, 3416, 3553)
doEnrich(interestGenes = demo.data,
         test = "hypergeomTest",
         method = "BH",
         m = 1,
         maxGsize = 5000,
         minGsize = 5,
         traditional = FALSE,
         delta = 0.01,
         result_do = NULL,
         penalize = TRUE)
# [1] "Descending rights test"
# LEVEL: 13 1 nodes 72 genes to be scored
# LEVEL: 12 2 nodes 457 genes to be scored
# LEVEL: 11 3 nodes 907 genes to be scored
# LEVEL: 10 13 nodes 2279 genes to be scored
# LEVEL: 9 54 nodes 6504 genes to be scored
# LEVEL: 8 130 nodes 9483 genes to be scored
# LEVEL: 7 198 nodes 11209 genes to be scored
# LEVEL: 6 220 nodes 12574 genes to be scored
# LEVEL: 5 198 nodes 12936 genes to be scored
# LEVEL: 4 103 nodes 12824 genes to be scored
# LEVEL: 3 30 nodes 11683 genes to be scored
# LEVEL: 2 5 nodes 8032 genes to be scored
# LEVEL: 1 0 nodes 0 genes to be scored
# [1] "BH"
# [1] "hypergeomTest"
From the above output results, we can observe the nodes and total genes involved in each layer of DAG structure, as well as the enrichment analysis method and statistical test model used.
By default, the enrichment result of the doEnrich function is stored in enrich. It can also be assigned to a different variable to avoid being overwritten when doEnrich is run again.
weight_result <- doEnrich(interestGenes = demo.data)
The results of the weighted enrichment analysis algorithm are as follows:
head(enrich)
#> DOID level gene.arr weight.arr parent.arr parent.len
#> 1 DOID:0080832 4 23607, 5.... 1, 1, 1,.... DOID:1561 1
#> 2 DOID:1307 4 5021, 67.... 0.9, 0.9.... DOID:1561 1
#> 3 DOID:10652 7 355, 965.... 1, 0.9, .... DOID:680 1
#> 4 DOID:14330 7 5063, 66.... 0.8, 1, .... DOID:0050890 1
#> 5 DOID:680 6 355, 965.... 0.9, 0.8.... DOID:1289 1
#> 6 DOID:0081292 6 5660, 63.... 1, 1, 1,.... DOID:936 1
#> child.arr child.len gene.len DOTerm gene.w
#> 1 0 160 mild cognitive impairment 1, 1, 1,....
#> 2 DOID:122.... 3 753 dementia 0.9, 0.9....
#> 3 0 1388 Alzheimer's disease 1, 0.9, ....
#> 4 DOID:0060892 1 769 Parkinson's disease 0.8, 1, ....
#> 5 DOID:008.... 2 1396 tauopathy 0.740051....
#> 6 0 174 traumatic brain injury 1, 1, 1,....
#> p cg.arr cg.len ig.len p.adjust
#> 1 9.223038e-16 5663, 35.... 9 13 4.439048e-12
#> 2 1.754624e-14 351, 413.... 12 13 4.222502e-11
#> 3 3.107814e-14 3416, 56.... 13 13 4.985970e-11
#> 4 3.892282e-13 2932, 34.... 11 13 4.683388e-10
#> 5 2.813577e-12 3416, 56.... 13 13 2.708349e-09
#> 6 3.859267e-11 351, 163.... 7 13 3.095775e-08
The result of doEnrich consists of the data frames enrich and doterms, which are written into the environment. The enrich data frame has 16 columns:
the DOterm ID on enrichment (DOID),
the hierarchy of the DOterm in the DAG graph (level),
all genes related to the DOterm (gene.arr),
gene weights in each node (weight.arr),
the parent nodes of the DOterm (parent.arr) and their number (parent.len),
child nodes of the DOterm (child.arr) and its number (child.len),
the number of all genes related to the DOterm (gene.len),
the standard name of the DOterm (DOTerm),
the weight of annotated genes (gene.w),
the P-value of the DOterm (p), by which enrich is sorted, and the corrected P-value (p.adjust),
the genes of interest annotated to this DOterm (cg.arr) and its number (cg.len),
the number of genes in the interest gene set (ig.len).
The data frame doterms contains the information of the disease ontology for DAG construction. doterms has ten columns including DOID, level, gene.arr, weight.arr, parent.arr, parent.len, child.arr, child.len, gene.len, and DOTerm.
head(doterms)
#> DOID level gene.arr weight.arr parent.arr parent.len
#> 1 DOID:0001816 7 7122, 20.... 1, 1, 1,.... DOID:175 1
#> 2 DOID:0002116 7 7442, 61.... 1, 1, 1,.... DOID:10124 1
#> 3 DOID:0014667 2 8772, 71.... 0.8, 0.9.... DOID:4 1
#> 4 DOID:0040001 9 3119, 31.... 1, 1, 1 DOID:0060524 1
#> 5 DOID:0040083 8 4973, 5468 1, 1 DOID:874 1
#> 6 DOID:0040085 4 7099, 44.... 1, 1, 1,.... DOID:104 1
#> child.arr child.len gene.len DOTerm
#> 1 0 56 angiosarcoma
#> 2 0 105 pterygium
#> 3 DOID:006.... 3 3513 disease of metabolism
#> 4 0 3 shrimp allergy
#> 5 0 2 Chlamydia pneumonia
#> 6 0 6 bacterial sepsis
1. Weighted enrichment analysis with multiple parameters. Each parameter in the following example applies to enrichment analysis with weights.
doEnrich(interestGenes = demo.data,
         test = "hypergeomTest",
         method = "holm",
         m = 1,
         minGsize = 5,
         maxGsize = 500,
         delta = 0.01,
         penalize = TRUE)
2. The parameter penalize is used to alleviate the impact of different magnitudes of p-values (default is TRUE). When it is set to FALSE, the weight reduction for non-significant nodes is smaller, which slightly increases the significance of these nodes, i.e., their p-values are reduced.
doEnrich(interestGenes = demo.data, penalize = FALSE)
3. The traditional enrichment analysis method does not reduce weights according to the DAG structure. The parameters test, method, m, maxGsize and minGsize can still be used flexibly.
doEnrich(demo.data, traditional = TRUE)
# [1] "Traditional test"
# [1] "BH"
# [1] "hypergeomTest"
writeDoTerms can output DOID, DOTerm, level, genes, parents, children, gene.len, parent.len and child.len in the data frame doterms as text. The default file name is “doterms.txt”.
writeDoTerms(doterms,file = "doterms.txt")
The writeResult function can output DOID, DOTerm, p, p.adjust, geneRatio, bgRatio and cg in the data frame enrich as text. The default file name is “result.txt”.
geneRatio is the number of interest genes annotated to the DO term divided by the size of the interest gene set, and bgRatio is the number of all genes annotated to the DO term divided by the size of the background gene set.
writeResult has four parameters. enrich is the enrichment result of doEnrich, and file is the path of the output file. The parameters Q and P set output thresholds: a DO term is written only when its p.adjust is less than or equal to Q and its p-value is less than or equal to P. The default values for P and Q are 1.
writeResult(enrich,file = "result.txt",Q=1,P=1)
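The two ratios and the P/Q thresholds described above can be sketched in base R on a small table. The rows reuse gene.len, cg.len, ig.len, p and p.adjust values from the example output shown earlier; the background size bg_size is a hypothetical value, and the numeric form of the ratios is an assumption, since the exact layout that writeResult writes may differ.

```r
# Three rows copied from the example enrichment output above.
toy <- data.frame(
  DOTerm   = c("mild cognitive impairment", "dementia", "Alzheimer's disease"),
  gene.len = c(160, 753, 1388),   # all genes annotated to the term
  cg.len   = c(9, 12, 13),        # interest genes annotated to the term
  ig.len   = 13,                  # size of the interest gene set
  p        = c(9.223038e-16, 1.754624e-14, 3.107814e-14),
  p.adjust = c(4.439048e-12, 4.222502e-11, 4.985970e-11)
)

bg_size <- 20000  # hypothetical background size (assumption)

# Ratios as defined in the text.
toy$geneRatio <- toy$cg.len / toy$ig.len
toy$bgRatio   <- toy$gene.len / bg_size

# Keep a term only if p <= P and p.adjust <= Q.
filter_terms <- function(res, P = 1, Q = 1) {
  res[res$p <= P & res$p.adjust <= Q, , drop = FALSE]
}

kept_all    <- filter_terms(toy)             # defaults keep every term
kept_strict <- filter_terms(toy, P = 1e-14)  # stricter raw p-value cutoff
kept_strict[, c("DOTerm", "geneRatio", "bgRatio", "p", "p.adjust")]
```

With the default thresholds of 1 nothing is filtered out, which matches the writeResult call above; tightening P or Q shortens the written table.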
EnrichDO provides four methods to visualize enrichment results: bar plot (drawBarGraph), bubble plot (drawPointGraph), tree plot (drawGraphViz) and heatmap (drawHeatmap), which show the results concisely and clearly. Pay attention to the threshold setting of each plot type: if the threshold is too low, too few nodes are displayed.
drawBarGraph draws the top n nodes with the most significant p-values as a bar chart; only nodes whose p-value is less than delta are drawn (by default, n is 10 and delta is 1e-15).
drawBarGraph(enrich, n = 10, delta = 0.05)
Figure 1: bar plot
drawPointGraph draws the top n nodes with the most significant p-values as a bubble plot; only nodes whose p-value is less than delta are drawn (by default, n is 10 and delta is 1e-15).
drawPointGraph(enrich, n = 10, delta = 0.05)
Figure 2: point plot
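The selection rule shared by the bar and bubble plots (p-value below delta, then the n most significant nodes) can be sketched in base R as follows. The p-values are the six shown in the example output earlier; the helper top_n_terms is a hypothetical name introduced here for illustration and is not a function of the package.

```r
# The six most significant terms from the example output above.
toy <- data.frame(
  DOTerm = c("mild cognitive impairment", "dementia", "Alzheimer's disease",
             "Parkinson's disease", "tauopathy", "traumatic brain injury"),
  p = c(9.223038e-16, 1.754624e-14, 3.107814e-14,
        3.892282e-13, 2.813577e-12, 3.859267e-11)
)

# Hypothetical helper: keep nodes with p < delta, then take the n smallest.
top_n_terms <- function(res, n = 10, delta = 1e-15) {
  res <- res[res$p < delta, , drop = FALSE]
  res <- res[order(res$p), , drop = FALSE]
  head(res, n)
}

strict_sel <- top_n_terms(toy)                # delta = 1e-15
loose_sel  <- top_n_terms(toy, delta = 0.05)  # delta as used in the calls above

nrow(strict_sel)
nrow(loose_sel)
```

On these values only one term passes the default delta of 1e-15, while all six pass 0.05, which illustrates why the example calls relax delta.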
drawGraphViz draws the DAG structure of the n most significant nodes, and labelfontsize sets the font size of the node labels (by default, n is 10 and labelfontsize is 14). The text in each node is the name of the corresponding DO term.
In addition, the drawGraphViz function can display the P-value of each node in the enrichment analysis (pview=TRUE) and the number of genes shared by each DO term and the interest set (numview=TRUE).
drawGraphViz(enrich, n=10, numview=FALSE, pview=FALSE,labelfontsize = 17)
#> chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#> chr [1:3] "DOID:1561" "DOID:150" "DOID:4"
#> chr [1:6] "DOID:680" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#> chr [1:6] "DOID:0050890" "DOID:1289" "DOID:331" "DOID:863" "DOID:7" ...
#> chr [1:5] "DOID:1289" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#> chr [1:5] "DOID:936" "DOID:331" "DOID:863" "DOID:7" "DOID:4"
#> chr [1:7] "DOID:649" "DOID:0050117" "DOID:936" "DOID:4" "DOID:331" ...
#> chr [1:4] "DOID:0080599" "DOID:934" "DOID:0050117" "DOID:4"
#> chr [1:4] "DOID:2468" "DOID:1561" "DOID:150" "DOID:4"
#> chr [1:2] "DOID:0014667" "DOID:4"
Figure 3: tree plot
drawHeatmap visualizes the strength of the relationship between the top DOID_n nodes from the enrichment results and the genes whose weight sums rank in the top gene_n in these nodes. The genes displayed must be included in the interest gene set. readable indicates whether genes are displayed as their symbols.
drawHeatmap also accepts additional parameters from the pheatmap function, which you can set according to your needs. By default, DOID_n is 10, gene_n is 50, fontsize_row is 10, and readable is TRUE.
drawHeatmap(interestGenes = demo.data,
            enrich = enrich,
            gene_n = 10,
            fontsize_row = 8,
            readable = TRUE)
#> gene symbol conversion result:
#>
#> 'select()' returned 1:1 mapping between keys and columns
Figure 4: heatmap
The plots (drawBarGraph, drawPointGraph, drawGraphViz) can also be drawn from writeResult output files, so you don’t have to wait for the algorithm to run again.
# First, read the writeResult output file using the following two lines
# data <- read.delim(yourfile)
# doEnrich(result_do = data)
# Then use the drawing function you need
drawGraphViz(enrich)    # Tree diagram
drawPointGraph(enrich)  # Bubble diagram
drawBarGraph(enrich)    # Bar plot
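The file round trip behind this workflow can be sketched with base R alone. Here write.table stands in for writeResult, and the tab-delimited layout with a header row is an assumption inferred from the use of read.delim above; the actual columns written by writeResult may differ. The two rows reuse values from the example output.

```r
# Stand-in for an enrichment result (values from the example output above).
toy <- data.frame(
  DOID     = c("DOID:0080832", "DOID:1307"),
  DOTerm   = c("mild cognitive impairment", "dementia"),
  p        = c(9.223038e-16, 1.754624e-14),
  p.adjust = c(4.439048e-12, 4.222502e-11)
)

# Write a tab-delimited text file (assumed layout, not writeResult itself).
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back the same way the commented lines above do.
data <- read.delim(path)
str(data)

# With EnrichDO loaded, the next step would be the call shown above:
# doEnrich(result_do = data)
```

Reading the saved table is fast compared with rerunning the enrichment, which is the point of the result_do parameter.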