---
title: "MeSH Enrichment and Semantic Analyses"
author: "\\ Guangchuang Yu (<guangchuangyu@gmail.com>)\\ School of Public Health, The University of Hong Kong"
date: "`r Sys.Date()`"
bibliography: meshes.bib
csl: nature.csl
output:
  BiocStyle::html_document:
    toc: true
  BiocStyle::pdf_document:
    toc: true
vignette: >
  % \VignetteIndexEntry{An introduction to meshes}
  % \VignetteEngine{knitr::rmarkdown}
  %\VignetteDepends{MeSH.Hsa.eg.db}
  %\VignetteDepends{MeSH.db}
  % \usepackage[utf8]{inputenc}
---

```{r style, echo=FALSE, results="asis", message=FALSE}
BiocStyle::markdown()
knitr::opts_chunk$set(tidy = FALSE,
                      warning = FALSE,
                      message = FALSE)
```

```{r echo=FALSE, results="hide", message=FALSE}
library(MeSH.Hsa.eg.db)
library(MeSH.db)
library(DOSE)
library(meshes)
```

# Introduction

MeSH (Medical Subject Headings) is the NLM (U.S. National Library of Medicine) controlled vocabulary used to manually index articles for MEDLINE/PubMed. MeSH is a comprehensive life science vocabulary. MeSH has 19 categories and `MeSH.db` contains 16 of them.
That is:

|Abbreviation|Category|
|------------|--------|
|A|Anatomy|
|B|Organisms|
|C|Diseases|
|D|Chemicals and Drugs|
|E|Analytical, Diagnostic and Therapeutic Techniques and Equipment|
|F|Psychiatry and Psychology|
|G|Phenomena and Processes|
|H|Disciplines and Occupations|
|I|Anthropology, Education, Sociology and Social Phenomena|
|J|Technology and Food and Beverages|
|K|Humanities|
|L|Information Science|
|M|Persons|
|N|Health Care|
|V|Publication Type|
|Z|Geographical Locations|

MeSH terms are associated with Entrez Gene IDs by three methods: `gendoo`, `gene2pubmed` and `RBBH` (Reciprocal Blast Best Hit).

|Method|How Entrez Gene IDs are mapped to MeSH IDs|
|------|-------------------------------------------------|
|Gendoo|Text-mining|
|gene2pubmed|Manual curation by NCBI teams|
|RBBH|Sequence homology with BLASTP search (E-value < 10<sup>-50</sup>)|

# Enrichment Analysis

`meshes` supports enrichment analysis (over-representation analysis and gene set enrichment analysis) of a gene list or a whole expression profile using MeSH annotation. Data from the `gendoo`, `gene2pubmed` and `RBBH` sources are all supported. Users can select the category of interest to test. All 16 categories are supported.
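
To make the idea of an over-representation test concrete, the sketch below computes a one-sided hypergeometric p-value for a single MeSH term in base R. It is only an illustration: all four counts are invented, the variable names are hypothetical, and the procedure actually used by `enrichMeSH` is the one described in the DOSE vignette.

```{r}
## Toy over-representation test for one MeSH term.
## All counts below are made up for illustration.
N <- 10000  # background genes carrying any MeSH annotation
M <- 200    # background genes annotated to the term
n <- 100    # genes in the list of interest
k <- 8      # genes of interest annotated to the term

## P(X >= k) under the hypergeometric distribution;
## phyper(k - 1, ..., lower.tail = FALSE) returns P(X > k - 1).
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

## The same quantity from a one-sided Fisher's exact test on the
## 2 x 2 table (rows: in list / not in list; columns: in term / not in term).
tab <- matrix(c(k, M - k, n - k, N - M - (n - k)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

all.equal(p_value, p_fisher)
```

With these toy counts, about two annotated genes would be expected by chance (100 × 200 / 10000), so observing eight gives a small p-value. In a real analysis one such test is run per MeSH term, and the resulting p-values are then adjusted for multiple testing.
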
The analysis supports more than 70 species listed in [MeSHDb BiocView](https://bioconductor.org/packages/release/BiocViews.html#___MeSHDb).

For algorithm details, please refer to the vignettes of the `r Biocpkg("DOSE")`[@yu_dose_2015] package.

```{r}
library(meshes)
data(geneList, package="DOSE")
## use the first 100 genes of geneList as the gene list of interest
de <- names(geneList)[1:100]
x <- enrichMeSH(de, MeSHDb = "MeSH.Hsa.eg.db", database='gendoo', category = 'C')
head(x)
```

In the over-representation analysis above, we use data from `gendoo` and the `C` (Diseases) category.

In the following example, we use data from `gene2pubmed` and test category `G` (Phenomena and Processes) using GSEA.

```{r}
y <- gseMeSH(geneList, MeSHDb = "MeSH.Hsa.eg.db", database = 'gene2pubmed', category = "G")
head(y)
```

Users can apply the visualization methods implemented in [DOSE](https://guangchuangyu.github.io/DOSE) (i.e. `barplot`, `dotplot`, `cnetplot`, `enrichMap`, `upsetplot` and `gseaplot`) to these enrichment results. These plots make the enriched results much easier to interpret.

```{r}
dotplot(x)
gseaplot(y, y[1,1], title=y[1,2])
```

# Semantic Similarity

`meshes` implements four IC-based methods (i.e. Resnik[@philip_semantic_1999], Jiang[@jiang_semantic_1997], Lin[@lin_information-theoretic_1998] and Schlicker[@schlicker_new_2006]) and one graph-structure-based method (i.e. Wang[@wang_new_2007]). For algorithm details, please refer to the vignette of the `r Biocpkg("GOSemSim")` package[@yu2010].

The `meshSim` function measures semantic similarity between two vectors of MeSH terms.

```{r}
library(meshes)
## hsamd <- meshdata("MeSH.Hsa.eg.db", category='A', computeIC=TRUE, database="gendoo")
data(hsamd)
meshSim("D000009", "D009130", semData=hsamd, measure="Resnik")
meshSim("D000009", "D009130", semData=hsamd, measure="Rel")
meshSim("D000009", "D009130", semData=hsamd, measure="Jiang")
meshSim("D000009", "D009130", semData=hsamd, measure="Wang")

## two vectors of MeSH terms
meshSim(c("D001369", "D002462"), c("D017629", "D002890", "D008928"), semData=hsamd, measure="Wang")
```

The `geneSim` function measures semantic similarity between two vectors of genes.

```{r}
geneSim("241", "251", semData=hsamd, measure="Wang", combine="BMA")
geneSim(c("241", "251"), c("835", "5261","241", "994"), semData=hsamd, measure="Wang", combine="BMA")
```

# Related tools

## Enrichment analysis

- `r Biocpkg("DOSE")`[@yu_dose_2015]
- `r Biocpkg("clusterProfiler")`[@yu2012]
- `r Biocpkg("ReactomePA")`[@yu_reactomepa_2016]

## Semantic similarity measurement

- `r Biocpkg("GOSemSim")`[@yu2010]
- `r Biocpkg("DOSE")`[@yu_dose_2015]

# Session Information

Here is the output of `sessionInfo()` on the system on which this document was compiled:

```{r echo=FALSE}
sessionInfo()
```

# References