This package serves as a query interface for important community collections of
small molecules, while also allowing users to include custom compound
collections. Both annotation and structure information are provided. The
annotation data is stored in an SQLite database, while the structure
information is stored in Structure Data Files (SDF). Both are hosted
on Bioconductor’s AnnotationHub. A detailed description of the included
data types is provided under the Supplemental Material section of this vignette.
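At the time of writing, the following community databases are included:
DrugAge (Barardo et al. 2017), DrugBank (Wishart et al. 2018), CMAP02
(Lamb et al. 2006) and LINCS (Subramanian et al. 2017).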
In addition to providing access to the above compound collections, the package
supports the integration of custom collections of compounds, which are
automatically stored for the user in the same data structure as the
preconfigured databases. Both custom collections and those provided by this
package can be queried in a uniform manner, and then further analyzed with
cheminformatics packages such as ChemmineR, where SDFs are imported into
flexible S4 containers (Cao et al. 2008).
As a Bioconductor package, customCMPdb can be installed with the
BiocManager::install() function.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("customCMPdb")
To obtain the most recent updates of the package immediately, one can also install it directly from GitHub as follows.
devtools::install_github("yduan004/customCMPdb", build_vignettes=TRUE)
Next, the package needs to be loaded in a user’s R session.
library(customCMPdb)
library(help = "customCMPdb")  # Lists package info
Open the vignette of this package.
browseVignettes("customCMPdb")  # Opens vignette
The following introduces how to load and query the different datasets.
The compound annotation tables are stored in an SQLite database. This data can be
loaded into a user’s R session as follows (here for drugAgeAnnot).
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("customCMPdb", "annot_0.1"))
## AnnotationHub with 1 record
## # snapshotDate(): 2020-10-26
## # names(): AH79563
## # $dataprovider: DrugAge, DrugBank, Broad Institute, Broad Institute
## # $species: Homo sapiens
## # $rdataclass: character
## # $rdatadateadded: 2020-10-14
## # $title: annot_0.1
## # $description: SQLite database containing compound annotations from four re...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: TSV
## # $sourceurl: https://bit.ly/3dCWKWo, https://github.com/yduan004/drugbankR,...
## # $sourcesize: NA
## # $tags: c("annot", "customCMPdb") 
## # retrieve record with 'object[["AH79563"]]'
annot_path <- ah[["AH79563"]]
library(RSQLite)
conn <- dbConnect(SQLite(), annot_path)
dbListTables(conn)
## [1] "DrugBankAnnot" "cmapAnnot"     "drugAgeAnnot"  "id_mapping"   
## [5] "lincsAnnot"
drugAgeAnnot <- dbReadTable(conn, "drugAgeAnnot")
head(drugAgeAnnot)
##   drugage_id                  compound_name    synonyms                 species
## 1   ida00001                        Vitexin        <NA>  Caenorhabditis elegans
## 2   ida00002                  Cyclosporin A        <NA>  Caenorhabditis elegans
## 3   ida00003                      Histidine L-histidine  Caenorhabditis elegans
## 4   ida00004                        SRT1720        <NA>            Mus musculus
## 5   ida00005 Cordyceps sinensis oral liquid        <NA> Drosophila melanogaster
## 6   ida00006                         Lysine        <NA>  Caenorhabditis elegans
##     strain                dosage avg_lifespan_change max_lifespan_change gender
## 1       N2                 50 µM                   8                 5.3   <NA>
## 2     <NA>                 88 µM                  18                <NA>   <NA>
## 3       N2                  5 mM                  10                <NA>   <NA>
## 4 C57BL/6J 100 mg/kg body weight                 8.8                   0   <NA>
## 5 Oregon-K            0.20 mg/ml                  32                15.4   Male
## 6       N2                  5 mM                   8                <NA>   <NA>
##   significance pubmed_id Comment pref_name     pubchem_cid      DrugBank_id
## 1         <NA>  26535084    <NA>   VITEXIN         5280441             <NA>
## 2         <NA>  24134630    <NA>      <NA>            <NA>          DB00091
## 3         <NA>  25643626    <NA> HISTIDINE   6274, 6971009          DB00117
## 4         <NA>  24582957    <NA>      <NA>            <NA>             <NA>
## 5         <NA>  26239097     Mix      <NA>            <NA>             <NA>
## 6         <NA>  25643626    <NA>    LYSINE 5962, 122198194 DB00123, DB11101
dbDisconnect(conn)
The corresponding structures for the above DrugAge example can be loaded into an SDFset
object as follows.
query(ah, c("customCMPdb", "drugage_build2"))
da_path <- ah[["AH79564"]]
da_sdfset <- ChemmineR::read.SDFset(da_path)
Instructions on how to work with SDFset objects are provided in the ChemmineR
vignette. For instance, one can plot any of
the loaded structures with the plot function.
ChemmineR::cid(da_sdfset) <- ChemmineR::sdfid(da_sdfset)
ChemmineR::plot(da_sdfset[1])
The SDF from DrugBank can be loaded into R the same way. The
corresponding SDF file was downloaded from the DrugBank database. During the
import into R, ChemmineR checks the validity of the imported compounds.
query(ah, c("customCMPdb", "drugbank_5.1.5"))
db_path <- ah[["AH79565"]]
db_sdfset <- ChemmineR::read.SDFset(db_path)
The import of the SDF of the CMAP02 database works the same way.
query(ah, c("customCMPdb", "cmap02"))
cmap_path <- ah[["AH79566"]]
cmap_sdfset <- ChemmineR::read.SDFset(cmap_path)
The same applies to the SDF of the small molecules included in the LINCS database.
query(ah, c("customCMPdb", "lincs_pilot1"))
lincs_path <- ah[["AH79567"]]
lincs_sdfset <- ChemmineR::read.SDFset(lincs_path)
For reproducibility, the R code for generating the above datasets is included
in the inst/scripts/make-data.R file of this package. The file location
on a user’s system can be obtained with system.file("scripts/make-data.R", package="customCMPdb").
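Some ID columns of the drugAgeAnnot table shown above, such as pubchem_cid and DrugBank_id, hold several comma-separated IDs in a single field (for example "6274, 6971009"). The following minimal sketch shows one way to expand such fields into one row per ID with base R. It works on a small excerpt copied by hand from the output above rather than on the database itself, and the helper name split_ids is hypothetical and not part of customCMPdb.

```r
# Toy excerpt copied from the drugAgeAnnot output above (not read from the database)
da <- data.frame(
  drugage_id  = c("ida00001", "ida00003", "ida00006"),
  pubchem_cid = c("5280441", "6274, 6971009", "5962, 122198194"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand a comma-separated ID column into one row per ID
split_ids <- function(df, id_col, key_col) {
  ids <- strsplit(df[[id_col]], ",\\s*")   # one character vector per input row
  out <- data.frame(
    rep(df[[key_col]], lengths(ids)),      # repeat each key once per ID
    unlist(ids),
    stringsAsFactors = FALSE
  )
  names(out) <- c(key_col, id_col)
  out
}

da_long <- split_ids(da, id_col = "pubchem_cid", key_col = "drugage_id")
da_long
```

The long format makes it straightforward to join DrugAge entries to other resources on a single PubChem CID per row.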
The SQLite Annotation Database is hosted on Bioconductor’s AnnotationHub.
Users can download it to a local AnnotationHub cache directory. The path to the
downloaded file can be obtained as follows.
library(AnnotationHub)
ah <- AnnotationHub()
annot_path <- ah[["AH79563"]]
The following introduces how users can import their own compound annotation
tables into the SQLite database. In this case, the corresponding
ChEMBL IDs need to be included under the chembl_id column.
The name of the custom data set can be specified under the annot_name
argument. Note that this name is case insensitive.
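The two requirements above can be checked before a table is added. The following is a minimal sketch in base R. The helper check_annot_tb is hypothetical and not a customCMPdb function, and the existing table names are hard-coded here (in practice they would come from listAnnot()). It assumes that a name clash should be detected without regard to case, because annot_name is case insensitive.

```r
# Hypothetical pre-check for a custom annotation table (not part of customCMPdb)
check_annot_tb <- function(annot_tb, annot_name, existing) {
  problems <- character(0)
  if (!"chembl_id" %in% colnames(annot_tb))
    problems <- c(problems, "missing 'chembl_id' column")
  # Names are compared without regard to case, so "MyCustom" clashes with "mycustom"
  if (tolower(annot_name) %in% tolower(existing))
    problems <- c(problems, paste0("name '", annot_name, "' already in use"))
  problems
}

# Default table names as printed by listAnnot() in this vignette
existing <- c("DrugBankAnnot", "cmapAnnot", "drugAgeAnnot", "lincsAnnot")

annot_tb <- data.frame(cmp_name  = paste0("name", 1:3),
                       chembl_id = c("CHEMBL10", "CHEMBL100", NA),
                       stringsAsFactors = FALSE)

ok    <- check_annot_tb(annot_tb, "myCustom", existing)      # passes both checks
clash <- check_annot_tb(annot_tb, "DRUGAGEANNOT", existing)  # name differs only by case
ok
clash
```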
chembl_id <- c("CHEMBL1000309", "CHEMBL100014", "CHEMBL10",
               "CHEMBL100", "CHEMBL1000", NA)
annot_tb <- data.frame(cmp_name=paste0("name", 1:6),
        chembl_id=chembl_id,
        feature1=paste0("f", 1:6),
        feature2=rnorm(6))
addCustomAnnot(annot_tb, annot_name="myCustom")
The following shows how to delete custom annotation tables
by referencing them by their name. To obtain a list of custom
annotation tables present in the database, the listAnnot function
can be used.
listAnnot()
## [1] "DrugBankAnnot" "cmapAnnot"     "drugAgeAnnot"  "lincsAnnot"   
## [5] "myCustom"
deleteAnnot("myCustom")
listAnnot()
## [1] "DrugBankAnnot" "cmapAnnot"     "drugAgeAnnot"  "lincsAnnot"
The defaultAnnot function resets the SQLite annotation database to the
original version provided by customCMPdb. This is achieved by deleting the
existing (e.g. custom) database and downloading a fresh instance from
AnnotationHub.
defaultAnnot()
The queryAnnotDB function can be used to query the compound annotations from
the default resources as well as the custom resources stored in the SQLite
annotation database. The query can be a set of ChEMBL IDs. In this case it
returns a data.frame containing the annotations of the matching compounds
from the annotation resources selected under the annot
argument. The listAnnot function returns the names that can be assigned to
the annot argument.
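Judging from the output shown in this vignette, the returned data.frame resembles a join of the query IDs with each selected annotation table on the chembl_id column, where compounds missing from a table receive NA values. The following minimal sketch reproduces this idea with base R on two small tables copied by hand from that output. It illustrates the shape of the result only and makes no claim about how queryAnnotDB is implemented.

```r
# Toy excerpts copied by hand from the query output shown in this vignette
drugage <- data.frame(
  chembl_id = c("CHEMBL10", "CHEMBL113", "CHEMBL31574"),
  species   = c("Drosophila melanogaster", "Drosophila melanogaster",
                "Saccharomyces cerevisiae"),
  stringsAsFactors = FALSE
)
lincs <- data.frame(
  chembl_id  = c("CHEMBL10", "CHEMBL1004", "CHEMBL113"),
  pert_iname = c("SB-203580", "doxylamine", "caffeine"),
  stringsAsFactors = FALSE
)

query_id <- c("CHEMBL10", "CHEMBL1004", "CHEMBL113", "CHEMBL31574")

# Start from the query IDs and left-join each table on chembl_id
res <- Reduce(
  function(x, y) merge(x, y, by = "chembl_id", all.x = TRUE),
  list(drugage, lincs),
  data.frame(chembl_id = query_id, stringsAsFactors = FALSE)
)
res
```

Each query ID yields one row here, and columns from a resource that lacks the compound are filled with NA.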
query_id <- c("CHEMBL1064", "CHEMBL10", "CHEMBL113", "CHEMBL1004", "CHEMBL31574")
listAnnot()
## [1] "DrugBankAnnot" "cmapAnnot"     "drugAgeAnnot"  "lincsAnnot"
qres <- queryAnnotDB(query_id, annot=c("drugAgeAnnot", "lincsAnnot"))
qres
##     chembl_id                  species        strain     dosage
## 1    CHEMBL10  Drosophila melanogaster      Oregon R     300 µM
## 2  CHEMBL1004                     <NA>          <NA>       <NA>
## 3  CHEMBL1064  Drosophila melanogaster          <NA>     240 µM
## 4   CHEMBL113  Drosophila melanogaster      Oregon R 0.01 mg/ml
## 5 CHEMBL31574 Saccharomyces cerevisiae PSY316AT MAT_      10 µM
##   avg_lifespan_change max_lifespan_change gender significance      lincs_id
## 1                30.3                <NA>   <NA>         <NA> BRD-A37704979
## 2                <NA>                <NA>   <NA>         <NA> BRD-A44008656
## 3                  25                <NA>   <NA>         <NA> BRD-K22134346
## 4               -10.1                <NA>   Male           NS BRD-K02404261
## 5                  55                <NA>   <NA>         <NA>          <NA>
##    pert_iname is_touchstone                   inchi_key pubchem_cid
## 1   SB-203580             0 CDMGBJANTYXAIV-UHFFFAOYSA-N      176155
## 2  doxylamine             1 HCFDWZZGGLSKEP-UHFFFAOYSA-N        -666
## 3 simvastatin             0 RYMZZMVNJRMUDD-HGQWONQESA-N        -666
## 4    caffeine             1 RYYVLZVUVIJVGH-UHFFFAOYSA-N        -666
## 5        <NA>            NA                        <NA>        <NA>
# query the added custom annotation
addCustomAnnot(annot_tb, annot_name="myCustom")
qres2 <- queryAnnotDB(query_id, annot=c("lincsAnnot", "myCustom"))
qres2
##     chembl_id      lincs_id  pert_iname is_touchstone
## 1    CHEMBL10 BRD-A37704979   SB-203580             0
## 2  CHEMBL1004 BRD-A44008656  doxylamine             1
## 3  CHEMBL1064 BRD-K22134346 simvastatin             0
## 4   CHEMBL113 BRD-K02404261    caffeine             1
## 5 CHEMBL31574          <NA>        <NA>            NA
##                     inchi_key pubchem_cid cmp_name feature1  feature2
## 1 CDMGBJANTYXAIV-UHFFFAOYSA-N      176155    name3       f3 0.5089542
## 2 HCFDWZZGGLSKEP-UHFFFAOYSA-N        -666     <NA>     <NA>        NA
## 3 RYMZZMVNJRMUDD-HGQWONQESA-N        -666     <NA>     <NA>        NA
## 4 RYYVLZVUVIJVGH-UHFFFAOYSA-N        -666     <NA>     <NA>        NA
## 5                        <NA>        <NA>     <NA>     <NA>        NA
Since the supported compound databases use different identifiers, a ChEMBL ID mapping table is used to connect identical entries across databases as well as to link out to other resources such as ChEMBL itself or PubChem. For custom compounds, where ChEMBL IDs are not available yet, one can use alternative and/or custom identifiers.
query_id <- c("BRD-A00474148", "BRD-A00150179", "BRD-A00763758", "BRD-A00267231")
qres3 <- queryAnnotDB(chembl_id=query_id, annot=c("lincsAnnot"))
qres3
##         lincs_id          pert_iname is_touchstone                   inchi_key
## 2  BRD-A00150179 5-hydroxytryptophan             0 QSHLMQDRPXXYEE-UHFFFAOYSA-N
## 3  BRD-A00267231              hemado             1 KOCIMZNSNPOGOP-UHFFFAOYSA-N
## 5  BRD-A00474148       BRD-A00474148             0 RCGAUPRLRFZAMS-UHFFFAOYSA-N
## 10 BRD-A00763758       BRD-A00763758             0 MASIPYZIHWNUPA-UHFFFAOYSA-N
##    pubchem_cid
## 2       589768
## 3      4043357
## 5     44825297
## 10    43209100
The DrugAge database is manually curated by experts. It contains an extensive
compilation of drugs, compounds and supplements (including natural products and
nutraceuticals) with anti-aging properties that extend longevity in model
organisms (Barardo et al. 2017). The DrugAge database was downloaded as a CSV
file. The downloaded drugage.csv file contains the annotation columns
compound_name, synonyms, species, strain, dosage, avg_lifespan_change,
max_lifespan_change, gender, significance, and pubmed_id. Since DrugAge
identifies compounds only by name, the drug names need to be mapped to uniform
drug identifiers, such as ChEMBL IDs. In this package, the drug names have been
mapped semi-manually to ChEMBL (Gaulton et al. 2012),
PubChem (https://pubchem.ncbi.nlm.nih.gov/; Kim et al. 2019) and DrugBank IDs.
The mappings are stored in the drugage_id_mapping.tsv file under the
inst/extdata directory. Part of the ID mappings in this table was generated
programmatically for compound names that have ChEMBL IDs in the ChEMBL
database (version 24). Some of the missing IDs were then added semi-manually
with the help of a web service, and the remaining ones were mapped manually to
ChEMBL, PubChem and DrugBank IDs. Entries that are mixtures, such as green tea
extract, or peptides, such as Bacitracin, were flagged in the Comment column.
Finally, the drugage_id_mapping table was built into the annotation SQLite
database named compoundCollection_0.1.db with the buildDrugAgeDB function.
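The automated part of such a name-based mapping can be pictured as a case-insensitive match between DrugAge compound names and preferred names in a ChEMBL lookup table, with unmatched names left for manual curation. The sketch below illustrates that assumption and is not the code used to build drugage_id_mapping.tsv. The lookup table chembl_names and the helper norm_name are hypothetical; the name and ID pairs are taken from the query output shown earlier in this vignette.

```r
# DrugAge-style compound names (toy input; case and whitespace vary)
drugage_names <- c("Caffeine", " simvastatin", "Cordyceps sinensis oral liquid")

# Hypothetical lookup table; name/ID pairs copied from the query output above
chembl_names <- data.frame(
  pref_name = c("CAFFEINE", "SIMVASTATIN", "DOXYLAMINE"),
  chembl_id = c("CHEMBL113", "CHEMBL1064", "CHEMBL1004"),
  stringsAsFactors = FALSE
)

# Hypothetical normalizer: trim whitespace and ignore case
norm_name <- function(x) toupper(trimws(x))

idx <- match(norm_name(drugage_names), norm_name(chembl_names$pref_name))
mapping <- data.frame(
  compound_name = trimws(drugage_names),
  chembl_id     = chembl_names$chembl_id[idx],  # NA where no match was found
  stringsAsFactors = FALSE
)
mapping

# Names without a match are candidates for manual curation
unmatched <- mapping$compound_name[is.na(mapping$chembl_id)]
unmatched
```

In this toy input the mixture entry stays unmatched, which mirrors why mixtures and similar entries needed manual handling.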
The DrugBank annotation table was downloaded from the DrugBank database
as an XML file.
The most recent release at the time of writing this document is 5.1.5.
The extracted XML file was processed with the dbxml2df and df2SQLite functions
of this package: the file was loaded into R, converted to a data.frame, and
then stored in the compoundCollection SQLite annotation database.
There are 55 annotation columns in the DrugBank annotation table, such as
drugbank_id, name, description, cas-number, groups, indication,
pharmacodynamics, mechanism-of-action, toxicity, metabolism, half-life,
protein-binding, classification, synonyms, international-brands, packagers,
manufacturers, prices, dosages, atc-codes, fda-label, pathways, targets.
The DrugBank id to ChEMBL id mappings were obtained from
UniChem.
The CMAP02 annotation table was processed from the downloaded compound
instance table
using the buildCMAPdb function defined by this package. The CMAP02 instance table contains
the following drug annotation columns: instance_id, batch_id, cmap_name, INN1,
concentration (M), duration (h), cell2, array3, perturbation_scan_id,
vehicle_scan_id4, scanner, vehicle, vendor, catalog_number, catalog_name.
Drug names are used as drug identifiers. The buildCMAPdb function maps the drug
names to external drug IDs, including UniProt (The UniProt Consortium 2017),
PubChem, DrugBank and ChemBank (Seiler et al. 2008) IDs. It also adds
annotation columns such as directionality, ATC codes and SMILES structure.
The cmap.db SQLite database generated by the buildCMAPdb function contains both
the compound annotation table and structure information. The ChEMBL id mappings were
further added to the annotation table via PubChem CID to ChEMBL id mappings from
UniChem.
The CMAP02 annotation table was stored in the compoundCollection SQLite annotation
database. Then the CMAP internal IDs to ChEMBL id mappings were added to the ID
mapping table.
The LINCS compound annotation table was downloaded from GEO, and only compound
entries were retained. The annotation columns are lincs_id, pert_name,
pert_type, is_touchstone, inchi_key_prefix, inchi_key, canonical_smiles and pubchem_cid.
The annotation table was stored in the compoundCollection SQLite annotation database.
Since the annotation only contains LINCS ID to PubChem CID mappings, the LINCS IDs
were also mapped to ChEMBL IDs via their InChIKeys.
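An InChIKey-based mapping of this kind amounts to a join on the inchi_key column. The following brief sketch shows the idea with base R. The LINCS rows are copied by hand from output shown earlier in this vignette, while the lookup table chembl_keys is hypothetical and only contains pairs consistent with that output. It is not the code used to build the package data.

```r
# LINCS-style rows, copied by hand from output shown in this vignette
lincs <- data.frame(
  lincs_id  = c("BRD-K02404261", "BRD-A44008656", "BRD-A00474148"),
  inchi_key = c("RYYVLZVUVIJVGH-UHFFFAOYSA-N",
                "HCFDWZZGGLSKEP-UHFFFAOYSA-N",
                "RCGAUPRLRFZAMS-UHFFFAOYSA-N"),
  stringsAsFactors = FALSE
)

# Hypothetical InChIKey-to-ChEMBL lookup (deliberately incomplete toy table)
chembl_keys <- data.frame(
  inchi_key = c("RYYVLZVUVIJVGH-UHFFFAOYSA-N", "HCFDWZZGGLSKEP-UHFFFAOYSA-N"),
  chembl_id = c("CHEMBL113", "CHEMBL1004"),
  stringsAsFactors = FALSE
)

# Left join on the full InChIKey; LINCS compounds without a hit keep NA
lincs_mapped <- merge(lincs, chembl_keys, by = "inchi_key", all.x = TRUE)
lincs_mapped[, c("lincs_id", "chembl_id")]
```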
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] RSQLite_2.2.1        AnnotationHub_2.22.0 BiocFileCache_1.14.0
## [4] dbplyr_1.4.4         BiocGenerics_0.36.0  ChemmineR_3.42.0    
## [7] customCMPdb_1.0.0    BiocStyle_2.18.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5                    rsvg_2.1                     
##  [3] png_0.1-7                     assertthat_0.2.1             
##  [5] digest_0.6.27                 mime_0.9                     
##  [7] R6_2.4.1                      stats4_4.0.3                 
##  [9] evaluate_0.14                 httr_1.4.2                   
## [11] ggplot2_3.3.2                 pillar_1.4.6                 
## [13] rlang_0.4.8                   curl_4.3                     
## [15] blob_1.2.1                    magick_2.5.0                 
## [17] S4Vectors_0.28.0              DT_0.16                      
## [19] rmarkdown_2.5                 stringr_1.4.0                
## [21] htmlwidgets_1.5.2             RCurl_1.98-1.2               
## [23] bit_4.0.4                     munsell_0.5.0                
## [25] shiny_1.5.0                   compiler_4.0.3               
## [27] httpuv_1.5.4                  xfun_0.18                    
## [29] pkgconfig_2.0.3               base64enc_0.1-3              
## [31] htmltools_0.5.0               tidyselect_1.1.0             
## [33] tibble_3.0.4                  gridExtra_2.3                
## [35] interactiveDisplayBase_1.28.0 bookdown_0.21                
## [37] IRanges_2.24.0                XML_3.99-0.5                 
## [39] crayon_1.3.4                  dplyr_1.0.2                  
## [41] later_1.1.0.1                 bitops_1.0-6                 
## [43] rappdirs_0.3.1                grid_4.0.3                   
## [45] xtable_1.8-4                  gtable_0.3.0                 
## [47] lifecycle_0.2.0               DBI_1.1.0                    
## [49] magrittr_1.5                  scales_1.1.1                 
## [51] stringi_1.5.3                 promises_1.1.1               
## [53] ellipsis_0.3.1                generics_0.0.2               
## [55] vctrs_0.3.4                   rjson_0.2.20                 
## [57] tools_4.0.3                   bit64_4.0.5                  
## [59] Biobase_2.50.0                glue_1.4.2                   
## [61] purrr_0.3.4                   BiocVersion_3.12.0           
## [63] fastmap_1.0.1                 yaml_2.2.1                   
## [65] AnnotationDbi_1.52.0          colorspace_1.4-1             
## [67] BiocManager_1.30.10           memoise_1.1.0                
## [69] knitr_1.30
Barardo, Diogo, Daniel Thornton, Harikrishnan Thoppil, Michael Walsh, Samim Sharifi, Susana Ferreira, Andreja Anžič, et al. 2017. “The DrugAge Database of Aging-Related Drugs.” Aging Cell 16 (3). Wiley Online Library:594–97. http://onlinelibrary.wiley.com/doi/10.1111/acel.12585/full.
Cao, Yiqun, Anna Charisi, Li-Chang Cheng, Tao Jiang, and Thomas Girke. 2008. “ChemmineR: A Compound Mining Framework for R.” Bioinformatics 24 (15):1733–4. http://dx.doi.org/10.1093/bioinformatics/btn307.
Gaulton, Anna, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, et al. 2012. “ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery.” Nucleic Acids Res. 40 (Database issue):D1100–7. http://dx.doi.org/10.1093/nar/gkr777.
Kim, Sunghwan, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, et al. 2019. “PubChem 2019 Update: Improved Access to Chemical Data.” Nucleic Acids Res. 47 (D1):D1102–D1109. http://dx.doi.org/10.1093/nar/gky1033.
Lamb, Justin, Emily D Crawford, David Peck, Joshua W Modell, Irene C Blat, Matthew J Wrobel, Jim Lerner, et al. 2006. “The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease.” Science 313 (5795):1929–35. http://dx.doi.org/10.1126/science.1132939.
Seiler, Kathleen Petri, Gregory A George, Mary Pat Happ, Nicole E Bodycombe, Hyman A Carrinski, Stephanie Norton, Steve Brudz, et al. 2008. “ChemBank: A Small-Molecule Screening and Cheminformatics Resource Database.” Nucleic Acids Res. 36 (Database issue):D351–9. http://dx.doi.org/10.1093/nar/gkm843.
Subramanian, Aravind, Rajiv Narayan, Steven M Corsello, David D Peck, Ted E Natoli, Xiaodong Lu, Joshua Gould, et al. 2017. “A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles.” Cell 171 (6):1437–1452.e17. http://dx.doi.org/10.1016/j.cell.2017.10.049.
The UniProt Consortium. 2017. “UniProt: The Universal Protein Knowledgebase.” Nucleic Acids Res. 45 (D1):D158–D169. http://dx.doi.org/10.1093/nar/gkw1099.
Wishart, David S, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, et al. 2018. “DrugBank 5.0: A Major Update to the DrugBank Database for 2018.” Nucleic Acids Res. 46 (D1):D1074–D1082. http://dx.doi.org/10.1093/nar/gkx1037.