_customCMPdb_: Integrating Community and Custom Compound Collections

suppressPackageStartupMessages({
  library(customCMPdb); library(ChemmineR)
})

Introduction

This package serves as a query interface for important community collections of small molecules, while also allowing users to include custom compound collections. Both annotation and structure information are provided. The annotation data are stored in an SQLite database, while the structure information is stored in structure-data files (SDF). Both are hosted on Bioconductor's AnnotationHub. A detailed description of the included data types is provided under the Supplemental Material section of this vignette. At the time of writing, the following community databases are included: DrugAge, DrugBank, CMAP02 and LINCS.

In addition to providing access to the above compound collections, the package supports the integration of custom compound collections, which are automatically stored for the user in the same data structure as the preconfigured databases. Both custom collections and those provided by this package can be queried in a uniform manner, and then further analyzed with cheminformatics packages such as ChemmineR, where SDFs are imported into flexible S4 containers [@Cao2008-np].

Installation and Loading

As a Bioconductor package, customCMPdb can be installed with the BiocManager::install() function.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("customCMPdb")

To obtain the most recent updates of the package immediately, one can also install it directly from GitHub as follows.

devtools::install_github("yduan004/customCMPdb", build_vignettes=TRUE)

Next, the package needs to be loaded into the user's R session.

library(customCMPdb)
library(help = "customCMPdb")  # Lists package info

The vignette of this package can be opened as follows.

browseVignettes("customCMPdb")  # Opens vignette

Overview

The following introduces how to load and query the different datasets.

DrugAge Annotations

The compound annotation tables are stored in an SQLite database. This data can be loaded into a user's R session as follows (here for drugAgeAnnot).

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("customCMPdb", "annot_0.1"))
annot_path <- ah[["AH79563"]]
library(RSQLite)
conn <- dbConnect(SQLite(), annot_path)
dbListTables(conn)
drugAgeAnnot <- dbReadTable(conn, "drugAgeAnnot")
head(drugAgeAnnot)
dbDisconnect(conn)
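
When only part of a table is needed, an SQLite connection can also be queried with SQL instead of reading the full table. The following is a minimal sketch of that idea on a small in-memory database. The table name toyAnnot and all of its values are made up for illustration and are not taken from the real drugAgeAnnot table.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for an annotation table (all values are hypothetical).
toy_annot <- data.frame(
  chembl_id = c("CHEMBL10", "CHEMBL100", "CHEMBL1000"),
  compound_name = c("cmpA", "cmpB", "cmpC"),
  stringsAsFactors = FALSE
)

# An in-memory database avoids touching the cached AnnotationHub file.
toy_conn <- dbConnect(SQLite(), ":memory:")
dbWriteTable(toy_conn, "toyAnnot", toy_annot)

# Parameterized query: fetch only the rows for one identifier.
toy_hit <- dbGetQuery(
  toy_conn,
  "SELECT * FROM toyAnnot WHERE chembl_id = ?",
  params = list("CHEMBL100")
)
dbDisconnect(toy_conn)
toy_hit
```

The same dbGetQuery pattern should apply to a connection opened on annot_path, provided the table and column names are first checked with dbListTables and dbListFields.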

DrugAge SDF

The corresponding structures for the above DrugAge example can be loaded into an SDFset object as follows.

query(ah, c("customCMPdb", "drugage_build2"))
da_path <- ah[["AH79564"]]
da_sdfset <- ChemmineR::read.SDFset(da_path)

Instructions on how to work with SDFset objects are provided in the ChemmineR vignette. For instance, one can plot any of the loaded structures with the plot function.

ChemmineR::cid(da_sdfset) <- ChemmineR::sdfid(da_sdfset)
ChemmineR::plot(da_sdfset[1])

DrugBank SDF

The SDF from DrugBank can be loaded into R the same way. The corresponding SDF file was downloaded from DrugBank. During the import into R, ChemmineR checks the validity of the imported compounds.

query(ah, c("customCMPdb", "drugbank_5.1.5"))
db_path <- ah[["AH79565"]]
db_sdfset <- ChemmineR::read.SDFset(db_path)

CMAP SDF

The import of the SDF of the CMAP02 database works the same way.

query(ah, c("customCMPdb", "cmap02"))
cmap_path <- ah[["AH79566"]]
cmap_sdfset <- ChemmineR::read.SDFset(cmap_path)

LINCS SDF

The same applies to the SDF of the small molecules included in the LINCS database.

query(ah, c("customCMPdb", "lincs_pilot1"))
lincs_path <- ah[["AH79567"]]
lincs_sdfset <- ChemmineR::read.SDFset(lincs_path)

For reproducibility, the R code for generating the above datasets is included in the inst/scripts/make-data.R file of this package. The file location on a user's system can be obtained with system.file("scripts/make-data.R", package="customCMPdb").

Custom Annotation Database

Load Annotation Database

The SQLite annotation database is hosted on Bioconductor's AnnotationHub. Users can download it to a local AnnotationHub cache directory. The path to the cached database file can be obtained as follows.

library(AnnotationHub)
ah <- AnnotationHub()
annot_path <- ah[["AH79563"]]

Add Custom Annotation Tables

The following shows how users can import their own compound annotation tables into the SQLite database. The corresponding ChEMBL IDs need to be included under the chembl_id column. The name of the custom data set is specified under the annot_name argument. Note that this name is case insensitive.

chembl_id <- c("CHEMBL1000309", "CHEMBL100014", "CHEMBL10",
               "CHEMBL100", "CHEMBL1000", NA)
annot_tb <- data.frame(cmp_name=paste0("name", 1:6),
        chembl_id=chembl_id,
        feature1=paste0("f", 1:6),
        feature2=rnorm(6))
addCustomAnnot(annot_tb, annot_name="myCustom")
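
Before adding a custom table, it can help to confirm that it has the required chembl_id column and to see how many IDs are missing or duplicated. The following is a minimal sketch of such a pre-check. The helper check_annot_tb is hypothetical and not part of customCMPdb, and it makes no assumption about how addCustomAnnot itself treats missing or duplicated IDs.

```r
# Hypothetical helper (not part of customCMPdb): basic sanity checks on a
# custom annotation table before passing it to addCustomAnnot().
check_annot_tb <- function(tb) {
  stopifnot(is.data.frame(tb))
  if (!"chembl_id" %in% colnames(tb)) {
    stop("annotation table needs a 'chembl_id' column")
  }
  ids <- tb$chembl_id
  list(
    n_rows = nrow(tb),
    n_missing_id = sum(is.na(ids)),
    n_duplicated_id = sum(duplicated(ids[!is.na(ids)]))
  )
}

# Small example table with one repeated and one missing ID.
example_tb <- data.frame(
  cmp_name = paste0("name", 1:4),
  chembl_id = c("CHEMBL10", "CHEMBL100", "CHEMBL10", NA),
  stringsAsFactors = FALSE
)
tb_summary <- check_annot_tb(example_tb)
tb_summary
```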

Delete Custom Annotation Tables

The following shows how to delete custom annotation tables by referencing them by their name. To obtain a list of custom annotation tables present in the database, the listAnnot function can be used.

listAnnot()
deleteAnnot("myCustom")
listAnnot()

Set to Default

The defaultAnnot function sets the annotation SQLite database back to the original version provided by customCMPdb. This is achieved by deleting the existing (e.g. custom) database and re-downloading a fresh instance from AnnotationHub.

defaultAnnot()

Query Annotation Database

The queryAnnotDB function can be used to query the compound annotations from the default resources as well as the custom resources stored in the SQLite annotation database. The query can be a set of ChEMBL IDs. In this case, it returns a data.frame containing the annotations of the matching compounds from the annotation resources specified under the annot argument. The listAnnot function returns the names that can be assigned to the annot argument.

query_id <- c("CHEMBL1064", "CHEMBL10", "CHEMBL113", "CHEMBL1004", "CHEMBL31574")
listAnnot()
qres <- queryAnnotDB(query_id, annot=c("drugAgeAnnot", "lincsAnnot"))
qres
# query the added custom annotation
addCustomAnnot(annot_tb, annot_name="myCustom")
qres2 <- queryAnnotDB(query_id, annot=c("lincsAnnot", "myCustom"))
qres2

Since the supported compound databases use different identifiers, a ChEMBL ID mapping table is used to connect identical entries across databases as well as to link out to other resources such as ChEMBL itself or PubChem. For custom compounds, where ChEMBL IDs are not available yet, one can use alternative and/or custom identifiers.

query_id <- c("BRD-A00474148", "BRD-A00150179", "BRD-A00763758", "BRD-A00267231")
qres3 <- queryAnnotDB(chembl_id=query_id, annot=c("lincsAnnot"))
qres3
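
Conceptually, linking entries through an ID mapping table amounts to a join on a shared identifier column. The following is a minimal sketch of that idea with base R. Both tables, their identifiers and their pairings are made up for illustration, and the sketch does not claim to reproduce how queryAnnotDB is implemented.

```r
# Toy ID mapping table; identifiers and pairings are made up for illustration.
id_map <- data.frame(
  chembl_id = c("CHEMBL_X1", "CHEMBL_X2", "CHEMBL_X3"),
  lincs_id = c("BRD-TOY-0001", "BRD-TOY-0002", NA),
  stringsAsFactors = FALSE
)

# Toy annotation table keyed by the database-specific identifier.
toy_lincs <- data.frame(
  lincs_id = c("BRD-TOY-0001", "BRD-TOY-0002"),
  pert_name = c("toyA", "toyB"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of the mapping table and fill in NA where no
# matching annotation exists.
joined <- merge(id_map, toy_lincs, by = "lincs_id", all.x = TRUE)
joined
```

Keeping all rows of the mapping table (all.x = TRUE) makes compounds without an annotation visible as NA rather than silently dropping them.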

Supplemental Material

Description of Four Annotation Tables in SQLite Database

The DrugAge database is manually curated by experts. It contains an extensive compilation of drugs, compounds and supplements (including natural products and nutraceuticals) with anti-aging properties that extend longevity in model organisms [@Barardo2017-xk]. The DrugAge database was downloaded as a CSV file. The downloaded drugage.csv file contains the annotation columns compound_name, synonyms, species, strain, dosage, avg_lifespan_change, max_lifespan_change, gender, significance, and pubmed_id. Since the DrugAge database only uses drug names as identifiers, the names need to be mapped to uniform drug identifiers such as ChEMBL IDs. In this package, the drug names have been mapped semi-manually to ChEMBL [@Gaulton2012-ji], PubChem (https://pubchem.ncbi.nlm.nih.gov/) [@Kim2019-tg] and DrugBank IDs, and the result is stored as drugage_id_mapping.tsv under the inst/extdata directory. Part of the ID mappings in this table were generated by the processDrugage function for compound names that have ChEMBL IDs in the ChEMBL database (version 24). Further IDs were added semi-manually with the help of a web service, and the remaining entries were then mapped manually to ChEMBL, PubChem and DrugBank IDs. Entries that are mixtures, such as green tea extract, or peptides, such as Bacitracin, were commented. The drugage_id_mapping table was then built into the annotation SQLite database, named compoundCollection_0.1.db, by the buildDrugAgeDB function.

The DrugBank annotation table was downloaded from the DrugBank database as an XML file. The most recent release version at the time of writing this document is 5.1.5. The dbxml2df and df2SQLite functions of this package were used to load the extracted XML file into R, convert it to a data.frame, and store it in the compoundCollection SQLite annotation database. There are 55 annotation columns in the DrugBank annotation table, including drugbank_id, name, description, cas-number, groups, indication, pharmacodynamics, mechanism-of-action, toxicity, metabolism, half-life, protein-binding, classification, synonyms, international-brands, packagers, manufacturers, prices, dosages, atc-codes, fda-label, pathways, and targets. The DrugBank ID to ChEMBL ID mappings were obtained from UniChem.

The CMAP02 annotation table was processed from the downloaded compound instance table using the buildCMAPdb function defined by this package. The CMAP02 instance table contains the following drug annotation columns: instance_id, batch_id, cmap_name, INN1, concentration (M), duration (h), cell2, array3, perturbation_scan_id, vehicle_scan_id4, scanner, vehicle, vendor, catalog_number, catalog_name. Drug names are used as drug identifiers. The buildCMAPdb function maps the drug names to external drug IDs including UniProt [@The_UniProt_Consortium2017-bx], PubChem, DrugBank and ChemBank [@Seiler2008-dw] IDs. It also adds annotation columns such as directionality, ATC codes and SMILES structure. The cmap.db SQLite database generated by the buildCMAPdb function contains both the compound annotation table and structure information. The ChEMBL ID mappings were further added to the annotation table via PubChem CID to ChEMBL ID mappings from UniChem. The CMAP02 annotation table was stored in the compoundCollection SQLite annotation database. The mappings from CMAP internal IDs to ChEMBL IDs were then added to the ID mapping table.

The LINCS compound annotation table was downloaded from GEO, and only compound entries were selected. The annotation columns are lincs_id, pert_name, pert_type, is_touchstone, inchi_key_prefix, inchi_key, canonical_smiles, and pubchem_cid. The annotation table was stored in the compoundCollection SQLite annotation database. Since the annotation only contains LINCS ID to PubChem CID mappings, the LINCS IDs were also mapped to ChEMBL IDs via their InChIKeys.

Session Info

sessionInfo()

References




customCMPdb documentation built on Nov. 8, 2020, 5:40 p.m.