
Authors: Johannes Rainer
Modified: r file.info("create-compounddb.Rmd")$mtime
Compiled: r date()

Introduction

Chemical compound annotations can be retrieved from a variety of sources, including HMDB, LipidMaps and ChEBI. The CompoundDb package provides functionality to extract the data relevant for (chromatographic) peak annotation in metabolomics/lipidomics experiments from these sources and to store it in a common format (i.e. a CompDb object/database). This vignette describes how such CompDb objects can be created, using package-internal test files that represent small data subsets from some annotation resources.

Compound annotations are represented in R by the CompDb object. Each object (respectively its database) is supposed to contain and provide annotations from a single source (e.g. HMDB or LipidMaps), but it is also possible to create cross-source databases.

Create a CompDb object for HMDB

The CompoundDb package provides the compound_tbl_sdf and compound_tbl_lipidblast functions to extract relevant compound annotations from files in SDF (structure-data file) format or from the json files from LipidBlast (http://mona.fiehnlab.ucdavis.edu/downloads). CompoundDb can process SDF files from resources such as HMDB, ChEBI, LipidMaps and MoNa (for MoNa, see the dedicated section below).

Below we use the compound_tbl_sdf function to extract compound annotations from an SDF file representing a very small subset of the HMDB database. To generate a database for the full HMDB we would have to download the structures.sdf file containing all metabolites and load that file instead.

library(CompoundDb)

## Locate the file
hmdb_file <- system.file("sdf/HMDB_sub.sdf.gz", package = "CompoundDb")
## Extract the data
cmps <- compound_tbl_sdf(hmdb_file)

By default the function returns a tibble (a data.frame equivalent from the tidyverse's tibble package).

cmps

The tibble contains one row per compound, with columns that include compound_id, compound_name, formula and mass.

To create a simple compound database, we could pass this tibble, along with the required metadata, to the createCompDb function. In the present example, however, we also want to add MS/MS spectrum data to the database. We thus load below the MS/MS spectra for some of the compounds from the respective xml files downloaded from HMDB. To this end we pass the path to the folder containing the files to the msms_spectra_hmdb function. The function identifies the xml files containing MS/MS spectra based on their file names and loads the respective spectrum data. The folder can therefore also contain other files, but the xml files from HMDB should not be renamed or the function will not recognize them. Note also that at present only MS/MS spectrum xml files from HMDB are supported (one xml file per spectrum); these can be downloaded from HMDB as the hmdb_all_spectra.zip file.

## Locate the folder with the xml files
xml_path <- system.file("xml", package = "CompoundDb")
spctra <- msms_spectra_hmdb(xml_path)

Finally, we have to create the metadata for the resource. The metadata of a CompDb resource is crucial, as it defines the origin of the annotations and their version. This information should thus be carefully defined by the user. Below we use the make_metadata helper function to create a data.frame in the expected format. The organism should be provided in a format such as "Hsapiens" for human or "Mmusculus" for mouse, i.e. a capital first letter followed by lower case characters, without whitespace.

metad <- make_metadata(source = "HMDB", url = "http://www.hmdb.ca",
                       source_version = "4.0", source_date = "2017-09",
                       organism = "Hsapiens")
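
The organism naming convention described above can be checked before the metadata is created. The following is a minimal sketch in base R; is_organism_format is a hypothetical helper and not part of CompoundDb, which may perform its own validation.

```r
## Hypothetical helper (not part of CompoundDb): tests whether a string
## follows the convention "one capital letter followed by lower case
## letters only", i.e. no whitespace, digits or punctuation.
is_organism_format <- function(x) {
    grepl("^[[:upper:]][[:lower:]]+$", x)
}

## The first two follow the convention, the last two do not
is_organism_format(c("Hsapiens", "Mmusculus", "H sapiens", "hsapiens"))
```

Such a check only covers the formatting of the string; it does not verify that the name refers to an actual organism.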

With all the required data ready we create the SQLite database for the HMDB subset. With path we specify the path to the directory in which we want to save the database. This defaults to the current working directory, but for this example we save the database into a temporary folder.

db_file <- createCompDb(cmps, metadata = metad, msms_spectra = spctra,
                        path = tempdir())

The variable db_file now contains the file name of the SQLite database. We can pass this file name to the CompDb function to get the CompDb object that acts as the interface to the database.

cmpdb <- CompDb(db_file)
cmpdb

To extract all compounds from the database we can use the compounds function. The columns parameter allows us to choose which database columns to return. Columns from any of the database tables are supported. To get an overview of the available database tables and their columns, the tables function can be used:

tables(cmpdb)

Below we extract only selected columns from the compounds table.

compounds(cmpdb, columns = c("compound_name", "formula", "mass"))

Analogously, we can use the Spectra function to extract spectrum data from the database. By default the function returns a Spectra object from the Spectra package, with all spectra metadata available as spectra variables.

library(Spectra)
sps <- Spectra(cmpdb)
sps

The available spectra variables for the Spectra object can be retrieved with spectraVariables:

spectraVariables(sps)

Individual spectra variables can be accessed with the $ operator:

sps$collision_energy

The actual m/z and intensity values can be accessed with the mz and intensity functions:

mz(sps)

## m/z of the 2nd spectrum
mz(sps)[[2]]
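
The intensity function works the same way as mz. The sketch below illustrates this on a small hand-made Spectra object, so that it runs without the database; the object toy and all peak values are made up for illustration.

```r
library(Spectra)

## Toy data: two hand-made MS2 spectra (all values are invented)
spd <- S4Vectors::DataFrame(msLevel = c(2L, 2L))
spd$mz <- list(c(50.1, 75.2, 100.3), c(60.4, 120.5))
spd$intensity <- list(c(10, 200, 35), c(150, 20))
toy <- Spectra(spd)

## Intensities of the 2nd spectrum; they match mz(toy)[[2]] element by element
intensity(toy)[[2]]

## Combine both into a two-column peak matrix for the 2nd spectrum
cbind(mz = mz(toy)[[2]], intensity = intensity(toy)[[2]])
```

With the database-backed object from this vignette, intensity(sps)[[2]] would be used in the same way.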

Note that it is also possible to retrieve specific spectra, e.g. for a given compound, or to add compound annotations to the Spectra object. Below we use the filter expression ~ compound_id == "HMDB0000001" to get only the MS/MS spectra for the specified compound. In addition we ask for the "compound_name" and "inchi_key" of the compound.

sps <- Spectra(cmpdb, filter = ~ compound_id == "HMDB0000001",
               columns = c(tables(cmpdb)$msms_spectrum, "compound_name",
                           "inchi_key"))
sps

The available spectra variables:

spectraVariables(sps)

The compound's name and InChI key have thus also been added as spectra variables:

sps$inchi_key

To share or archive the CompDb database created this way, we can also create a dedicated R package containing the annotation. To enable reproducible research, each CompDb package should contain the version of the originating data source in its file name (which is by default extracted from the metadata of the resource). Below we create a CompDb package from the generated database file. In addition to the database file, we have to provide the package creator/maintainer and the package version.

createCompDbPackage(
    db_file, version = "0.0.1", author = "J Rainer", path = tempdir(),
    maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>")

The function creates a folder (in our case in a temporary directory) that can be built and installed with R CMD build and R CMD INSTALL.

Special care should also be taken with the license of the package, which can be passed with the license parameter. The license of the package, and whether and how the package can be distributed, will also depend on the license of the originating resource.

Create a CompDb object for MoNa

MoNa (MassBank of North America) provides a large SDF file with all spectra, which can be used to create a CompDb object with compound information and MS/MS spectra. Note however that MoNa is organized by spectra and that the annotation of the compounds is neither consistent nor normalized. Spectra from the same compound can have their own compound identifier and compound data that can differ, e.g., in the chemical formula, the precision of the exact mass or other fields.

Similar to the example above, compound annotations can be imported with the compound_tbl_sdf function, while spectrum data can be imported with msms_spectra_mona. In the example below we instead use the import_mona_sdf function, which wraps both functions and reads compound and spectrum data from an SDF file without having to import the file twice. As an example we use a small subset of a MoNa SDF file that contains only 7 spectra.

mona_sub <- system.file("sdf/MoNa_export-All_Spectra_sub.sdf.gz",
                        package = "CompoundDb")
mona_data <- import_mona_sdf(mona_sub)

As a result we get a list with a data.frame each for compound and spectrum information. These can be passed along to the createCompDb function to create the database (see below).

metad <- make_metadata(source = "MoNa",
                       url = "http://mona.fiehnlab.ucdavis.edu/",
                       source_version = "2018.11", source_date = "2018-11",
                       organism = "Unspecified")
mona_db_file <- createCompDb(mona_data$compound, metadata = metad,
                             msms_spectra = mona_data$msms_spectrum,
                             path = tempdir())

We can now load and use this database, e.g. by extracting all compounds as shown below.

mona <- CompDb(mona_db_file)
compounds(mona)

As stated in the introduction of this section, the compound table contains redundant information and has essentially one row per spectrum. Feedback on how to reduce the redundancy in the compound table is highly appreciated.
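
One possible way to gauge the redundancy described above is to count how many rows share the same structure identifier and how consistently they are annotated. The following base R sketch uses an invented stand-in table; the column names are assumed for illustration, and all identifiers, keys and values are hypothetical and not taken from MoNa.

```r
## Toy stand-in for a spectrum-centric compound table; all IDs, keys and
## values are invented for illustration only.
cmp <- data.frame(
    compound_id = c("C1", "C2", "C3", "C4"),
    inchi_key = c("KEY-A", "KEY-A", "KEY-B", "KEY-A"),
    formula = c("C6H12O6", "C6H12O6", "C3H7NO2", "C6 H12 O6"),
    mass = c(180.0634, 180.06, 89.0477, 180.063),
    stringsAsFactors = FALSE
)

## Number of rows sharing the same structure key
table(cmp$inchi_key)

## Number of distinct formula strings per key; values above 1 flag
## inconsistent annotations for what is presumably the same compound
tapply(cmp$formula, cmp$inchi_key, function(z) length(unique(z)))
```

Whether a structure key is a suitable grouping variable for the real data would have to be checked, since it may be missing or inconsistent for some entries.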

Session information

sessionInfo()


michaelwitting/CompoundDb documentation built on April 29, 2020, 8:42 p.m.