Working with PubChemR to Access Chemical Data
In PubChemR: Interface to the 'PubChem' Database for Chemical Data Retrieval

1. Introduction

The PubChemR package is designed for R users who need to interact with the PubChem database, a free resource from the National Center for Biotechnology Information (NCBI). PubChem is a key repository of chemical and biological data, including information on chemical structures, identifiers, chemical and physical properties, biological activities, patents, health, safety, toxicity data, and much more.

This package simplifies the process of accessing and manipulating this vast array of data directly from R, making it a valuable resource for chemists, biologists, bioinformaticians, and researchers in related fields. In this vignette, we will explore the various functionalities offered by the PubChemR package. Each function is designed to allow users to efficiently retrieve specific types of data from PubChem. We will cover how to install and load the package, provide detailed descriptions of each function, and demonstrate their usage with practical examples.

2. Installation

The PubChemR package can be installed either from the Comprehensive R Archive Network (CRAN) or directly from its GitHub repository, so users can choose between the stable CRAN release and the latest development version, which may include newer features and fixes.

Installing from CRAN

For most users, installing PubChemR from CRAN is the recommended method as it ensures a stable and tested version of the package. You can install it using the standard R package installation command:

install.packages("PubChemR")

This command will download and install the PubChemR package along with any dependencies it requires. Once installed, you can load the package in your R session as follows:

library(PubChemR)
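
In scripts that may run on machines where the package is not yet present, the install step can be made conditional. The helper below is a minimal sketch rather than part of PubChemR: the name `ensure_installed` and its `installer` argument are our own, introduced only so the logic can be checked without network access.

```r
# Install a package only when it is not already available.
# `ensure_installed` and `installer` are hypothetical names used for this sketch.
ensure_installed <- function(pkg, installer = utils::install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    installer(pkg)
  }
  # Report whether the package can be loaded after the (possible) install.
  requireNamespace(pkg, quietly = TRUE)
}

# Typical use at the top of a script:
# ensure_installed("PubChemR")
# library(PubChemR)
```

Because the function returns `TRUE` or `FALSE`, a script can stop early with a clear message when the installation did not succeed.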

Installing the Development Version from GitHub

For users who are interested in the latest features and updates that might not yet be available on CRAN, the development version of PubChemR can be installed from GitHub. This version is likely to include recent enhancements and bug fixes but may also be less stable than the CRAN release.

To install the development version, you will first need to install the devtools package, which provides functions to install packages directly from GitHub and other sources. You can install devtools from CRAN using:

install.packages("devtools")

Once devtools is installed, you can install the development version of PubChemR using:

devtools::install_github("selcukorkmaz/PubChemR")

This command downloads and installs the package from the specified GitHub repository. After installation, load the package as usual:

library(PubChemR)

3. Implementation

The PubChemR package offers a suite of functions designed to interact with the PubChem database, allowing users to retrieve and manipulate chemical data efficiently. Below is an overview of the main functions provided by the package:

3.1. Retrieving AIDs with get_aids()

The get_aids function is designed to retrieve Assay IDs (AIDs) from the PubChem database. This function is useful for accessing detailed assay data related to specific compounds or substances, which is crucial in fields such as pharmacology, biochemistry, and molecular biology.

The function supports a range of identifiers including integers (e.g., CID and SID) and strings (e.g., name, SMILES, InChIKey and formula). Users can specify the namespace and domain for the query, as well as the type of search to be performed (e.g., substructure, superstructure, similarity, identity).

Here are the main parameters of the function:

identifier: A vector of positive integers (e.g. cid, sid) or identifier strings (name, smiles, inchikey, formula).
namespace: Specifies the type of identifier provided.
domain: Specifies the domain of the query.
searchtype: Specifies the type of search to be performed.
options: Additional arguments.

Retrieving AIDs by CID

In this example, we retrieve AIDs for the compounds with CIDs (Compound IDs) 2244 (aspirin), 2519 (caffeine) and 3672 (ibuprofen):

aids_by_cid <- get_aids(
  identifier = c(2244, 2519, 3672),
  namespace = "cid",
  domain = "compound"
)

aids_by_cid

The above code retrieves AIDs for the compounds with CIDs 2244, 2519 and 3672. The output shows the request details including the domain (Compound), namespace (Compound ID), and identifier (2244, 2519, ... and 1 more). This provides a summary of the query performed.

To retrieve the AIDs associated with these compounds, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

aids <- AIDs(object = aids_by_cid, .to.data.frame = TRUE)
aids

The output is a tibble (data frame) with two columns: CID and AID. The CID column contains the compound IDs (2244, 2519 and 3672), and the AID column contains the Assay IDs.

table(aids$CID)

There are 8,831 rows in total: 3,195 assays related to aspirin, 2,352 related to caffeine and 3,284 related to ibuprofen.
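
Beyond counting rows per compound, a CID/AID table also shows which assays the compounds have in common. The sketch below uses a small hand-made data frame with the same two columns; the AID values are invented for illustration and are not PubChem results.

```r
# Illustrative stand-in for the tibble returned by AIDs(); AID values are made up.
aids_demo <- data.frame(
  CID = c(2244, 2244, 2244, 2519, 2519, 3672, 3672, 3672),
  AID = c(1, 3, 9, 1, 9, 1, 3, 9)
)

# Number of distinct assays per compound.
assays_per_cid <- tapply(aids_demo$AID, aids_demo$CID, function(x) length(unique(x)))

# Assays in which every compound appears: intersect the per-compound AID sets.
shared_aids <- Reduce(intersect, split(aids_demo$AID, aids_demo$CID))

assays_per_cid
shared_aids
```

The same two lines should work unchanged on the real `aids` object, since they rely only on the CID and AID columns described above.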

Retrieving AIDs by SID

In this example, we retrieve Assay IDs for the substances with SIDs (Substance IDs) 103414350 and 103204295:

aids_by_sid <- get_aids(
  identifier = c(103414350, 103204295),
  namespace = "sid",
  domain = "substance"
)

aids_by_sid

The above code retrieves Assay IDs for the substances with SIDs (Substance IDs) 103414350 and 103204295. The output shows the request details including the domain (Substance), namespace (Substance ID), and identifier (103414350, 103204295). This provides a summary of the query performed.

To retrieve the Assay IDs associated with the SIDs 103414350 and 103204295, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

AIDs(object = aids_by_sid, .to.data.frame = TRUE)

The output is a tibble (data frame) with two columns: SID and AID. The SID column contains the substance IDs (103414350 and 103204295), and the AID column contains the Assay IDs. There are a total of 8 rows, with 5 assays related to 103414350 and 3 assays related to 103204295.

Retrieving AIDs by Name

In this example, we retrieve Assay IDs for the compounds with the names paracetamol, naproxen, and diclofenac:

aids_by_name <- get_aids(
  identifier = c("paracetamol", "naproxen", "diclofenac"),
  namespace = "name",
  domain = "compound"
)

aids_by_name

The output shows the request details including the domain (Compound), namespace (Name), and identifiers (paracetamol, naproxen, and diclofenac). This provides a summary of the query performed.

To retrieve the Assay IDs associated with the compound names, we use the AIDs function on the result:

aids <- AIDs(object = aids_by_name, .to.data.frame = TRUE)
aids

The output is a tibble with three columns: NAME, CID and AID. The NAME column includes compound names, the CID column contains the compound IDs, and the AID column contains the assay IDs.

table(aids$NAME)

There are 5,192 rows in total: 1,593 assays related to diclofenac, 1,542 related to naproxen and 2,057 related to paracetamol.
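
Because the result carries both NAME and CID, it is easy to check whether each name resolved to a single compound. The following sketch uses invented CID and AID values in a data frame shaped like the output above; it only illustrates the check and says nothing about how these names resolve in PubChem.

```r
# Illustrative stand-in for the NAME / CID / AID tibble; all IDs are placeholders.
name_aids_demo <- data.frame(
  NAME = c("paracetamol", "paracetamol", "naproxen", "naproxen", "diclofenac"),
  CID  = c(101, 101, 202, 203, 303),
  AID  = c(1, 2, 1, 5, 2)
)

# How many distinct CIDs did each name resolve to?
cids_per_name <- tapply(
  name_aids_demo$CID, name_aids_demo$NAME,
  function(x) length(unique(x))
)

# Names that map to more than one compound deserve a closer look.
ambiguous <- names(cids_per_name)[cids_per_name > 1]
ambiguous
```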

Retrieving AIDs by SMILES

In this example, we retrieve Assay IDs (AIDs) for aspirin using its SMILES representation:

aids_by_smiles <- get_aids(
  identifier = "CC(=O)OC1=CC=CC=C1C(=O)O",
  namespace = "smiles",
  domain = "compound"
)

aids_by_smiles

The above code retrieves AIDs for aspirin with the SMILES notation CC(=O)OC1=CC=CC=C1C(=O)O. The domain is set to compound and the namespace is set to smiles to indicate that the identifier is a SMILES string.

To extract the AIDs associated with the SMILES representation, we use the AIDs function on the result:

AIDs(object = aids_by_smiles, .to.data.frame = TRUE)

The output is a tibble with three columns: SMILES, CID and AID. The SMILES column includes SMILES representation of aspirin, the CID column contains the compound ID of aspirin, and the AID column contains the related assay IDs.

Retrieving AIDs by InChIKey

In this example, we retrieve Assay IDs for the compound with InChIKey (International Chemical Identifier Key) GALPCCIBXQLXSH-UHFFFAOYSA-N:

aids_by_inchikey <- get_aids(
  identifier = "GALPCCIBXQLXSH-UHFFFAOYSA-N",
  namespace = "inchikey",
  domain = "compound"
)

aids_by_inchikey

The above code retrieves Assay IDs for the compound with InChIKey GALPCCIBXQLXSH-UHFFFAOYSA-N. The output shows the request details including the domain (Compound), namespace (INCHI Key), and identifier (GALPCCIBXQLXSH-UHFFFAOYSA-N). This provides a summary of the query performed.

To retrieve the Assay IDs associated with the InChIKey, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

AIDs(object = aids_by_inchikey, .to.data.frame = TRUE)

The output is a tibble (data frame) with three columns: INCHIKEY, CID, and AID. The INCHIKEY column contains the InChIKey (GALPCCIBXQLXSH-UHFFFAOYSA-N in this case), the CID column contains the compound ID (44375542), and the AID column contains the Assay IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 5 rows in total, indicating the assays related to the compound.

Retrieving AIDs by Formula

In this example, we retrieve Assay IDs for compounds with the molecular formula C15H12N2O2:

aids_by_formula <- get_aids(
  identifier = "C15H12N2O2",
  namespace = "formula",
  domain = "compound"
)

aids_by_formula

The above code retrieves Assay IDs for compounds with the molecular formula C15H12N2O2. The output shows the request details including the domain (Compound), namespace (Formula), and identifier (C15H12N2O2). This provides a summary of the query performed.

To retrieve the Assay IDs associated with this formula, we use the AIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

AIDs(object = aids_by_formula, .to.data.frame = TRUE)

The output is a tibble (data frame) with three columns: FORMULA, CID, and AID. The FORMULA column contains the molecular formula (C15H12N2O2), the CID column contains the compound ID, and the AID column contains the Assay IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 50,116 rows in total, indicating a comprehensive list of assays related to compounds with the specified molecular formula.
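
A formula query can match many compounds, so a useful first step is to count the distinct CIDs and rank them by number of assays. The sketch below does this on a small invented table with the same three columns; the CID and AID values are placeholders.

```r
# Illustrative stand-in for the FORMULA / CID / AID tibble; IDs are placeholders.
formula_aids_demo <- data.frame(
  FORMULA = "C15H12N2O2",
  CID = c(11, 11, 11, 22, 33, 33),
  AID = c(1, 2, 3, 1, 2, 4)
)

# Number of distinct compounds that share the formula.
n_compounds <- length(unique(formula_aids_demo$CID))

# Compounds ranked by how many assay rows they have.
assay_counts <- sort(table(formula_aids_demo$CID), decreasing = TRUE)

n_compounds
head(assay_counts, 2)
```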

3.2. Retrieving CIDs with get_cids()

The get_cids function is designed to retrieve Compound IDs (CIDs) from the PubChem database. This function is particularly useful for users who need to obtain the unique identifiers assigned to chemical substances within PubChem.

The function queries the PubChem database using various identifiers such as names, formulas, or other chemical identifiers. It then extracts the corresponding CIDs and returns them in a structured format. This makes it a versatile tool for researchers working with chemical data.

Here are the main parameters of the function:

identifier: A vector of identifiers for which CIDs are to be retrieved. These can be integers (e.g., CID, SID, AID) or strings (e.g., name, SMILES, InChIKey).
namespace: Specifies the type of identifier provided. It can be ‘cid’, ‘name’, ‘smiles’, ‘inchi’, etc.
domain: The domain of the query, typically ‘compound’.
searchtype: The type of search to be performed, such as ‘substructure’ or ‘similarity’.
options: Additional arguments passed to the internal get_json function.

Retrieving CIDs by Name

In this example, we retrieve Compound IDs for the compounds with the names aspirin, caffeine, and ibuprofen:

cids_by_name <- get_cids(
  identifier = c("aspirin", "caffeine", "ibuprofen"),
  namespace = "name",
  domain = "compound"
)

cids_by_name

The above code retrieves Compound IDs for the compounds named aspirin, caffeine, and ibuprofen. The output shows the request details including the domain (Compound), namespace (Name), and identifiers (aspirin, caffeine, ibuprofen). This provides a summary of the query performed.

To retrieve the Compound IDs associated with the compound names, we use the CIDs function on the result:

CIDs(object = cids_by_name)

The CIDs function call on the result extracts the Compound IDs associated with the compound names. The output is a tibble with two columns: Name and CID. The Name column contains the compound names, and the CID column contains the Compound IDs. This tibble format makes it easy to handle and analyze the data in R.
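
A name-to-CID table is most useful as a lookup that attaches readable names to other results keyed by CID. The sketch below joins two hand-made data frames with base R's `merge`; the CIDs are the ones used earlier in this vignette, while the AID values are invented.

```r
# Illustrative stand-ins; in practice these would come from CIDs() and AIDs().
cid_tbl <- data.frame(
  Name = c("aspirin", "caffeine", "ibuprofen"),
  CID  = c(2244, 2519, 3672)
)
aid_tbl <- data.frame(
  CID = c(2244, 2244, 2519, 3672),
  AID = c(10, 20, 10, 30)  # made-up assay IDs
)

# Attach compound names to the assay table by joining on CID.
named_aids <- merge(cid_tbl, aid_tbl, by = "CID")
named_aids
```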

Retrieving CIDs by SMILES

In this example, we retrieve Compound IDs (CIDs) for a compound using its SMILES representation:

cids_by_smiles <- get_cids(
  identifier = "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O",
  namespace = "smiles",
  domain = "compound"
)

cids_by_smiles

The above code retrieves CIDs for the compound with the SMILES notation C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O. The domain is set to compound and the namespace is set to smiles to indicate that the identifier is a SMILES string.

To extract the CIDs associated with the SMILES representation, we use the CIDs function on the result:

CIDs(object = cids_by_smiles)

The CIDs function call on the result extracts the CIDs associated with the SMILES notation C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O. The output is a tibble with two columns: SMILES and CID. The SMILES column contains the SMILES notation, and the CID column contains the Compound IDs. This output shows that the specified compound is associated with CID 5793.

Retrieving CIDs by InChIKey

In this example, we retrieve Compound IDs (CIDs) for a compound using its InChIKey:

cids_by_inchikey <- get_cids(
  identifier = "HEFNNWSXXWATRW-UHFFFAOYSA-N",
  namespace = "inchikey",
  domain = "compound"
)

cids_by_inchikey

The above code retrieves CIDs for the compound with the InChIKey HEFNNWSXXWATRW-UHFFFAOYSA-N. The domain is set to compound and the namespace is set to inchikey to indicate that the identifier is an InChIKey.

To extract the CIDs associated with the InChIKey, we use the CIDs function on the result:

CIDs(object = cids_by_inchikey)

The CIDs function call on the result extracts the CIDs associated with the InChIKey HEFNNWSXXWATRW-UHFFFAOYSA-N. The output is a tibble with two columns: INCHI Key and CID. The INCHI Key column contains the InChIKey, and the CID column contains the Compound IDs. This output shows that the specified compound is associated with CID 3672.
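
Since a mistyped InChIKey simply returns no match, it can help to check the shape of the string before querying. A standard InChIKey has 27 characters: 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and one final letter. The helper `is_inchikey` below is our own sketch, not a PubChemR function, and it checks the format only, not whether the key exists in PubChem.

```r
# Check that strings have the 14-10-1 uppercase layout of a standard InChIKey.
# `is_inchikey` is a hypothetical helper written for this example.
is_inchikey <- function(x) {
  grepl("^[A-Z]{14}-[A-Z]{10}-[A-Z]$", x, perl = TRUE)
}

keys <- c(
  "HEFNNWSXXWATRW-UHFFFAOYSA-N",  # well formed
  "hefnnwsxxwatrw-uhfffaoysa-n",  # lower case
  "HEFNNWSXXWATRW"                # truncated
)
is_inchikey(keys)
```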

Retrieving CIDs by Formula

In this example, we retrieve Compound IDs (CIDs) for compounds with the molecular formula C15H12N2O2:

cids_by_formula <- get_cids(
  identifier = "C15H12N2O2",
  namespace = "formula",
  domain = "compound"
)

cids_by_formula

The above code retrieves Compound IDs for compounds with the molecular formula C15H12N2O2. The output shows the request details including the domain (Compound), namespace (Formula), and identifier (C15H12N2O2). This provides a summary of the query performed.

To retrieve the Compound IDs associated with this formula, we use the CIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

CIDs(object = cids_by_formula, .to.data.frame = TRUE)

The output is a tibble (data frame) with two columns: Formula and CID. The Formula column contains the molecular formula (C15H12N2O2), and the CID column contains the Compound IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 5,032 rows in total, indicating a comprehensive list of compounds related to the specified molecular formula.

3.3. Retrieving SIDs with get_sids()

The get_sids function is designed to retrieve Substance IDs (SIDs) from the PubChem database. This function is essential for users who need to identify unique identifiers assigned to specific chemical substances or mixtures in PubChem.

The get_sids function queries the PubChem database using various identifiers and extracts the corresponding SIDs. It is capable of handling multiple identifiers and returns a structured tibble (data frame) containing the SIDs along with the original identifiers. This makes it a versatile tool for researchers working with chemical data.

Here are the main parameters of the function:

identifier: A vector specifying the identifiers for which SIDs are to be retrieved. These can be numeric or character vectors.
namespace: Specifies the type of identifier provided, with ‘cid’ as the default.
domain: The domain of the query, typically ‘compound’.
searchtype: Specifies the type of search to be performed, if applicable.
options: Additional arguments passed to the internal get_json function.

Retrieving SIDs by CID

In this example, we retrieve Substance IDs (SIDs) for the compounds with CIDs (Compound IDs) 2244, 2519 and 3672:

sids_by_cid <- get_sids(
  identifier = c(2244, 2519, 3672),
  namespace = "cid",
  domain = "compound"
)

sids_by_cid

The above code retrieves Substance IDs for the compounds with CIDs (Compound IDs) 2244, 2519 and 3672. The output shows the request details including the domain (Compound), namespace (Compound ID), and identifiers (2244, 2519 and 3672). This provides a summary of the query performed.

To retrieve the Substance IDs associated with these compound IDs, we use the SIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

sids <- SIDs(object = sids_by_cid, .to.data.frame = TRUE)
sids

The output is a tibble (data frame) with two columns: Compound ID and SID. The Compound ID column contains the compound IDs, and the SID column contains the Substance IDs.

table(sids$`Compound ID`)

There are 1,288 rows in total, indicating 400 substances related to the compound ID 2244, 486 substances related to the compound ID 2519, and 402 substances related to the compound ID 3672.
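
The column name `Compound ID` contains a space, which is why the call above wraps it in backticks. The sketch below shows the two usual ways to reach such a column, plus a rename for later convenience, on a small data frame whose SID values are invented.

```r
# Illustrative stand-in for the tibble returned by SIDs(); SID values are made up.
# check.names = FALSE keeps the space in the column name.
sids_demo <- data.frame(
  "Compound ID" = c(2244, 2244, 2519, 3672, 3672, 3672),
  SID = c(1001, 1002, 1003, 1004, 1005, 1006),
  check.names = FALSE
)

# Two equivalent ways to access a column with a non-syntactic name.
counts_a <- table(sids_demo$`Compound ID`)
counts_b <- table(sids_demo[["Compound ID"]])

# Renaming the column avoids the backticks in later code.
names(sids_demo)[names(sids_demo) == "Compound ID"] <- "CID"
counts_a
```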

Retrieving SIDs by AID

In this example, we retrieve Substance IDs (SIDs) for the assay with AID (Assay ID) 1234:

sids_by_aids <- get_sids(
  identifier = "1234",
  namespace = "aid",
  domain = "assay"
)

sids_by_aids

The above code retrieves Substance IDs for the assay with AID (Assay ID) 1234. The output shows the request details including the domain (Assay), namespace (Assay ID), and identifier (1234). This provides a summary of the query performed.

To retrieve the Substance IDs associated with the assay ID 1234, we use the SIDs function on the result. This getter function returns the results either as a tibble (data frame) or as a list, depending on the .to.data.frame argument.

SIDs(object = sids_by_aids, .to.data.frame = TRUE)

The output is a tibble (data frame) with two columns: Assay ID and SID. The Assay ID column contains the assay ID (1234 in this case), and the SID column contains the Substance IDs. This tibble format makes it easy to analyze and manipulate the data in R. There are 61 rows in total, indicating a list of substances related to the assay.

Retrieving SIDs by Name

In this example, we retrieve Substance IDs for the compound with the name aspirin:

sids <- get_sids(
  identifier = "aspirin",
  namespace = "name",
  domain = "compound"
)

sids

The above code retrieves Substance IDs for the compound named aspirin. The output shows the request details including the domain (Compound), namespace (Name), and identifier (aspirin). This provides a summary of the query performed.

To retrieve the Substance IDs associated with the compound name aspirin, we use the SIDs function on the result:

SIDs(object = sids)

The SIDs function call on the result extracts the Substance IDs associated with the compound name aspirin. The output is a tibble with two columns: SID and Name. The SID column contains the Substance IDs, and the Name column contains the compound name (aspirin in this case). This tibble format makes it easy to handle and analyze the data in R. There are 2,356 rows in total, indicating a comprehensive list of substances related to the compound name aspirin.

Retrieving SIDs by SMILES

In this example, we retrieve Substance IDs (SIDs) for a compound using its SMILES representation:

sids_by_smiles <- get_sids(
  identifier = "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O",
  namespace = "smiles",
  domain = "compound"
)

sids_by_smiles

SIDs(object = sids_by_smiles)

Retrieving SIDs by InChIKey

In this example, we retrieve Substance IDs (SIDs) for a compound using its InChIKey:

sids_by_inchikey <- get_sids(
  identifier = "BPGDAMSIGCZZLK-UHFFFAOYSA-N",
  namespace = "inchikey",
  domain = "compound"
)

sids_by_inchikey

The above code retrieves SIDs for the compound with the InChIKey BPGDAMSIGCZZLK-UHFFFAOYSA-N. The domain is set to compound and the namespace is set to inchikey to indicate that the identifier is an InChIKey.

To extract the SIDs associated with the InChIKey, we use the SIDs function on the result:

SIDs(object = sids_by_inchikey)

The SIDs function call on the result extracts the SIDs associated with the InChIKey BPGDAMSIGCZZLK-UHFFFAOYSA-N. The output is a tibble with two columns: INCHI Key and SID. The INCHI Key column contains the InChIKey, and the SID column contains the Substance IDs. This output shows that the specified compound is associated with 93 substance entries, each represented by a SID.

Retrieving SIDs by Formula

In this example, we retrieve Substance IDs (SIDs) for compounds with the molecular formula C15H12N2O2:

sids_by_formula <- get_sids(
  identifier = "C15H12N2O2",
  namespace = "formula",
  domain = "compound"
)

sids_by_formula

SIDs(object = sids_by_formula, .to.data.frame = TRUE)

3.4. Retrieving Assay Data with get_assays()

The get_assays function is designed to retrieve biological assay data from the PubChem database. This function is particularly useful for researchers and scientists who need descriptive information about various biological assays.

The function queries the PubChem database using specified identifiers and returns a list of assay data. It is capable of fetching various assay information, including experimental data, results, and methodologies.

Here are the main parameters of the function:

identifier: A vector of positive integers specifying the assay identifiers (AIDs) for which data are to be retrieved.
operation: The operation to be performed on the input records, defaulting to NULL. Expected operations: record, concise, aids, sids, cids, description, targets/<target type>, <doseresponse>, summary, classification.
options: Additional parameters for the query, currently not affecting the results.

Retrieving Assays by AIDs

In this example, we retrieve assay data for several specific AIDs:

assay_data <- get_assays(
  identifier = c(485314, 485341, 504466, 624202, 651820), 
  namespace = "aid"
)

assay_data

The above code retrieves assay data for multiple AIDs. The output shows the request details, including the domain (Assay), namespace (Assay ID), and identifiers. It also provides instructions on how to retrieve specific instances from the complete list and view all requested instance identifiers.

To view the request arguments:

request_args(object = assay_data)

To retrieve detailed information about a specific assay (e.g., 651820), you can use the instance function on the result:

aid_651820 <- instance(object = assay_data, .which = 651820)
aid_651820

The instance function call on the result extracts detailed information about the specific assay, including experimental data, results, and methodologies. This information is crucial for understanding the biological activity and properties of the compounds tested in the assay.

To extract specific details from the assay data, you can use the retrieve function with various slots:

retrieve(object = aid_651820, .slot = "aid", .to.data.frame = TRUE)

This code extracts the Assay ID and version of the assay, providing a concise summary of the assay's unique identifier and its version in the PubChem database.

retrieve(object = aid_651820, .slot = "aid_source", .to.data.frame = TRUE)

This code retrieves the source information for the assay, including the name of the source and the source ID, which helps in identifying the origin of the assay data.

retrieve(object = aid_651820, .slot = "name", .to.data.frame = FALSE)

This code extracts the name of the assay, providing a clear description of the assay's purpose and target.

retrieve(object = aid_651820, .slot = "description", .to.data.frame = FALSE, .verbose = TRUE)

This code retrieves the detailed description of the assay, including its purpose, the challenges addressed, and the methodology used. This is crucial for understanding the context and rationale behind the assay.

retrieve(object = aid_651820, .slot = "protocol", .to.data.frame = FALSE, .verbose = TRUE)

This code retrieves the detailed protocol for conducting the assay, providing step-by-step instructions, including the materials needed, preparation steps, and the assay procedure. This is crucial for replicating the experiment and ensuring consistent results.

retrieve(object = aid_651820, .slot = "comment", .to.data.frame = FALSE, .verbose = TRUE)

This code retrieves additional contextual information and detailed criteria for evaluating the activity of compounds in the assay. In this specific case, it includes the PUBCHEM_ACTIVITY_OUTCOME and PUBCHEM_ACTIVITY_SCORE, which help in interpreting the assay results and determining the activity level of the compounds tested.

retrieve(object = aid_651820, .slot = "xref", .to.data.frame = FALSE)

This code retrieves external references related to the assay, such as links to relevant publications and additional assay IDs. This helps in contextualizing the assay within the broader scientific literature and finding related studies.

retrieve(object = aid_651820, .slot = "results", .to.data.frame = TRUE)

This code retrieves a tibble with detailed experimental results, including EC50 values, activation percentages, and other key metrics. This data is essential for analyzing the performance of the compounds in the assay and making informed conclusions about their biological activity.

retrieve(object = aid_651820, .slot = "revision", .to.data.frame = FALSE)

This code retrieves the revision number of the assay data, indicating the version of the data retrieved. This helps track changes and updates to the assay information over time.

retrieve(object = aid_651820, .slot = "activity_outcome_method", .to.data.frame = FALSE)

This code retrieves the method used to determine the activity outcome of the compounds in the assay. This information is crucial for understanding the criteria and process used to classify the compounds’ activity levels.

retrieve(object = aid_651820, .slot = "project_category", .to.data.frame = FALSE)

This code retrieves the category of the project under which the assay was conducted. This helps in identifying the broader context and objectives of the research project associated with the assay.
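
The calls above repeat the same pattern with a different slot each time, so they can be collected in a loop. The helper below is a sketch under stated assumptions: `retrieve_slots` is our own name, and the retrieval function is passed in as an argument so the mechanics can be shown offline with a stand-in object. With the package loaded, that argument would be `retrieve`, called with the argument names used in this section.

```r
# Collect several slots in one pass and return them as a named list.
# `retrieve_slots` is a hypothetical helper, not part of PubChemR.
retrieve_slots <- function(object, slots, retrieve_fun) {
  out <- lapply(slots, function(s) {
    tryCatch(
      retrieve_fun(object = object, .slot = s, .to.data.frame = FALSE),
      error = function(e) NULL  # a slot that cannot be retrieved becomes NULL
    )
  })
  names(out) <- slots
  out
}

# Stand-in object and accessor, used only to demonstrate the loop offline.
fake_assay <- list(name = "demo assay", revision = 1L)
fake_retrieve <- function(object, .slot, .to.data.frame = FALSE) {
  if (is.null(object[[.slot]])) stop("unknown slot: ", .slot)
  object[[.slot]]
}

res <- retrieve_slots(fake_assay, c("name", "revision", "missing"), fake_retrieve)
str(res)

# With a live result from this section, one might call:
# retrieve_slots(aid_651820, c("name", "revision", "project_category"), retrieve)
```

Returning `NULL` for a failed slot keeps the loop going; whether a missing slot actually raises an error in PubChemR is an assumption of this sketch.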

3.5. Retrieving Compound Data with get_compounds()

The get_compounds function is designed to streamline the process of retrieving detailed compound data from the extensive PubChem database. This function is an invaluable tool for chemists, biologists, pharmacologists, and researchers who require comprehensive chemical compound information for their scientific investigations and analyses.

The function interfaces directly with the PubChem database, allowing users to query and retrieve a wide array of data on chemical compounds. Upon execution, the function returns a list containing detailed information about each queried compound. This information can encompass various aspects such as:

Chemical Structures: Detailed representations of the molecular structure of compounds.
Chemical Properties: Information on physical and chemical properties such as molecular weight, boiling point, melting point, solubility, and more.
Biological Activities: Data on the biological activities and effects of the compounds, including bioassay results.
Synonyms and Identifiers: A comprehensive list of alternative names and identifiers for the compounds.
Safety and Toxicity Information: Data on the safety and potential toxicity of the compounds.

Here are the main parameters of the function:

identifier: A vector specifying the compound identifiers. These identifiers can be either positive integers (such as CIDs, which are unique compound identifiers in PubChem) or identifier strings (such as chemical names, SMILES strings, InChI, etc.). This parameter allows for flexible input methods tailored to the specific needs of the user.
namespace: Specifies the type of identifier provided in the identifier parameter. Common values for this parameter include:
"cid" (Compound Identifier)
"name" (Chemical Name)
"smiles" (Simplified Molecular Input Line Entry System)
"inchi" (International Chemical Identifier)
"sdf" (Structure-Data File)
operation: An optional parameter specifying the operation to be performed on the input records. This can include operations such as filtering, sorting, or transforming the data based on specific criteria. By default, this parameter is set to NULL, indicating no additional operations are performed.
searchtype: An optional parameter that defines the type of search to be conducted. This can be used to refine and specify the search strategy, such as exact match, substructure search, or similarity search. By default, this parameter is set to NULL, indicating a general search.
options: A list of additional parameters that can be used to customize the query further. This can include options such as result limits, output formats, and other advanced settings to tailor the data retrieval process to specific requirements.

Retrieving Compounds by CIDs

In this example, we retrieve compound data for specific CIDs (Compound IDs) 2244 and 5245:

compound_data <- get_compounds(
  identifier = c(2244, 5245),
  namespace = "cid"
)

compound_data

The above code retrieves compound data for the compounds with CIDs 2244 and 5245. The output shows the request details, including the domain (Compound), namespace (Compound ID), and identifiers. It also provides instructions on how to retrieve specific instances from the complete list and view all requested instance identifiers.

To view the request arguments:

request_args(object = compound_data)

To retrieve detailed information about a specific compound, you can use the instance function on the result:

compound_2244 <- instance(object = compound_data, .which = 2244)
compound_2244

The instance function call on the result extracts detailed information about the specific compound, including chemical structures, properties, and identifiers.

To retrieve specific data elements from the compound data, you can use the retrieve function with the relevant slots:

retrieve(object = compound_2244, .slot = "id", .to.data.frame = TRUE)

The retrieve function call with the id slot extracts the compound identifier (CID) for the specific compound. In this case, the CID is 2244, confirming the identity of the compound.

retrieve(object = compound_2244, .slot = "atoms", .to.data.frame = FALSE)

The retrieve function call with the atoms slot extracts information about the atoms in the compound. The output includes two vectors: aid, representing the atom IDs, and element, representing the atomic numbers of the elements. For example, element 8 represents oxygen, and element 6 represents carbon.
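
Atomic numbers are easier to read once mapped to element symbols. The sketch below does so for an element vector written by hand to match aspirin's formula, C9H8O4; a live result would supply this vector instead, and the lookup covers only the elements this example needs.

```r
# Element vector written by hand for C9H8O4: 4 oxygens, 9 carbons, 8 hydrogens.
element <- c(rep(8, 4), rep(6, 9), rep(1, 8))

# Small lookup from atomic number to element symbol.
symbols <- c("1" = "H", "6" = "C", "7" = "N", "8" = "O")

# Translate each atom and tally the atoms per element.
atom_symbols <- unname(symbols[as.character(element)])
atom_counts <- table(atom_symbols)
atom_counts
```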

retrieve(object = compound_2244, .slot = "bonds", .to.data.frame = FALSE)

The retrieve function call with the bonds slot extracts information about the bonds in the compound. The output includes three vectors: aid1 and aid2 represent the atom IDs involved in each bond, and order represents the bond order (e.g., single, double bonds).
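
The three bond vectors are enough to derive simple per-atom summaries, such as the number of bonded neighbours. The sketch below uses a hand-written bond table for a three-atom molecule (O=C=O) and assumes that order is coded numerically, with 1 for a single bond and 2 for a double bond.

```r
# Hand-written bond table for O=C=O; atom 1 is the carbon. Illustrative only.
bonds <- data.frame(
  aid1  = c(1, 1),
  aid2  = c(2, 3),
  order = c(2, 2)  # assumed coding: 1 = single, 2 = double
)

# Each bond touches both of its atoms, so stack the two endpoint columns.
endpoints <- c(bonds$aid1, bonds$aid2)
orders <- c(bonds$order, bonds$order)

n_neighbours <- table(endpoints)                  # bonded neighbours per atom
bond_order_sum <- tapply(orders, endpoints, sum)  # total bond order per atom

n_neighbours
bond_order_sum
```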

retrieve(object = compound_2244, .slot = "coords", .to.data.frame = FALSE)

The retrieve function call with the coords slot extracts the coordinates of the atoms in the compound. The output includes details such as:

type: Represents the type of coordinates.
aid: Atom IDs for which the coordinates are provided.
conformers: Contains the conformer data, including x and y coordinates for each atom. These coordinates describe the two-dimensional layout of the atoms, which is useful for drawing the structure and inspecting how the atoms are arranged relative to one another.
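
Once atom IDs, bonds, and x and y coordinates are available, they can be combined by atom ID. The following sketch computes the length of each bond in the layout. All values are hand-written for illustration, and the data frames coords and bonds are hypothetical containers, not the exact structure returned by retrieve.

```r
# Hand-written 2D coordinates for three atoms (not from a live request).
coords <- data.frame(
  aid = c(1L, 2L, 3L),
  x   = c(0, 3, 3),
  y   = c(0, 4, 0)
)
bonds <- data.frame(aid1 = c(1L, 2L), aid2 = c(2L, 3L))

# Look up the coordinates of both ends of each bond by atom ID.
from <- coords[match(bonds$aid1, coords$aid), ]
to   <- coords[match(bonds$aid2, coords$aid), ]

# Euclidean distance in the layout; these are layout units and need not
# correspond to physical bond lengths.
bonds$length <- sqrt((from$x - to$x)^2 + (from$y - to$y)^2)
print(bonds)
```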

retrieve(object = compound_2244, .slot = "props", .to.data.frame = TRUE)

The retrieve function call with the props slot extracts detailed properties of the compound, including information such as label, name, data type, release, value, implementation, version, software, and source. This comprehensive information covers various physical, chemical, and structural properties of the compound.

retrieve(object = compound_2244, .slot = "count", .to.data.frame = TRUE)

The retrieve function call with the count slot extracts various count metrics for the compound. The output includes a tibble with two columns: Name and Value. This information includes:

heavy_atom: The number of heavy atoms in the compound.
atom_chiral, atom_chiral_def, atom_chiral_undef: Counts of chiral atoms and their defined/undefined states.
bond_chiral, bond_chiral_def, bond_chiral_undef: Counts of chiral bonds and their defined/undefined states.
isotope_atom: The number of isotopic atoms.
covalent_unit: The number of covalent units in the compound.
tautomers: The number of tautomers.

These counts provide insights into the compound's chemical complexity and stereochemistry, which are essential for understanding its reactivity and biological activity.

3.6. Retrieving Substance Data with get_substances()

The get_substances function retrieves substance data from the PubChem database based on a specified identifier and namespace. This function is crucial for obtaining detailed information about a substance, including its various identifiers, sources, synonyms, comments, cross-references, and compound details.

Here are the main parameters of the function:

identifier: A character or numeric vector specifying the identifiers for the request. This can be a substance ID (SID), name, or other supported identifier.
namespace: Specifies the namespace for the request. The default value is ‘sid’.
operation: Specifies the operation to be performed on the input records. The default value is NULL.
searchtype: Specifies the type of search to be performed. The default value is NULL.
options: Additional parameters for the query. These can be used to customize the search further.

Retrieving Substances by Name

In this example, we retrieve substance data for aspirin:

substance_data <- get_substances(
  identifier = "aspirin",
  namespace = "name"
)

substance_data

The above code retrieves substance data for the identifier "aspirin". The output indicates that the request details include the domain (Substance), namespace (Name), and identifier (aspirin). It also mentions that you can run the instance(...) function to extract specific instances and request_args(...) to see all requested instance identifiers.

To see the arguments used in the request, use the request_args function:

request_args(object = substance_data)

This output shows the namespace ("name"), identifier ("aspirin"), and domain ("substance") used in the request.

To extract specific substance data, we use the instance function with the specified identifier:

substance_aspirin <- instance(object = substance_data, .which = "aspirin")

substance_aspirin

The above output shows the request details for aspirin and indicates that 143 substances were retrieved. It lists the slots available for further data extraction. These slots include sid, source, synonyms, comment, xref, and compound.

To extract data from the sid slot as a data frame:

retrieve(object = substance_aspirin, .slot = "sid", .to.data.frame = TRUE)

This output shows the id and version for the substance "aspirin". The id is 4594 and the version is 10.

To extract data from the source slot as a data frame:

retrieve(object = substance_aspirin, .slot = "source", .to.data.frame = TRUE)

This output shows the source information for "aspirin". The source is KEGG, and the source ID is C01405.

To extract data from the synonyms slot:

retrieve(object = substance_aspirin, .slot = "synonyms", .to.data.frame = FALSE)

This output lists the synonyms for "aspirin". These include "2-Acetoxybenzenecarboxylic acid", "50-78-2", "Acetylsalicylate", "Acetylsalicylic acid", "Aspirin", and "C01405".

To extract data from the comment slot with verbosity:

retrieve(object = substance_aspirin, .slot = "comment", .to.data.frame = FALSE, .verbose = TRUE)

This output shows comments related to "aspirin". It indicates that "aspirin" is the same as D00109 and is a reactant of the enzyme EC: 3.1.1.55.

To extract data from the xref slot with verbosity:

retrieve(object = substance_aspirin, .slot = "xref", .to.data.frame = FALSE, .verbose = TRUE)

This output shows cross-references for "aspirin". It includes the source "regid" with value C01405, the source "rn" with value 50-78-2, the source "dburl" with the URL for the KEGG database, and the source "sburl" with a specific URL for the compound in the KEGG database.

To extract data from the compound slot:

retrieve(object = substance_aspirin, .slot = "compound", .to.data.frame = FALSE)

This output shows detailed compound data for "aspirin". It includes the atom IDs, elements, bond information, coordinates, and charge. Additionally, it provides an ID of the compound in PubChem (cid 2244).

Each section provides specific details about the substance "aspirin", making it possible to analyze different aspects of the substance data from the PubChem database.

3.7. Retrieving Chemical Properties with get_properties()

The get_properties function facilitates the retrieval of specific chemical properties of compounds from the PubChem database. This function is essential for researchers and chemists who require detailed chemical information about various compounds.

The function queries the PubChem database using specified identifiers and returns a list or dataframe containing the requested properties of each compound. These properties can include molecular weight, chemical formula, isomeric SMILES, and more, depending on the available data in PubChem and the properties requested. You may find the full list of properties at https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest#section=Compound-Property-Tables.

Here are the main parameters of the function:

properties: A character vector specifying the properties to be retrieved. This vector can include various chemical properties like mass, molecular formula, InChI, etc.
identifier: A vector of identifiers for the compounds. These identifiers can be either positive integers (such as CIDs, which are unique compound identifiers in PubChem) or identifier strings (such as chemical names, SMILES strings, InChI, etc.).
namespace: Specifies the type of identifier provided in the identifier parameter. The default value is 'cid'. Common values for this parameter include cid, name, smiles, and inchi.
searchtype: An optional parameter that defines the type of search to be conducted. This can be used to refine and specify the search strategy, such as exact match, substructure search, or similarity search. By default, this parameter is set to NULL, indicating a general search.
options: Additional arguments for the query. These can be used to customize the search further, but by default, it is set to NULL.
propertyMatch: A list that specifies matching criteria for the properties. It includes:
.ignore.case: A logical value indicating whether to ignore case when matching property names. Default is FALSE.
type: Specifies the type of match to be performed, such as "contain", "exact", "all". Default is "contain".

Retrieving Properties by Compounds

In this example, we retrieve properties for the compounds "aspirin" and "ibuprofen". The propertyMatch argument specifies the matching criteria: with .ignore.case = TRUE and type = "contain", the call returns every property whose name contains "mass", "molecular", or "inchi", regardless of case.

props <- get_properties(
  properties = c("mass", "molecular", "inchi"),
  identifier = c("aspirin", "ibuprofen"),
  namespace = "name",
  propertyMatch = list(
    .ignore.case = TRUE,
    type = "contain"
  )
)
props
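
The effect of a case-insensitive "contain" match can be pictured with plain pattern matching. The sketch below is an approximation of the idea, not the package's internal implementation: candidate_properties is a hand-picked subset of PubChem property names, and the matching is done with base R's grepl.

```r
# Hand-picked subset of PubChem property names (not the full list).
candidate_properties <- c(
  "MolecularFormula", "MolecularWeight", "InChI", "InChIKey",
  "IUPACName", "XLogP", "ExactMass", "MonoisotopicMass",
  "TPSA", "Complexity", "Charge"
)

# Patterns as passed in the properties argument above.
patterns <- c("mass", "molecular", "inchi")

# A name is kept if it contains at least one pattern, ignoring case.
hits <- Reduce(
  `|`,
  lapply(patterns, function(p) grepl(p, candidate_properties, ignore.case = TRUE))
)
matched <- candidate_properties[hits]
print(matched)
```

With these inputs the selection keeps the molecular formula, molecular weight, InChI, InChIKey, exact mass, and monoisotopic mass entries, which matches the set of properties described for aspirin and ibuprofen.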

To extract specific details from the property data, you can use the retrieve function with various slots:

retrieve(object = props, .which = "aspirin", .to.data.frame = TRUE)

This code extracts the properties of aspirin, providing a detailed summary of its CID, molecular formula, molecular weight, InChI, InChIKey, exact mass, and monoisotopic mass.

retrieve(object = props, .which = "ibuprofen", .to.data.frame = FALSE)

This code extracts the properties of ibuprofen and displays them as a list. The properties include CID, molecular formula, molecular weight, InChI, InChIKey, exact mass, and monoisotopic mass.

retrieve(object = props, .to.data.frame = TRUE, .combine.all = TRUE)

This code combines the properties of all retrieved compounds (aspirin and ibuprofen) into a single dataframe, making it easier to compare their properties side-by-side.

3.8. Retrieving Synonyms with get_synonyms()

The get_synonyms function is designed to retrieve synonyms for chemical compounds or substances from the PubChem database. It is particularly useful for obtaining various names and identifiers associated with a specific chemical entity.

The function queries the PubChem database for synonyms of a given identifier (such as a Compound ID or a chemical name) and returns a comprehensive list of alternative names and identifiers. This can include systematic names, trade names, registry numbers, and other forms of identification used in scientific literature and industry.

Here are the main parameters of the function:

identifier: The identifier for which synonyms are to be retrieved. This can be a numeric value (like a Compound ID) or a character string (like a chemical name).
namespace: Specifies the namespace for the query. Common values include:
‘cid’ (Compound Identifier) [default]
‘name’ (Chemical Name)
domain: Specifies the domain for the request. Typically, this is ‘compound’. The default value is ‘compound’.
searchtype: Specifies the type of search to be performed. The default value is NULL.
options: Additional arguments for customization of the request.

Retrieving Synonyms by Compound

In this example, we retrieve synonyms for the compound "aspirin":

synonyms <- get_synonyms(
  identifier = "aspirin",
  namespace = "name"
)

synonyms

The above code retrieves synonyms for the compound "aspirin" using its name as the identifier. The namespace is set to "name" to indicate that the identifier is a chemical name.

The output is a list of synonyms for the compound "aspirin". These synonyms include various names and identifiers associated with the compound in different contexts, such as:

Systematic names (e.g., "2-Acetoxybenzoic acid")
Trade names (e.g., "Ecotrin")
Registry numbers (e.g., "50-78-2")
Other alternative names (e.g., "Acetosalin", "Polopiryna")

The retrieved synonyms provide a comprehensive view of the different names and identifiers that can be used to reference the same chemical entity in scientific literature and industry.
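
Since a synonym list mixes names and registry numbers, it can help to separate the two. The sketch below picks out entries that look like CAS Registry Numbers and verifies their check digit. It runs on a hand-written vector containing the synonyms mentioned above, and is_cas_number is a hypothetical helper, not a PubChemR function.

```r
# Hypothetical helper: TRUE for strings that have the CAS number shape
# (2-7 digits, 2 digits, 1 check digit) and a valid check digit.
is_cas_number <- function(x) {
  looks_like_cas <- grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)
  vapply(seq_along(x), function(i) {
    if (!looks_like_cas[i]) return(FALSE)
    digits <- as.integer(strsplit(gsub("-", "", x[i]), "")[[1]])
    check <- digits[length(digits)]
    # Weight the remaining digits 1, 2, 3, ... counting from the right.
    body <- rev(digits[-length(digits)])
    sum(body * seq_along(body)) %% 10 == check
  }, logical(1))
}

# Hand-written vector with the synonyms mentioned in the text.
synonyms_example <- c(
  "2-Acetoxybenzoic acid", "Ecotrin", "50-78-2", "Acetosalin", "Polopiryna"
)

is_cas <- is_cas_number(synonyms_example)
print(synonyms_example[is_cas])
```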

3.9. Retrieving List of Depositors with get_all_sources()

The get_all_sources function facilitates the retrieval of a list of all current depositors for substances or assays from the PubChem database. This function is particularly useful for users who need to identify and analyze the sources of chemical data.

The function queries the PubChem database to obtain a comprehensive list of sources (such as laboratories, companies, or research institutions) that have contributed substance or assay data. This information can be crucial for researchers and professionals who are tracking the origin of specific chemical data or assessing the diversity of data sources in PubChem.

Here is the main parameter of the function:

domain: Specifies the domain for which sources are to be retrieved. The domain can be either ‘substance’ or ‘assay’. The default value is ‘substance’.

Retrieving All Sources by Substances

In this example, we retrieve all sources for substances:

substance_sources <- get_all_sources(
  domain = "substance"
)

substance_sources

The above code retrieves a comprehensive list of all sources that have contributed substance data to PubChem. The domain parameter is set to 'substance' to specify that we are interested in sources for substances.

The output is a list of sources for substances. These sources include various laboratories, companies, and research institutions that have deposited data in PubChem.

Example sources from the list include:

"001Chemical"
"10X CHEM"
"3B Scientific (Wuhan) Corp"
"A&J Pharmtech CO., LTD."
"AA BLOCKS"

This list provides insights into the contributors of substance data, which can be used for further analysis, validation, and understanding of the chemical data landscape in PubChem.
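
Because the list of depositors is long, filtering it by name is often the first step. The sketch below assumes the result can be treated as (or converted to) a character vector and uses only the five example sources listed above.

```r
# The example sources quoted above, as a plain character vector.
example_sources <- c(
  "001Chemical",
  "10X CHEM",
  "3B Scientific (Wuhan) Corp",
  "A&J Pharmtech CO., LTD.",
  "AA BLOCKS"
)

# Keep the sources whose name contains "chem", ignoring case.
chem_sources <- grep("chem", example_sources, ignore.case = TRUE, value = TRUE)
print(chem_sources)
```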

3.10. Retrieving SDF data with get_sdf()

The get_sdf function is designed to retrieve data in Structure Data File (SDF) format from the PubChem database. This function is particularly useful for chemists and researchers who need to work with molecular structure data.

The function requests SDF data for a specified compound or substance from PubChem. Upon successful retrieval, it saves the data as an SDF file in the specified directory or in a temporary folder if no path is provided. This allows for easy storage and further analysis of molecular structures.

Here are the main parameters of the function:

identifier: The identifier for the compound or substance. This can be a CID, name, or other identifier supported by PubChem.
namespace: Specifies the namespace for the query. The default value is ‘cid’. Common values include:
"cid" (Compound Identifier)
"name" (Chemical Name)
domain: Specifies the domain for the request. The default value is ‘compound’.
operation: An optional operation for the request.
searchtype: An optional search type to refine the query.
path: The path where the SDF file will be saved. If NULL, the function saves the file in a temporary folder.
file_name: The name for the downloaded SDF file. Defaults to a combination of the identifier and a timestamp.
options: Additional parameters for the request.

Downloading SDF by Compound

In this example, we retrieve and save SDF data for the compound "aspirin":

get_sdf(
  identifier = "aspirin",
  namespace = "name",
  path = NULL,
  file_name = "aspirin_structure"
)

The above code retrieves the SDF data for the compound "aspirin" using its name as the identifier. The namespace is set to "name" to indicate that the identifier is a chemical name. The path is set to NULL, so the file is saved in a temporary folder. The file_name is specified as "aspirin_structure".
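
An SDF file is plain text, so it can be inspected with base R once it is on disk. The sketch below builds a tiny hand-written record for water so that it runs without network access, then reads it back and parses the counts line. It assumes the fixed-width V2000 molfile layout; whether a particular downloaded file follows that layout should be checked before reusing the column positions. The helper atom_line is hypothetical.

```r
# Hypothetical helper that formats one fixed-width V2000 atom line.
atom_line <- function(x, y, z, symbol) {
  paste0(
    sprintf("%10.4f%10.4f%10.4f %-3s%2d", x, y, z, symbol, 0L),
    strrep("  0", 11)
  )
}

# Hand-written SDF record for water (header, counts, atoms, bonds, end).
sdf_lines <- c(
  "water",
  "  hand-written example",
  "",
  sprintf("%3d%3d  0  0  0  0  0  0  0  0999 V2000", 3L, 2L),
  atom_line(0.0000, 0.0000, 0, "O"),
  atom_line(0.9572, 0.0000, 0, "H"),
  atom_line(-0.2400, 0.9266, 0, "H"),
  "  1  2  1  0  0  0  0",
  "  1  3  1  0  0  0  0",
  "M  END",
  "$$$$"
)
sdf_path <- tempfile(fileext = ".sdf")
writeLines(sdf_lines, sdf_path)

# Read the file back; line 4 holds the atom and bond counts.
lines <- readLines(sdf_path)
n_atoms <- as.integer(substr(lines[4], 1, 3))
n_bonds <- as.integer(substr(lines[4], 4, 6))

# Element symbols occupy columns 32-34 of each atom line.
atom_block <- lines[5:(4 + n_atoms)]
symbols <- trimws(substr(atom_block, 32, 34))

# Each record in an SDF file ends with a "$$$$" delimiter line.
n_records <- sum(lines == "$$$$")
print(list(atoms = n_atoms, bonds = n_bonds, symbols = symbols, records = n_records))
```

For anything beyond a quick look, a dedicated cheminformatics parser is a safer choice than fixed column positions.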

3.11. Download PubChem Data with download()

The download function facilitates downloading content from the PubChem database in various formats. It allows users to specify the type of content, the identifier for the query, and the destination for saving the downloaded file.

This function interacts with the PubChem database to retrieve data for a given identifier in a specified format. It supports various output formats like JSON, SDF, etc., and saves the content to a user-defined location on the local file system. This makes it a versatile tool for obtaining and managing chemical and biological data.

Here are the main parameters of the function:

filename: The name of the file to be saved. Defaults to the identifier if not specified.
outformat: The desired output format (e.g., SDF, JSON).
path: The path where the content should be saved. If not specified, it defaults to a temporary directory.
identifier: The identifier for the query (e.g., CID, SID, AID, name). This can be a chemical name or a specific identifier supported by PubChem.
namespace: The namespace for the query. Common values include:
"cid" (Compound Identifier)
"name" (Chemical Name)
domain: The domain of the query. Common values include:
"compound"
"substance"
operation: The operation to be performed (optional).
searchtype: The type of search to be performed (optional).
overwrite: Whether to overwrite the file if it already exists. The default is FALSE.
options: Additional arguments for the request.
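
The domain, namespace, identifier, operation, and outformat parameters mirror the path segments of a PubChem PUG REST request. As a rough sketch of that correspondence, the hypothetical helper pug_rest_url below assembles such a URL. It is not part of PubChemR and is not necessarily how download builds its requests internally.

```r
# Hypothetical helper: assemble a PUG REST URL from its path segments.
pug_rest_url <- function(domain, namespace, identifier,
                         operation = NULL, outformat = "JSON") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  # Encode each identifier and join multiple identifiers with commas.
  ids <- paste(
    vapply(as.character(identifier), utils::URLencode, character(1),
           reserved = TRUE),
    collapse = ","
  )
  # A NULL operation is dropped by c(), so the segment is simply omitted.
  paste(c(base, domain, namespace, ids, operation, outformat), collapse = "/")
}

print(pug_rest_url("compound", "name", "aspirin"))
print(pug_rest_url("compound", "cid", c(2244, 5245), outformat = "SDF"))
```

Printing the URL for a request is a convenient way to check in a browser what a given combination of parameters asks for.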

Downloading JSON File by Compound

In this example, we download a JSON file for the compound "aspirin":

download(
  filename = "Aspirin",
  outformat = "JSON",
  path = tempdir(),
  identifier = "aspirin",
  namespace = "name",
  domain = "compound",
  overwrite = TRUE
)

The above code downloads the JSON file for the compound aspirin using its name as the identifier. The namespace is set to "name" to indicate that the identifier is a chemical name. The path is set to tempdir(), which saves the file in the system's temporary directory. The filename is specified as Aspirin, and the outformat is set to JSON. The overwrite parameter is set to TRUE, allowing the function to overwrite the file if it already exists.

The output message confirms that the operation completed successfully and that the file has been saved as "Aspirin.JSON" in the specified directory (the system's temporary directory).

PubChemR documentation built on April 4, 2025, 2:18 a.m.
