Accessing and Exploring Chemical Data with PubChemR: A Guide to PUG REST Service

library(PubChemR)

1. Introduction

The PubChemR package provides the get_pug_rest function for programmatic access to PubChem's chemical data repository. The function wraps PubChem's Power User Gateway (PUG) REST service. This vignette explains the structure and usage of the PUG REST service and walks through a range of use cases to help new users understand how it works and construct effective requests.

2. PUG REST: A Gateway to Chemical Data

PUG REST, standing for Power User Gateway RESTful interface, is a simplified access route to PubChem's data and services. It is designed for scripts, web page embedded JavaScript, and third-party applications, eliminating the need for the more complex XML and SOAP envelopes required by other PUG variants. PUG REST's design revolves around the PubChem identifier (SID for substances, CID for compounds, and AID for assays) and is structured into three main request components: input (identifiers), operation (actions on identifiers), and output (desired information format).
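As a rough illustration of the three-part layout described above, the sketch below assembles a PUG REST URL from its input, operation, and output parts using base R only. The helper build_pug_url is hypothetical and written for this example; it is not a PubChemR function, and it is not meant to show how get_pug_rest builds its requests internally. The base address is PubChem's public PUG REST endpoint.

```r
# Hypothetical helper (not part of PubChemR): assemble a PUG REST URL
# from its three parts: input, operation, and output format.
build_pug_url <- function(domain, namespace, identifier,
                          operation = NULL, output = "JSON") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  # Several identifiers are sent as one comma-separated list.
  ids <- paste(identifier, collapse = ",")
  # c() drops a NULL operation, so full-record requests skip that part.
  parts <- c(base, domain, namespace, ids, operation, output)
  paste(parts, collapse = "/")
}

url <- build_pug_url("compound", "cid", 2244,
                     operation = "property/MolecularWeight",
                     output = "TXT")
print(url)
```

Reading the resulting URL left to right gives the input (compound/cid/2244), the operation (property/MolecularWeight), and the output format (TXT).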

2.1. Key Features of PUG REST:

Usage Policy:

3. Accessing PUG REST with get_pug_rest

Overview

The get_pug_rest function in the PubChemR package provides a versatile interface to access a wide range of chemical data from the PubChem database. This section of the vignette focuses on various methods to retrieve chemical structure information and other related data using the PUG REST service. The function is designed to be flexible, accommodating different input methods, operations, and output formats.

This function sends a request to the PubChem PUG REST API to retrieve various types of data for a given identifier. It supports fetching data in different formats and allows saving the output.

get_pug_rest(
  identifier = NULL,
  namespace = "cid",
  domain = "compound",
  operation = NULL,
  output = "JSON",
  searchtype = NULL,
  property = NULL,
  options = NULL,
  save = FALSE,
  dpi = 300,
  path = NULL,
  file_name = NULL,
  ...
)

Arguments

Value

The function returns different types of content based on the specified output format:

JSON: Returns a list.

CSV and TXT: Returns a data frame.

SDF: Returns an SDF file of the requested identifier.

PNG: Returns an image object or saves an image file.

3.1. Input Methods

In the context of the PubChem PUG REST API, the input methods define how records of interest are specified for a request. There are several ways to define this input, with the most common methods being outlined below:

1. By Identifier: The most straightforward way to specify input is to use identifiers directly. These can be Substance IDs (SIDs) or Compound IDs (CIDs). For example, to retrieve the full record of the compound with CID 2244 (aspirin), you can use the get_pug_rest function as follows:

result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", output = "JSON")
result

The pubChemData function then processes the result to extract and display the retrieved data. Here's an interpretation of the output for CID 2244:

pubChemDataResult <- pubChemData(result)

The JSON response contains detailed information about the compound identified by CID 2244. The PC_Compounds array holds the compound data, and within it, each element corresponds to a specific compound.

For CID 2244, the following information is retrieved:

ID: Confirms the compound identifier is CID 2244.

pubChemDataResult$PC_Compounds[[1]]$id

Atoms: Details the atomic composition, with an aid array listing the atom IDs and an element array listing the atomic numbers (e.g., 6 for carbon, 8 for oxygen, 1 for hydrogen).

pubChemDataResult$PC_Compounds[[1]]$atoms

Bonds: Describes the bonds between atoms, including arrays for the IDs of the atoms involved (aid1 and aid2) and the bond order.

pubChemDataResult$PC_Compounds[[1]]$bonds

Coordinates: Provides the spatial coordinates (x and y) for each atom, which can be used to visualize the molecular structure.

pubChemDataResult$PC_Compounds[[1]]$coords

Charge: Indicates the compound's charge, which is 0 in this case.

pubChemDataResult$PC_Compounds[[1]]$charge

Properties: Lists various properties of the compound, including:

Compound Complexity: A measure of the molecular complexity.

pubChemDataResult$PC_Compounds[[1]]$props[[2]]

Hydrogen Bond Acceptor/Donor Count: Indicates the number of hydrogen bond acceptors and donors.

pubChemDataResult$PC_Compounds[[1]]$props[[3]]
pubChemDataResult$PC_Compounds[[1]]$props[[4]]

Rotatable Bond Count: The number of rotatable bonds, which impacts the molecule's flexibility.

pubChemDataResult$PC_Compounds[[1]]$props[[5]]

IUPAC Names: Various standardized names for the compound, such as "2-acetoxybenzoic acid" and "2-acetyloxybenzoic acid".

pubChemDataResult$PC_Compounds[[1]]$props[[7]]

InChI and InChIKey: Standardized identifiers for the chemical structure.

pubChemDataResult$PC_Compounds[[1]]$props[[13]]
pubChemDataResult$PC_Compounds[[1]]$props[[14]]

Log P: The partition coefficient, indicating the compound's hydrophobicity.

pubChemDataResult$PC_Compounds[[1]]$props[[15]]

Molecular Formula: The chemical formula of the compound, which is C9H8O4.

pubChemDataResult$PC_Compounds[[1]]$props[[17]]

Molecular Weight: The compound's molecular weight, 180.16 g/mol.

pubChemDataResult$PC_Compounds[[1]]$props[[18]]

SMILES: Canonical and isomeric Simplified Molecular Input Line Entry System (SMILES) strings, which are text representations of the chemical structure.

Topological Polar Surface Area: A measure of the molecule's surface area that can form hydrogen bonds.

pubChemDataResult$PC_Compounds[[1]]$props[[19]]
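The positional indexes used above depend on the order in which PubChem returns the properties, so a lookup by label may be more robust. The sketch below is a minimal example under one assumption: that each entry of props is a nested list with a urn$label field and a value field, as in PubChem's JSON records. The props object here is a small mock built from the aspirin values quoted in this section, and find_prop is a hypothetical helper, not a PubChemR function.

```r
# Mock of the nested list shape assumed for PC_Compounds[[1]]$props;
# the values are the aspirin (CID 2244) values quoted in this section.
props <- list(
  list(urn = list(label = "Molecular Formula"), value = list(sval = "C9H8O4")),
  list(urn = list(label = "Molecular Weight"), value = list(sval = "180.16"))
)

# Hypothetical helper: return the value of the first property whose
# urn$label matches `label`, or NULL when none matches.
find_prop <- function(props, label) {
  for (p in props) {
    if (identical(p$urn$label, label)) {
      # The value sits under a type-specific key (e.g. sval, ival, fval),
      # so take whichever single element is present.
      return(p$value[[1]])
    }
  }
  NULL
}

find_prop(props, "Molecular Formula")
```

With a real result, the same call could be tried on pubChemDataResult$PC_Compounds[[1]]$props. Labels that occur more than once, such as the several IUPAC name variants, would need an extra check on a second field such as urn$name.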

For multiple IDs, a vector of IDs can be used. For instance, to retrieve a CSV table of compound properties:

result <- get_pug_rest(identifier = c("1","2","3","4","5"), namespace = "cid", domain = "compound", property = c("MolecularFormula","MolecularWeight","CanonicalSMILES"), output = "CSV")
result

The output of this request, when processed with pubChemData, provides a data frame containing the specified properties for each CID:

pubChemData(result)

Each row in the table corresponds to a different CID and lists the requested properties, facilitating easy comparison and further analysis.

2. By Name: In addition to using direct identifiers, you can refer to a chemical by its common name. This method allows users to search for compounds using familiar names instead of numerical identifiers. It's important to note that a single name might correspond to multiple records in the PubChem database. For example, the name "glucose" can refer to several different compounds or isomers. Here’s how you can retrieve Compound IDs (CIDs) for "glucose":

result <- get_pug_rest(identifier = "glucose", namespace = "name", domain = "compound", operation = "cids", output = "TXT")
result

The pubChemData function then extracts the retrieved data and returns it as a data frame. Here, the search for "glucose" returned a single CID:

pubChemData(result)

This output reveals that the common name "glucose" corresponds to the CID 5793. This CID can then be used in further queries to retrieve detailed information about the compound, such as its molecular structure, properties, and associated bioactivities.

Using a common name for searching can simplify the process, especially when the numerical identifiers are not known. However, because a name can map to multiple records, the results might need further filtering or validation to ensure they correspond to the specific compound of interest.

3. By Structure Identity: Another method to specify a compound in PubChem PUG REST API requests is by using structural identifiers such as SMILES or InChI keys. This approach allows for precise identification of chemical structures by providing a textual representation of the molecule. For example, to retrieve the CID for the SMILES string "CCCC" (which represents butane), you can use the following code:

result <- get_pug_rest(identifier = "CCCC", namespace = "smiles", domain = "compound", operation = "cids", output = "TXT")
result

Calling pubChemData(result) returns the output as a data frame. Here, the search for the SMILES string "CCCC" returned a single CID:

pubChemData(result)

This output reveals that the SMILES string "CCCC" corresponds to the CID 7843. This CID can then be used in further queries to gather detailed information about the compound, such as its molecular structure, physical and chemical properties, biological activities, and more.

Using structure-based identifiers like SMILES or InChI keys is particularly useful for precise and unambiguous chemical searches, as these identifiers provide a detailed representation of the molecule's structure. This method ensures that the exact compound of interest is identified, reducing the risk of ambiguity that might arise with common names or other identifiers.

4. By Fast (Synchronous) Structure Search: In PubChem PUG REST API, fast (synchronous) structure search allows for quicker searches by identity, similarity, substructure, and superstructure, often returning results in a single call. This method is efficient for obtaining results quickly and is useful for various types of structural queries.

To illustrate, let’s perform a fast identity search for the compound with CID 5793, using the same connectivity option:

result <- get_pug_rest(identifier = "5793", namespace = "cid", domain = "compound", operation = "cids", output = "TXT", searchtype = "fastidentity", options = list(identity_type = "same_connectivity"))
result

The following code returns the output as a data frame listing all CIDs that match the fast identity search criteria.

pubChemData(result)

The output indicates that numerous compounds share the same connectivity as the compound identified by CID 5793. Each row in the output is a different CID with the same atom connectivity as the original compound. The list includes hundreds of CIDs, showing how quickly the fast identity search can identify structurally related compounds.

This method is advantageous for researchers needing to identify compounds with similar structures for further study, such as drug development, chemical analysis, or bioactivity screening. The fast response time and comprehensive results make it a valuable tool for various chemical and pharmaceutical applications.
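To make the request shape concrete, the sketch below writes out the fast identity search as a plain PUG REST URL, with the search options appended as a query string. It uses base R only; options_to_query is a hypothetical helper for this example, and whether get_pug_rest assembles its request in exactly this way is an assumption.

```r
# Hypothetical helper: turn a named list of options into a URL query string.
options_to_query <- function(options) {
  if (length(options) == 0) return("")
  pairs <- paste0(names(options), "=", unlist(options))
  paste0("?", paste(pairs, collapse = "&"))
}

base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# The search type (fastidentity) sits between the domain and the namespace.
path <- paste(base, "compound", "fastidentity", "cid", "5793", "cids", "TXT",
              sep = "/")
url <- paste0(path, options_to_query(list(identity_type = "same_connectivity")))
print(url)
```

The search type becomes part of the input portion of the URL, while options such as identity_type travel as query parameters after the output format.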

5. By Cross-Reference (XRef): The cross-reference (XRef) method allows for reverse lookup of records using a cross-reference value. This method is particularly useful for linking external identifiers, such as patent numbers, to records in the PubChem database. For example, to find all SIDs linked to a specific patent identifier, you can use the following code:

result <- get_pug_rest(identifier = "US20050159403A1", namespace = "xref/PatentID", domain = "substance", operation = "sids", output = "JSON")
result

The pubChemData function returns all SIDs linked to the specified patent identifier. The output shows that the patent identifier "US20050159403A1" is linked to a large number of SIDs, each representing a different substance referenced in the patent.

pubChemData(result)

This result provides a comprehensive list of substances associated with the specified patent, allowing for further exploration and analysis within the PubChem database.

3.2. Available Data

Once you’ve specified the records of interest in PUG REST, the next step is to define what information you want to retrieve about these records. PUG REST excels in providing access to specific data points about each record, such as individual properties or cross-references, without the need to download and sift through large datasets.

1. Full Records: PUG REST allows the retrieval of entire records in various formats like JSON, CSV, TXT, and SDF. For example, to retrieve the record for aspirin (CID 2244) in SDF format:

result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", output = "SDF")

Multiple records can also be requested in a single call, though large lists may be subject to timeouts.
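One way to reduce the risk of timeouts is to split a long identifier list into smaller batches and request each batch separately. The sketch below shows only the splitting step in base R. The helper chunk_ids is hypothetical, and the batch size of 100 is an arbitrary illustration rather than a documented PubChem limit.

```r
# Hypothetical helper: split identifiers into batches of at most `size`.
chunk_ids <- function(ids, size = 100) {
  stopifnot(size >= 1)
  if (length(ids) == 0) return(list())
  # Consecutive identifiers share a group number: 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(ids) / size)
  unname(split(ids, groups))
}

batches <- chunk_ids(1:250, size = 100)
lengths(batches)
```

Each element of batches could then be passed as the identifier argument of a separate get_pug_rest call.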

2. Images: Images of chemical structures can be retrieved by specifying PNG format. This works with various input methods, including chemical names, SMILES strings, and InChI keys. For example, to get an image for the chemical name "lipitor":

get_pug_rest(identifier = "lipitor", namespace = "name", domain = "compound", output = "PNG")

3. Compound Properties: Pre-computed properties for PubChem compounds are accessible individually or in tables. For instance, to get the molecular weight of a compound:

result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", property = "MolecularWeight", output = "TXT")
result
pubChemData(result)

Or to retrieve a CSV table of multiple compounds and properties:

result <- get_pug_rest(identifier = c("1","2","3","4","5"), namespace = "cid", domain = "compound", property = c("MolecularWeight", "MolecularFormula", "HBondDonorCount", "HBondAcceptorCount", "InChIKey", "InChI"), output = "CSV")
pubChemData(result)

4. Synonyms: To view all synonyms of a compound, such as Vioxx:

result <- get_pug_rest(identifier = "vioxx", namespace = "name", domain = "compound", operation = "synonyms", output = "JSON")
result
pubChemData(result)

5. Cross-References (XRefs): PUG REST provides access to various cross-references. For example, to retrieve MMDB identifiers for protein structures containing aspirin:

result <- get_pug_rest(identifier = "2244", namespace = "cid", domain = "compound", operation = c("xrefs","MMDBID"), output = "JSON")
result
pubChemData(result)

Or to find all patent identifiers associated with a given SID:

result <- get_pug_rest(identifier = "137349406", namespace = "sid", domain = "substance", operation = c("xrefs","PatentID"), output = "TXT")
result
pubChemData(result)

These examples illustrate the versatility of PUG REST in fetching specific data efficiently. It's an ideal tool for users who need quick access to particular pieces of information from the vast PubChem database without the overhead of processing bulk data.

3.3. BioAssays

PubChem BioAssays are complex entities containing a wealth of data. PUG REST provides access to both complete assay records and specific components of BioAssay data, allowing users to efficiently retrieve the information they need.

1. Assay Description: To obtain the description section of a BioAssay, which includes authorship, general description, protocol, and data readout definitions, use a request like:

result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "description", output = "JSON")
result
pubChemData(result)

For a simplified summary format that includes target information, active and inactive SID and CID counts:

result <- get_pug_rest(identifier = "1000", namespace = "aid", domain = "assay", operation = "summary", output = "JSON")
result
pubChemData(result)

2. Assay Data: To retrieve the entire data set of an assay in CSV format:

result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", output = "CSV")
result
pubChemData(result)

For a subset of data rows, specify the SIDs:

result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "JSON?sid=104169547,109967232", output = "JSON")
result
result <- pubChemData(result)
result$PC_AssaySubmit$assay

For concise data (e.g., active concentration readout) with additional information:

result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "concise", output = "JSON")
result
pubChemData(result)

For dose-response curve data:

result <- get_pug_rest(identifier = "504526", namespace = "aid", domain = "assay", operation = "doseresponse/CSV?sid=104169547,109967232", output = "CSV")
result
pubChemData(result)
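Both subset requests above embed the SID list directly in the operation string. When the SIDs come from a vector, that string can be built programmatically. The sketch below uses base R only; sid_subset_operation is a hypothetical helper that reproduces the two operation strings used in this section.

```r
# Hypothetical helper: build an operation string that limits an assay
# request to selected SIDs, in the form used by the examples above.
sid_subset_operation <- function(sids, format = "JSON", prefix = NULL) {
  query <- paste0(format, "?sid=", paste(sids, collapse = ","))
  # c() drops a NULL prefix; otherwise the prefix is joined with "/".
  paste(c(prefix, query), collapse = "/")
}

sids <- c("104169547", "109967232")
sid_subset_operation(sids)
sid_subset_operation(sids, format = "CSV", prefix = "doseresponse")
```

The returned strings can be supplied as the operation argument in the same way as the hand-written ones.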

3. Targets: To retrieve assay targets, including protein or gene identifiers:

result <- get_pug_rest(identifier = c("490","1000"), namespace = "aid", domain = "assay", operation = "targets/ProteinGI,ProteinName,GeneID,GeneSymbol", output = "JSON")
result
pubChemData(result)

To select assays via target identifier:

result <- get_pug_rest(identifier = "USP2", namespace = "target/genesymbol", domain = "assay", operation = "aids", output = "TXT")
result
pubChemData(result)

4. Activity Name: To select BioAssays by the name of the primary activity column:

result <- get_pug_rest(identifier = "EC50", namespace = "activity", domain = "assay", operation = "aids", output = "JSON")
result
pubChemData(result)

These examples demonstrate the flexibility of PUG REST in accessing specific BioAssay data. Users can efficiently retrieve detailed descriptions, comprehensive data sets, concise readouts, and target information, making it a valuable tool for researchers and scientists working with BioAssay data.

3.4. Genes

PubChem provides various methods to access gene data, making it a valuable resource for genetic research. Here's how you can utilize PUG REST to access gene-related information:

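1. Gene Input Methods:

By Gene ID: Access gene data using NCBI Gene identifiers. For example, for gene IDs 1956 and 13649:

result <- get_pug_rest(identifier = "1956,13649", namespace = "geneid", domain = "gene", operation = "summary", output = "JSON")
result
pubChemData(result)

By Gene Symbol: Access gene data using a gene symbol. For example, for EGFR:

result <- get_pug_rest(identifier = "EGFR", namespace = "genesymbol", domain = "gene", operation = "summary", output = "JSON")
result
pubChemData(result)

By Gene Synonym: Access gene data using a synonym. For example, for ERBB1:

result <- get_pug_rest(identifier = "ERBB1", namespace = "synonym", domain = "gene", operation = "summary", output = "JSON")
result
pubChemData(result)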

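2. Available Gene Data:

Gene Summary: Returns a summary for each requested gene. For example:

result <- get_pug_rest(identifier = "1956,13649", namespace = "geneid", domain = "gene", operation = "summary", output = "JSON")
result
pubChemData(result)

Assays from Gene: Retrieves a list of AIDs tested against a specific gene. For example, for gene ID 13649:

result <- get_pug_rest(identifier = "13649", namespace = "geneid", domain = "gene", operation = "aids", output = "TXT")
result
pubChemData(result)

Bioactivities from Gene: Returns concise bioactivity data for a specific gene. For example:

result <- get_pug_rest(identifier = "13649", namespace = "geneid", domain = "gene", operation = "concise", output = "JSON")
result
pubChemData(result)

Pathways from Gene: Provides a list of pathways involving a specific gene. For example:

result <- get_pug_rest(identifier = "13649", namespace = "geneid", domain = "gene", operation = "pwaccs", output = "TXT")
result
pubChemData(result)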

These methods offer a comprehensive way to access and analyze gene-related data in PubChem, catering to various research needs in genetics and molecular biology.

3.5. Proteins

PubChem provides a versatile platform for accessing detailed protein data, essential for researchers in biochemistry and molecular biology. Here's how you can utilize PUG REST for protein-related queries:

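1. Protein Input Methods:

By Protein Accession: Access protein data using protein accessions. For example, for accessions P00533 and P01422:

result <- get_pug_rest(identifier = "P00533,P01422", namespace = "accession", domain = "protein", operation = "summary", output = "JSON")
result
pubChemData(result)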

By Protein Synonym: Access protein data using synonyms or identifiers from external sources. For example, using a ChEMBL ID:

result <- get_pug_rest(identifier = "ChEMBL:CHEMBL203", namespace = "synonym", domain = "protein", operation = "summary", output = "JSON")
result
pubChemData(result)

2. Available Protein Data:

Protein Summary: Returns a summary for each requested protein. For example:

result <- get_pug_rest(identifier = "P00533,P01422", namespace = "accession", domain = "protein", operation = "summary", output = "JSON")
result
pubChemData(result)

Assays from Protein: Retrieves a list of AIDs tested against a specific protein. For example, for protein accession P00533:

result <- get_pug_rest(identifier = "P00533", namespace = "accession", domain = "protein", operation = "aids", output = "TXT")
result
pubChemData(result)

Bioactivities from Protein: Returns concise bioactivity data for a specific protein. For example:

result <- get_pug_rest(identifier = "Q01279", namespace = "accession", domain = "protein", operation = "concise", output = "JSON")
result
pubChemData(result)

Pathways from Protein: Provides a list of pathways involving a specific protein. For example, for protein accession P00533:

result <- get_pug_rest(identifier = "P00533", namespace = "accession", domain = "protein", operation = "pwaccs", output = "TXT")
result
pubChemData(result)

These methods offer a comprehensive approach to accessing and analyzing protein-related data in PubChem, supporting a wide range of research applications in the fields of biochemistry, molecular biology, and pharmacology.

3.6. Pathways

PubChem's PUG REST service offers a detailed and comprehensive approach to accessing pathway information, crucial for researchers in fields like bioinformatics, pharmacology, and molecular biology. Here's how to utilize PUG REST for pathway-related queries:

1. Pathway Input Methods:

By Pathway Accession: The primary method to access pathway data in PubChem. Pathway Accession is formatted as Source:ID. For example, to get a summary for the Reactome pathway R-HSA-70171 in JSON format:

result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "summary", output = "JSON")
result
pubChemData(result)
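Because a pathway accession is a plain Source:ID string, it can be built and taken apart with base R string functions. The sketch below is a minimal example; make_pwacc and split_pwacc are hypothetical helpers, not PubChemR functions.

```r
# Hypothetical helpers for the "Source:ID" pathway accession format.
make_pwacc <- function(source, id) paste0(source, ":", id)

split_pwacc <- function(pwacc) {
  # Split at the first colon only, in case the ID part contains one.
  data.frame(
    source = sub(":.*$", "", pwacc),
    id = sub("^[^:]*:", "", pwacc),
    stringsAsFactors = FALSE
  )
}

acc <- make_pwacc("Reactome", "R-HSA-70171")
parts <- split_pwacc(c(acc, "BioCyc:HUMAN_PWY-4983"))
parts
```

Separating the source from the ID can be handy when accessions from several pathway databases are handled together.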

2. Available Pathway Data:

Pathway Summary: Returns a summary including PathwayAccession, SourceName, Name, Type, Category, Description, TaxonomyID, and Taxonomy. For example:

result <- get_pug_rest(identifier = "Reactome:R-HSA-70171,BioCyc:HUMAN_PWY-4983", namespace = "pwacc", domain = "pathway", operation = "summary", output = "JSON")
result
pubChemData(result)

Compounds from Pathway: Retrieves a list of compounds involved in a specific pathway. For example, for the Reactome pathway R-HSA-70171:

result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "cids", output = "TXT")
result
pubChemData(result)

Genes from Pathway: Provides a list of genes involved in a specific pathway. For example, for the Reactome pathway R-HSA-70171:

result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "geneids", output = "TXT")
result
pubChemData(result)

Proteins from Pathway: Returns a list of proteins involved in a given pathway. For example, for the Reactome pathway R-HSA-70171:

result <- get_pug_rest(identifier = "Reactome:R-HSA-70171", namespace = "pwacc", domain = "pathway", operation = "accessions", output = "TXT")
result
pubChemData(result)

These methods offer a streamlined and efficient way to access and analyze pathway-related data in PubChem, supporting a wide range of research applications in bioinformatics, molecular biology, and related fields.

3.7. Taxonomies

PubChem's PUG REST service provides a comprehensive approach to accessing taxonomy information, essential for researchers in fields like biology, pharmacology, and environmental science. Here's how to utilize PUG REST for taxonomy-related queries:

1. Taxonomy Input Methods:

By Taxonomy ID: The primary method to access taxonomy data in PubChem using NCBI Taxonomy identifiers. For example, to get a summary for human (Taxonomy ID 9606) and SARS-CoV-2 (Taxonomy ID 2697049) in JSON format:

result <- get_pug_rest(identifier = "9606,2697049", namespace = "taxid", domain = "taxonomy", operation = "summary", output = "JSON")
result
pubChemData(result)

By Taxonomy Synonym: Access taxonomy data using synonyms like scientific or common names. For example, for Homo sapiens:

result <- get_pug_rest(identifier = "Homo sapiens", namespace = "synonym", domain = "taxonomy", operation = "summary", output = "JSON")
result
pubChemData(result) 

2. Available Taxonomy Data:

Taxonomy Summary: Returns a summary including TaxonomyID, ScientificName, CommonName, Rank, RankedLineage, and Synonyms. For example:

result <- get_pug_rest(identifier = "9606,10090,10116", namespace = "taxid", domain = "taxonomy", operation = "summary", output = "JSON")
result
pubChemData(result)

Assays and Bioactivities: Retrieves a list of assays (AIDs) associated with a given taxonomy. For example, for SARS-CoV-2:

result <- get_pug_rest(identifier = "2697049", namespace = "taxid", domain = "taxonomy", operation = "aids", output = "TXT")
result
pubChemData(result) 

To aggregate concise bioactivity data from each AID:

result <- get_pug_rest(identifier = "1409578", namespace = "aid", domain = "assay", operation = "concise", output = "JSON")
result
pubChemData(result)
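Aggregating over many AIDs amounts to repeating the single-AID request and stacking the results. The sketch below shows that loop in base R with a stand-in fetch function so that it runs offline. The helpers aggregate_by_aid and mock_fetch are hypothetical, and the mock rows are placeholders, not PubChem data.

```r
# Hypothetical helper: call `fetch` once per AID and stack the results.
# `fetch` is any function that takes one AID and returns a data frame.
aggregate_by_aid <- function(aids, fetch, pause = 0.2) {
  results <- lapply(aids, function(aid) {
    out <- fetch(aid)
    Sys.sleep(pause)  # short pause to keep the request rate modest
    out
  })
  do.call(rbind, results)
}

# Stand-in for a real request so the sketch runs offline; the rows are
# placeholders, not PubChem data.
mock_fetch <- function(aid) {
  data.frame(AID = aid, Row = 1:2, stringsAsFactors = FALSE)
}

combined <- aggregate_by_aid(c("1409578", "79900"), mock_fetch, pause = 0)
combined
```

In real use, fetch might wrap a get_pug_rest call with operation = "concise" followed by pubChemData and a conversion to a data frame. The exact shape of that result is not assumed here and should be checked before stacking.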

These methods offer a streamlined and efficient way to access and analyze taxonomy-related data in PubChem, supporting a wide range of research applications in biology, pharmacology, environmental science, and related fields.

3.8. Cell Lines

PubChem's PUG REST service offers a valuable resource for accessing detailed information about various cell lines, crucial for research in cellular biology, pharmacology, and related fields. Here's how to effectively use PUG REST for cell line-related queries:

1. Cell Line Input Methods:

By Cell Line Accession: Utilize Cellosaurus and ChEMBL cell line accessions for precise data retrieval. For example:

result <- get_pug_rest(identifier = "CHEMBL3308376,CVCL_0045", namespace = "cellacc", domain = "cell", operation = "summary", output = "JSON")
result
pubChemData(result)

By Cell Line Synonym: Access data using cell line names or other synonyms. For instance, for the HeLa cell line:

result <- get_pug_rest(identifier = "HeLa", namespace = "synonym", domain = "cell", operation = "summary", output = "JSON")
result
pubChemData(result)

2. Available Cell Line Data:

Cell Line Summary: Provides a comprehensive overview including CellAccession, Name, Sex, Category, SourceTissue, SourceTaxonomyID, SourceOrganism, and Synonyms. For example:

result <- get_pug_rest(identifier = "CVCL_0030,CVCL_0045", namespace = "cellacc", domain = "cell", operation = "summary", output = "JSON")
result
pubChemData(result)

Assays and Bioactivities from Cell Line: Retrieves a list of assays (AIDs) tested on a specific cell line. For example, for HeLa:

result <- get_pug_rest(identifier = "HeLa", namespace = "synonym", domain = "cell", operation = "aids", output = "TXT")
result
pubChemData(result)

To aggregate concise bioactivity data from each AID:

result <- get_pug_rest(identifier = "79900", namespace = "aid", domain = "assay", operation = "concise", output = "JSON")
result
pubChemData(result)

These methods provide an efficient and targeted approach to access and analyze cell line-related data in PubChem, supporting a wide range of research applications in cellular biology, pharmacology, and related fields.



PubChemR documentation built on April 4, 2025, 2:18 a.m.