```{css, code = readLines(params$my_css), hide=TRUE, echo = FALSE}
```

```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(httptest)
start_vignette("2")

if (!library(ctxR, logical.return = TRUE)){
  devtools::load_all()
}
old_options <- options("width")

# Used to visualize data in a variety of plot designs
library(ggplot2)
library(gridExtra)

# Redefining the knit_print method to truncate character values to 25 characters
# in each column and to truncate the columns in the print call to prevent
# wrapping tables with several columns.
#library(ctxR)
knit_print.data.table = function(x, ...) {
  y <- data.table::copy(x)
  y <- y[, lapply(.SD, function(t){
    if (is.character(t)){
      t <- strtrim(t, 25)
    }
    return(t)
  })]
  print(y, trunc.cols = TRUE)
}

registerS3method(
  "knit_print", "data.table", knit_print.data.table,
  envir = asNamespace("knitr")
)
```
In this vignette, the CTX Chemical API will be explored.
The foundation of toxicology, toxicokinetics, and exposure is embedded in the physics and chemistry of chemical-biological interactions. The accurate characterization of chemical structure linked to commonly used identifiers, such as names and Chemical Abstracts Service Registry Numbers (CASRNs), is essential to support both predictive modeling of the data and dissemination and application of the data for chemical safety decisions.
With cheminformatics as the backbone for research efforts, sources of available data through the CTX Chemical API include:
More information on Chemicals and Chemistry Data can be found here: https://www.epa.gov/comptox-tools/downloadable-computational-toxicology-data#SCD.
::: {.noticebox data-latex=""}
NOTE: Please see the introductory vignette for an overview of the ctxR package and initial setup instructions with API key storage.
:::

Several ctxR functions can be used to access the CTX Chemical API data, as described in the following sections. Tables output in each example have been filtered to only display the first few rows of data.
`get_chemical_details()` retrieves chemical detail data using either the DTXSID or the DTXCID chemical identifier. The additional parameter "projection" determines the type of data returned. Examples for each identifier are provided below:
chemical_details_by_dtxsid <- get_chemical_details(DTXSID = 'DTXSID7020182')
chemical_details_by_dtxcid <- get_chemical_details(DTXCID = 'DTXCID30182')
vector_dtxsid <- c("DTXSID7020182", "DTXSID9020112", "DTXSID8021430")
chemical_details_by_batch_dtxsid <- get_chemical_details_batch(DTXSID = vector_dtxsid)
vector_dtxcid <- c("DTXCID30182", "DTXCID801430", "DTXCID90112")
chemical_details_by_batch_dtxcid <- get_chemical_details_batch(DTXCID = vector_dtxcid)

`check_existence_by_dtxsid()` checks if the supplied DTXSID is valid and returns a URL for additional information on the chemical in the case of a valid DTXSID.

dtxsid_check_true <- check_existence_by_dtxsid(DTXSID = 'DTXSID7020182')
dtxsid_check_false <- check_existence_by_dtxsid(DTXSID = 'DTXSID7020182f')

vector_dtxsid_and_non_dtxsid <- c('DTXSID7020182F', 'DTXSID7020182', 'DTXSID0020232F')
dtxsid_checks <- check_existence_by_dtxsid_batch(DTXSID = vector_dtxsid_and_non_dtxsid)
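Before sending identifiers to the API, it can help to screen out obviously malformed strings locally. The sketch below is only a syntactic pre-filter. It assumes that a DTXSID is the prefix `DTXSID` followed solely by digits, and the helper `looks_like_dtxsid()` is hypothetical and not part of ctxR. Passing this check does not mean the chemical exists; the API call remains the authoritative test.

```r
# Hypothetical helper (not part of ctxR): syntactic pre-check only.
# Assumption: a DTXSID is the literal prefix "DTXSID" followed by digits.
looks_like_dtxsid <- function(x) {
  grepl("^DTXSID[0-9]+$", x)
}

candidates <- c("DTXSID7020182F", "DTXSID7020182", "DTXSID0020232F")
plausible <- looks_like_dtxsid(candidates)

# Only the plausible strings would then be worth passing on to
# check_existence_by_dtxsid_batch() for the authoritative check.
candidates[plausible]
```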
`get_chemical_by_property_range()` retrieves data for chemicals that have a specified property within the input range.
chemical_by_property_range <- get_chemical_by_property_range(start = 1.311, end = 1.313, property = 'Density')
`get_chem_info()` retrieves specific chemical information for an input chemical. This includes both experimental and predicted values by default, but providing "experimental" or "predicted" to the `type` parameter will return only that type of information.
chemical_info <- get_chem_info(DTXSID = 'DTXSID7020182')
`get_fate_by_dtxsid()` retrieves chemical fate data.
fate_by_dtxsid <- get_fate_by_dtxsid(DTXSID = 'DTXSID7020182')
Chemicals can be searched using string values. Examples of each search type (starts with, exact match, and contains) are provided below:
search_starts_with <- chemical_starts_with(word = 'DTXSID70201')
search_exact <- chemical_equal(word = 'DTXSID7020182')
search_contains <- chemical_contains(word = 'DTXSID702018')
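The difference between the three search modes can be illustrated locally with base R string functions. This is a sketch on a made-up vector of identifiers (the second entry is invented for the example), and it mimics the idea only; it is not a description of how the API performs its search.

```r
# Made-up identifiers for illustration; "DTXSID7020183" is invented.
ids <- c("DTXSID7020182", "DTXSID7020183", "DTXSID9020112")

# Starts with: the query must match the beginning of the string.
starts_with <- ids[startsWith(ids, "DTXSID70201")]

# Exact: the query must match the whole string.
exact <- ids[ids == "DTXSID7020182"]

# Contains: the query may appear anywhere in the string.
contains <- ids[grepl("702018", ids, fixed = TRUE)]
```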
MS-Ready (McEachran, A. et al. 2018) data can be retrieved using a variety of input information. Examples for each are provided below:
msready_by_mass <- get_msready_by_mass(start = 200.9, end = 200.95)
msready_by_formula <- get_msready_by_formula(formula = 'C16H24N2O5S')
msready_by_dtxcid <- get_msready_by_dtxcid(DTXCID = 'DTXCID30182')
Several lists of chemicals can be accessed through the CCD list search. These can be filtered by the type, name, inclusion of a specific chemical, or name of list.
get_all_list_types()
chemical_lists_by_type <- get_chemical_lists_by_type(type = 'federal')
public_chemical_list_by_name <- get_public_chemical_list_by_name(listname = 'CCL4')
`get_lists_containing_chemical()` retrieves a list of names of chemical lists, each of which contains the specified chemical.

lists_containing_chemical <- get_lists_containing_chemical(DTXSID = 'DTXSID7020182')

`get_chemicals_in_list_start()` retrieves a list of DTXSIDs for a given starting character string in a specified list of chemicals.

chemicals_in_ccl4_start <- get_chemicals_in_list_start(list_name = 'CCL4', word = 'Bi')

`get_chemicals_in_list_exact()` retrieves a list of DTXSIDs matching exactly a given character string in a specified list of chemicals.

chemicals_in_ccl4_exact <- get_chemicals_in_list_exact(list_name = 'BIOSOLIDS2021', word = 'Bisphenol A')

`get_chemicals_in_list_contain()` retrieves a list of DTXSIDs that contain a given character string in a specified list of chemicals.

chemicals_in_ccl4_contain <- get_chemicals_in_list_contain(list_name = 'CCL4', word = 'Bis')

`get_chemicals_in_list()` retrieves the specific chemical information for each chemical contained in the specified list.
chemicals_in_list <- get_chemicals_in_list(list_name = 'CCL4')
There are mrv, mol, and image files that can be accessed using either the DTXSID or DTXCID. Examples are provided below:
`get_chemical_mrv()` retrieves mrv file information for a chemical specified either by DTXSID or DTXCID.

chemical_mrv_by_dtxsid <- get_chemical_mrv(DTXSID = 'DTXSID7020182')
chemical_mrv_by_dtxcid <- get_chemical_mrv(DTXCID = 'DTXCID30182')

`get_chemical_mol()` retrieves mol file information for a chemical specified either by DTXSID or DTXCID.

chemical_mol_by_dtxsid <- get_chemical_mol(DTXSID = 'DTXSID7020182')
chemical_mol_by_dtxcid <- get_chemical_mol(DTXCID = 'DTXCID30182')

`get_chemical_image()` retrieves image file information for a chemical specified by DTXSID, DTXCID, or SMILES string. To visualize the returned array of image information, the user may use either the `png::writePNG()` or `countcolors::plotArrayAsImage()` functions, among many choices.

chemical_image_by_dtxsid <- get_chemical_image(DTXSID = 'DTXSID7020182')
chemical_image_by_dtxcid <- get_chemical_image(DTXCID = 'DTXCID30182')
chemical_image_by_smiles <- get_chemical_image(SMILES = 'CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1')
countcolors::plotArrayAsImage(chemical_image_by_dtxsid)
countcolors::plotArrayAsImage(chemical_image_by_dtxcid)
countcolors::plotArrayAsImage(chemical_image_by_smiles)

`get_chemical_synonym()` retrieves synonyms for the specified chemical.
chemical_synonym <- get_chemical_synonym(DTXSID = 'DTXSID7020182')
The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that "...are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems...." Moreover, this list "...was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants...."

The National-Scale Air Toxics Assessments (NATA) is "... EPA's ongoing comprehensive evaluation of air toxics in the United States... a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks... use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them...."

These lists can be found in the CCD at CCL4 (with additional information at CCL4 information) and NATADB (with additional information at NATA information). The quotes from the previous paragraphs were excerpted from list detail descriptions found using the CCD links.
In this example use case, physico-chemical property data will be compared between a water contaminant priority list and an air toxics list. Note that the following code chunks use the `data.table` object, which is an extension of the `data.frame` object and has slightly different syntax. For more information, please refer to the data.table documentation.
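To make the syntax difference concrete, the sketch below computes the same quantity with a base `data.frame` and with a `data.table`. The data are made up, with column names chosen to mirror those used later in this vignette. The data.table part only runs if the package is installed.

```r
# Made-up data; column names mirror those used later in this vignette.
df <- data.frame(
  propType = c("experimental", "predicted", "experimental", "predicted"),
  value    = c(100, 110, 50, 70)
)

# Base data.frame: subset the rows, then summarise in a separate step.
base_mean <- mean(df[df$propType == "predicted", "value"])

# data.table: filter (i), compute (j), and group (by) inside one bracket call.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  dt_mean <- dt[propType == "predicted", mean(value)]
  dt_by_type <- dt[, .(Mean = mean(value)), by = .(propType)]
  stopifnot(isTRUE(all.equal(base_mean, dt_mean)))
}
```

Inside the brackets, data.table evaluates column names directly, which is why `propType` and `value` appear without the `df$` prefix.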
First, confirm the chemical lists to query.

options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
natadb_information <- get_public_chemical_list_by_name('NATADB')
print(natadb_information, trunc.cols = TRUE)
Next, retrieve the list of chemicals associated with each list.
ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)
natadb <- get_chemicals_in_list('NATADB')
natadb <- data.table::as.data.table(natadb)
We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.
dim(ccl4)
dim(natadb)
colnames(ccl4)
head(ccl4, 1)

Next, physico-chemical properties for all chemicals in each list can be retrieved. The function `get_chem_info_batch()` will be used to search for a list of DTXSIDs in a single batch.

ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid)
natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid)

Observe that this returns a single data.table for each query, and the data.table contains the physico-chemical properties available from the CompTox Chemicals Dashboard for each chemical in the query. Note that a warning message was triggered, `Warning: Setting type to ''!`, which indicates that the parameter `type` was not given a value. A default value is set within the function and more information can be found in the associated documentation. We examine the set of physico-chemical properties for the first chemical in CCL4.
Before any deeper analysis, consider the dimensions of the data and the column names.
dim(ccl4_phys_chem)
colnames(ccl4_phys_chem)

Next, we display the unique values for the columns `propertyId` and `propType`.

ccl4_phys_chem[, unique(propertyId)]
ccl4_phys_chem[, unique(propType)]
Let's explore this further by examining the mean of the "boiling-point" and "melting-point" data.
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value))]
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value)), by = .(propType)]
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value))]
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value)), by = .(propType)]
These results tell us about some of the reported physico-chemical properties of the data sets.
The mean "boiling-point" is 252.6593 degrees Celsius for CCL4, with mean values of 250.5943 and 253.8196 for experimental and predicted, respectively. The mean "melting-point" is 34.91613 degrees Celsius for CCL4, with mean values of 23.18876 and 49.99417 for experimental and predicted, respectively.
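Note that each overall mean lies between the experimental and predicted means but is not their simple average. The two groups need not contain the same number of records, so the overall mean is the average of the group means weighted by group size. A small sketch with made-up numbers (not CCL4 data) shows the relationship:

```r
# Made-up values, for illustration only.
value    <- c(100, 120, 140, 200)
propType <- c("experimental", "predicted", "predicted", "predicted")

overall_mean <- mean(value)                    # mean over all records
group_means  <- tapply(value, propType, mean)  # one mean per property type
group_sizes  <- table(propType)                # record count per property type

# Weighting each group mean by its record count recovers the overall mean.
weighted <- weighted.mean(
  as.numeric(group_means),
  w = as.numeric(group_sizes[names(group_means)])
)
```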
To explore all the values of the physico-chemical properties and calculate their means, we can use the following procedure. First we look at all the physico-chemical properties individually, then group them by each property ("boiling-point", "melting-point", etc.), and then additionally group those by property type ("experimental" vs. "predicted"). In the grouping, we look at the columns `value`, `unit`, `propertyId`, and `propType`. We also demonstrate how to take the mean of the values for each grouping, using the chemical identifier 'DTXSID1037567', the 25th chemical in CCL4, for this example.

head(ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], ])
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(propType, value, unit),
               by = .(propertyId)]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(value, unit),
               by = .(propertyId, propType)]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(Mean_value = sapply(.SD, mean)),
               by = .(propertyId, unit), .SDcols = c("value")]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(Mean_value = sapply(.SD, mean)),
               by = .(propertyId, unit, propType),
               .SDcols = c("value")][order(propertyId)]
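For readers more familiar with base R, the grouped means above correspond roughly to what `aggregate()` computes. The following is a sketch on made-up records (the values are not real measurements), using the same column names as the API output shown in this vignette:

```r
# Made-up records; values are illustrative only.
props <- data.frame(
  propertyId = c("boiling-point", "boiling-point", "boiling-point", "melting-point"),
  propType   = c("experimental", "predicted", "predicted", "predicted"),
  value      = c(200, 210, 230, 40)
)

# One mean per (propertyId, propType) combination present in the data.
means_by_group <- aggregate(value ~ propertyId + propType, data = props, FUN = mean)
means_by_group
```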
We consider exploring the differences in mean predicted and experimental values for a variety of physico-chemical properties in an effort to understand better the CCL4 and NATADB lists. In particular, we examine "vapor-pressure", "henrys-law", and "boiling-point" and plot the means by chemical for these using boxplots. We then compare the values by grouping by both data set and `propType` value.
Begin by grouping pulled data by DTXSID, and also by DTXSID and property type.
ccl4_vapor_all <- ccl4_phys_chem[propertyId %in% 'vapor-pressure',
                                 .(mean_vapor_pressure = sapply(.SD, mean)),
                                 .SDcols = c('value'), by = .(dtxsid)]
natadb_vapor_all <- natadb_phys_chem[propertyId %in% 'vapor-pressure',
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'), by = .(dtxsid)]
ccl4_vapor_grouped <- ccl4_phys_chem[propertyId %in% 'vapor-pressure',
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'),
                                     by = .(dtxsid, propType)]
natadb_vapor_grouped <- natadb_phys_chem[propertyId %in% 'vapor-pressure',
                                         .(mean_vapor_pressure = sapply(.SD, mean)),
                                         .SDcols = c('value'),
                                         by = .(dtxsid, propType)]
Examine summary statistics.
summary(ccl4_vapor_all)
summary(ccl4_vapor_grouped)
summary(natadb_vapor_all)
summary(natadb_vapor_grouped)
With such a large range of values covering several orders of magnitude, log transform the data. This data from both chemical lists can also be plotted individually and by property type.
ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
natadb_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
natadb_vapor_grouped[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
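R's `log()` computes the natural logarithm by default, so the transformed values above are on the ln scale. The sketch below uses made-up vapor pressures spanning several orders of magnitude to show how the transform compresses the range. `log10()` preserves the same ordering and differs only by a constant factor, which can be easier to read as orders of magnitude.

```r
# Made-up vapor pressures (arbitrary units) spanning eight orders of magnitude.
vp <- c(1e-6, 1e-3, 1, 1e2)

ln_vp    <- log(vp)    # natural log, as used in this vignette
log10_vp <- log10(vp)  # base-10 alternative

range(vp)        # raw scale: dominated by the largest value
range(ln_vp)     # compressed scale
range(log10_vp)  # compressed scale, in powers of ten
```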
ggplot(ccl4_vapor_all, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot() +
  coord_flip()
ggplot(ccl4_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
  geom_boxplot()

ggplot(natadb_vapor_all, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot() +
  coord_flip()
ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
  geom_boxplot()
Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`.

ccl4_vapor_grouped[, set := 'CCL4']
natadb_vapor_grouped[, set := 'NATADB']
all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)
vapor_box <- ggplot(all_vapor_grouped, aes(set, log_transform_mean_vapor_pressure)) +
  geom_boxplot(aes(color = propType))
vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
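The label-then-stack pattern used here can be seen in isolation with two small made-up data frames. This sketch uses base `data.frame` objects and invented identifiers and values; the column name `log_vp` is a shortened stand-in for the longer column name in the vignette.

```r
# Made-up per-chemical means standing in for the two grouped tables.
a <- data.frame(dtxsid = c("A1", "A2"), log_vp = c(-2.0, 1.5))
b <- data.frame(dtxsid = c("B1", "B2", "B3"), log_vp = c(0.5, 3.0, 4.2))

# Label each table with its source list before stacking.
a$set <- "CCL4"
b$set <- "NATADB"

# For data frames, rbind() matches columns by name, so both tables
# need the same set of columns.
combined <- rbind(a, b)
table(combined$set)
```

Keeping the source list as a column, rather than as separate objects, is what lets a single `ggplot()` call color or facet by list.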
Plot the combined data. Boxplots are colored based on the property type, with mean log transformed vapor pressure plotted for each chemical list and property type, or by chemical list alone.
gridExtra::grid.arrange(vapor_box, vapor, ncol=2)
In the box plots above, a general trend indicates that the NATADB chemical list has a higher mean vapor pressure than the CCL4 chemical list.
Henry's Law constant can be explored in a similar fashion. Begin by grouping data by DTXSID, and also by DTXSID and property type.
ccl4_hlc_all <- ccl4_phys_chem[propertyId %in% 'henrys-law',
                               .(mean_hlc = sapply(.SD, mean)),
                               .SDcols = c('value'), by = .(dtxsid)]
natadb_hlc_all <- natadb_phys_chem[propertyId %in% 'henrys-law',
                                   .(mean_hlc = sapply(.SD, mean)),
                                   .SDcols = c('value'), by = .(dtxsid)]
ccl4_hlc_grouped <- ccl4_phys_chem[propertyId %in% 'henrys-law',
                                   .(mean_hlc = sapply(.SD, mean)),
                                   .SDcols = c('value'),
                                   by = .(dtxsid, propType)]
natadb_hlc_grouped <- natadb_phys_chem[propertyId %in% 'henrys-law',
                                       .(mean_hlc = sapply(.SD, mean)),
                                       .SDcols = c('value'),
                                       by = .(dtxsid, propType)]
Examine summary statistics.
summary(ccl4_hlc_all)
summary(ccl4_hlc_grouped)
summary(natadb_hlc_all)
summary(natadb_hlc_grouped)
Again, log transform the data as it is positive and covers several orders of magnitude.
ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]

Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`.

ccl4_hlc_grouped[, set := 'CCL4']
natadb_hlc_grouped[, set := 'NATADB']
all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)
hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) +
  geom_boxplot(aes(color = propType))
hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
gridExtra::grid.arrange(hlc_box, hlc, ncol=2)
Again, both when grouping by `propType` and when aggregating all results together by chemical list, NATADB chemicals generally have a higher mean Henry's Law constant value than CCL4 chemicals.

Boiling point data can be explored in the same way. Begin by grouping data by DTXSID, and also by DTXSID and property type.

ccl4_boiling_all <- ccl4_phys_chem[propertyId %in% 'boiling-point',
                                   .(mean_boiling_point = sapply(.SD, mean)),
                                   .SDcols = c('value'), by = .(dtxsid)]
natadb_boiling_all <- natadb_phys_chem[propertyId %in% 'boiling-point',
                                       .(mean_boiling_point = sapply(.SD, mean)),
                                       .SDcols = c('value'), by = .(dtxsid)]
ccl4_boiling_grouped <- ccl4_phys_chem[propertyId %in% 'boiling-point',
                                       .(mean_boiling_point = sapply(.SD, mean)),
                                       .SDcols = c('value'),
                                       by = .(dtxsid, propType)]
natadb_boiling_grouped <- natadb_phys_chem[propertyId %in% 'boiling-point',
                                           .(mean_boiling_point = sapply(.SD, mean)),
                                           .SDcols = c('value'),
                                           by = .(dtxsid, propType)]
Calculate summary statistics.
summary(ccl4_boiling_all)
summary(ccl4_boiling_grouped)
summary(natadb_boiling_all)
summary(natadb_boiling_grouped)

Since some of the boiling point values are negative, log transformation would produce NaNs and trigger warnings, so the boiling point data are left untransformed.
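The behavior described here is easy to reproduce on a few made-up boiling points. The conversion to kelvin at the end is shown only as one possible workaround and is an assumption of this sketch, not something done in this vignette.

```r
# Made-up boiling points in degrees Celsius; one is below zero.
bp <- c(-42.1, 56.0, 100.0)

# log() of a negative number is undefined: R returns NaN and warns.
log_bp <- suppressWarnings(log(bp))
is.nan(log_bp)

# One possible workaround (an assumption, not used in this vignette):
# convert to kelvin first, which makes every value positive.
log_bp_kelvin <- log(bp + 273.15)
```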
Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`.

ccl4_boiling_grouped[, set := 'CCL4']
natadb_boiling_grouped[, set := 'NATADB']
all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)
boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) +
  geom_boxplot(aes(color = propType))
boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
gridExtra::grid.arrange(boiling_box, boiling, ncol=2)
A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the predicted values for each data set tend to be higher than the experimental. The mean of CCL4, by predicted and experimental appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these different groupings yield slightly different results. This gives us a sense that the picture for boiling point is not as clear cut between experimental and predicted for these two chemical lists as it was in the previous physico-chemical properties investigated.
To summarize the observations, across the various physico-chemical properties for chemicals in these chemical lists, there are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental.
In this vignette, a variety of functions that access different types of data found in the `Chemical` endpoints of the CTX APIs were explored. While this exploration was not exhaustive, it provides a basic introduction to how one may access data and work with it. Additional endpoints and corresponding functions exist and we encourage the user to explore these while keeping in mind the examples contained in this vignette.

# This chunk will be hidden in the final product. It serves to undo defining the
# custom print function to prevent unexpected behavior after this module during
# the final knitting process and restores original option values.
knit_print.data.table = knitr::normal_print
registerS3method(
  "knit_print", "data.table", knit_print.data.table,
  envir = asNamespace("knitr")
)
options(old_options)
end_vignette()