The ToxCast(TM) Analysis Pipeline (tcpl): An R Package for Processing and Modeling Chemical Screening Data (Version 3.0)

library(htmlTable)

Introduction

This vignette provides an overview of the tcpl package, including setup, database structure, and pre-processing. Assay registration, data processing, and data retrieval are covered in separate vignettes.

Overview

The ToxCast pipeline (tcpl) is an R package that manages, curve-fits, plots, and stores ToxCast data to populate its linked MySQL database, invitrodb. The U.S. Environmental Protection Agency (EPA) ToxCast^TM^ program includes in vitro medium- and high-throughput screening assays for the prioritization and hazard characterization of thousands of chemicals of interest. These assays comprise Tier 2-3 of the new Computational Toxicology Blueprint and employ automated chemical screening technologies to evaluate the effects of chemical exposure on living cells and biological macromolecules, such as proteins (Thomas et al., 2019). More information on the ToxCast program can be found at https://www.epa.gov/chemical-research/toxicity-forecasting.

This flexible analysis pipeline is capable of efficiently processing and storing large volumes of data. The diverse data, received in heterogeneous formats from numerous vendors, are transformed to a standard computable format and loaded into the tcpl database by vendor-specific R scripts. Once the data are loaded into the database, ToxCast uses the generalized processing functions provided in this package to process, normalize, model, qualify, and visualize the data.

<font style="font-size:15px"><i>Conceptual overview of the ToxCast Pipeline functionality</i></font>

The original tcplFit() functions performed basic concentration-response curve fitting. Processing with tcpl_v3 and beyond depends on the stand-alone tcplFit2 package to allow a wider variety of concentration-response models when using invitrodb in the 4.0 schema and beyond.^[Using tcpl_v3 with the schema from invitrodb versions 2.0-3.5 will still default to tcplFit() modeling with the constant, Hill, and gain-loss models.] The main set of extensions includes all of the concentration-response models contained in the program BMDExpress: polynomial, exponential, and power functions, in addition to the original Hill, gain-loss, and constant models. Similar to BMDExpress, tcplFit2 curve-fitting uses a defined Benchmark Response (BMR) level to estimate a benchmark dose (BMD), which is the concentration at which the fitted curve intersects the BMR threshold. One final addition is that the hitcall value is now a continuous number ranging from 0 to 1 (in contrast to the binary hitcall values from tcplFit()). While developed primarily for ToxCast, the tcpl package is written to be generally applicable to the chemical-screening community.
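The relationship between a fitted curve, the BMR, and the BMD can be illustrated with a small base R sketch. It assumes a simple three-parameter Hill form and hypothetical parameter values; the function names `hill` and `hill_bmd` are illustrative only and are not part of tcpl or tcplFit2, whose own BMD estimation may differ.

```r
# Illustrative Hill model (assumed form, not the tcplFit2 implementation).
# tp = top asymptote, ga = concentration at half of tp, p = Hill power.
hill <- function(conc, tp, ga, p) tp / (1 + (ga / conc)^p)

# Closed-form BMD for this Hill form: solve hill(bmd) == bmr for bmd.
hill_bmd <- function(bmr, tp, ga, p) {
  stopifnot(bmr > 0, bmr < tp)  # the curve must be able to reach the BMR
  ga * (bmr / (tp - bmr))^(1 / p)
}

# Hypothetical fitted parameters and benchmark response.
tp <- 100; ga <- 10; p <- 1.5
bmr <- 20

bmd_closed <- hill_bmd(bmr, tp, ga, p)

# Numerical cross-check: locate the crossing of curve and BMR with a root finder.
bmd_numeric <- uniroot(function(x) hill(x, tp, ga, p) - bmr,
                       interval = c(1e-6, 1e6), tol = 1e-10)$root
```

For models without a closed-form inverse, the root-finding approach in the last step is the more general way to locate the intersection.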

The tcpl package includes processing functionality for two screening paradigms: (1) single-concentration screening and (2) multiple-concentration screening. Single-concentration screening consists of testing chemicals at one concentration, often for the purpose of identifying potentially active chemicals to test in the multiple-concentration format. Multiple-concentration screening consists of testing chemicals across a concentration range, such that the modeled activity can give an estimate of potency, efficacy, etc.

Prior to the pipeline processing provided in this package, all the data must go through pre-processing (level 0). Level 0 pre-processing utilizes dataset-specific R scripts to process the heterogeneous data into a uniform format and to load the uniform data into the tcpl database. Level 0 pre-processing is outside the scope of this package, but can be done for virtually any high-throughput or high-content chemical screening effort, provided the resulting data includes the minimum required information.
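As a rough illustration of what a level 0 script does, the base R sketch below maps a made-up vendor export onto the level 0 fields described later in this vignette (acid, spid, apid, rowi, coli, wllt, wllq, conc, rval, srcf). The vendor column names, the sample IDs, the file name, and the well type codes ("t" for test wells, "n" for neutral controls) are assumptions made for the example.

```r
# Hypothetical vendor export: one row per well, vendor-specific column names.
vendor <- data.frame(
  Plate   = c("P1", "P1", "P1", "P1"),
  Well    = c("A1", "A2", "B1", "B2"),
  Sample  = c("DMSO", "TX0001", "TX0001", "TX0002"),
  Conc_uM = c(NA, 1, 10, 10),
  Signal  = c(1020, 1850, 3900, 990),
  stringsAsFactors = FALSE
)

# Map the vendor columns onto uniform level 0 style fields.
lvl0 <- data.frame(
  acid = 1L,                                          # hypothetical assay component ID
  spid = vendor$Sample,                               # sample ID
  apid = vendor$Plate,                                # assay plate ID
  rowi = match(substr(vendor$Well, 1, 1), LETTERS),   # row letter -> row index
  coli = as.integer(substring(vendor$Well, 2)),       # column number
  wllt = ifelse(vendor$Sample == "DMSO", "n", "t"),   # assumed well type codes
  wllq = 1L,                                          # all wells assumed good quality
  conc = vendor$Conc_uM,                              # concentration in micromolar
  rval = vendor$Signal,                               # raw readout
  srcf = "vendor_export.csv",                         # hypothetical source file name
  stringsAsFactors = FALSE
)
```

Each vendor format needs its own mapping, but the resulting table always has the same columns, which is what allows the later processing levels to be generic.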

In addition to storing the data, the tcpl database stores every processing and analysis decision at the assay component or assay endpoint level to facilitate transparency and reproducibility. For the illustrative purposes of this vignette and others, we have included a CSV version of the tcpl database containing a small subset of data from the ToxCast program. tcplLite, which relied on flat files structured like invitrodb to produce curve-fitting and summary information such as hitcalls and AC50 values, is no longer supported by tcpl. Functionally, tcplFit2 (available at https://cran.r-project.org/package=tcplfit2) replaces tcplLite, because interested stakeholders can now curve-fit data, make hitcalls, and reproduce curve-fitting results independent of the invitrodb schema. For the ToxCast program, it is still important to use invitrodb when curve-fitting, as invitrodb serves as a data resource for tracking pipelining decisions and provides a dataset for many interested stakeholders. Using tcpl, the user can upload, process, and retrieve data by connecting to a MySQL database. Additionally, past versions of the ToxCast database, containing all the publicly available ToxCast data, are available for download at: https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data.

Package Settings

Users are strongly encouraged to use the data.table package, because tcpl uses data.table for all data frame-like objects. tcpl also depends on two packages, tcplFit2 (installed and loaded as tcplfit2) and plotly, which the user must install and load before loading the tcpl package.

library(data.table)
library(plotly)
library(tcplfit2)
library(tcpl)
## Store the path for the tcpl directory for loading data
pkg_dir <- system.file(package = "tcpl")

Every time the package is loaded in a new R session, a message similar to the following will print showing the default package settings:

tcpl (v1.3) loaded with the following settings:
  TCPL_DB:    C:/Users/user/R-3.4.4/library/tcpl/csv
  TCPL_USER:  NA
  TCPL_HOST:  NA
  TCPL_DRVR:  NA
Default settings stored in TCPL.conf. See ?tcplConf for more information.

The package consists of five settings:

  1. $TCPL_DB points to the tcpl database (either the path to the CSV directory, as in the given example above, or the name of the MySQL database),
  2. $TCPL_USER stores the username for accessing the database,
  3. $TCPL_PASS stores the password for accessing the database,
  4. $TCPL_HOST points to the MySQL server host, and
  5. $TCPL_DRVR indicates which database driver is used ("MySQL"). tcplLite is no longer supported and it is recommended to use tcplFit2 package for stand-alone applications.

Refer to ?tcplConf for more information. At any time, users can check the settings using tcplConfList(). An example of database settings using tcpl would be as follows:

tcplConf(db   = "invitrodb",
         user = "username", 
         pass = "password", 
         host = "localhost",
         drvr = "MySQL")

tcplConfList() lists the current connection information. Note that tcplSetOpts only changes the parameters it is given. The package always loads with the settings stored in the TCPL.config file located within the package directory. The user can edit this file so that the package loads with the desired settings, rather than calling tcplSetOpts in every session. The TCPL.config file must be edited again whenever the package is updated or re-installed.

Database Structure

The following sections contain reference tables that describe the structure and fields of the tcpl-populated database. The first sections describe the data-containing tables, followed by sections describing the additional annotation tables.

In general, the single-concentration data and accompanying methods are found in the "sc#" tables, where the number indicates the processing level. Likewise, the multiple-concentration data and accompanying methods are found in the "mc#" tables. Each processing level that has accompanying methods will also have tables with the "_methods" and "_id" naming scheme. For example, the database contains the following tables: "mc5" storing the data from multiple-concentration level 5 processing, "mc5_methods" storing the available level 5 methods, and "mc5_aeid" storing the method assignments for level 5. Note, the table storing the method assignments for level 2 multiple-concentration processing is called "mc2_acid", because MC2 methods are assigned by assay component ID.
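The naming scheme can be written down as a small helper. This is only a sketch in base R: `level_tables` is a hypothetical function, not part of tcpl, and the caller has to say whether a level's methods are assigned by assay endpoint ID or by assay component ID.

```r
# Hypothetical helper: build the table names associated with a processing level.
# type  = "sc" (single-concentration) or "mc" (multiple-concentration)
# level = processing level number
# id    = key used for method assignment at that level ("aeid" or "acid")
level_tables <- function(type = c("sc", "mc"), level, id = c("aeid", "acid")) {
  type <- match.arg(type)
  id <- match.arg(id)
  base <- paste0(type, level)
  c(data        = base,
    methods     = paste0(base, "_methods"),
    assignments = paste0(base, "_", id))
}

mc5_tables <- level_tables("mc", 5)              # methods assigned by aeid
mc2_tables <- level_tables("mc", 2, id = "acid") # methods assigned by acid
```

The two calls reproduce the names given in the text: "mc5", "mc5_methods", and "mc5_aeid" for level 5, and "mc2_acid" as the assignment table for level 2.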

There are two additional tables, "sc2_agg" and "mc4_agg," that link the data in tables "sc2" and "mc4" to the data in tables "sc1" and "mc3," respectively. This is necessary because each entry in the database before SC2 and MC4 processing represents a single value; subsequent entries represent summary/modeled values that encompass many values. To know what values were used in calculating the summary/modeled values, the user must use the "_agg" look-up tables.
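The role of the "_agg" tables can be sketched with toy data frames in base R. The field names follow the mc3, mc4, and mc4_agg tables described in this vignette, but the values are invented and the real tables contain more columns.

```r
# Toy level 3 data: one row per normalized response value.
mc3 <- data.frame(m3id = 1:6,
                  logc = rep(c(-1, 0, 1), times = 2),
                  resp = c(2, 10, 55, 1, 3, 4))

# Toy look-up table: which level 3 rows were used for each level 4 entry.
mc4_agg <- data.frame(aeid = 10L,
                      m3id = 1:6,
                      m4id = rep(c(101L, 102L), each = 3))

# Toy level 4 data: one row per concentration series.
mc4 <- data.frame(m4id = c(101L, 102L),
                  aeid = 10L,
                  spid = c("TX0001", "TX0002"),
                  stringsAsFactors = FALSE)

# Join summary rows to the individual values through the look-up table.
linked <- merge(merge(mc4, mc4_agg, by = c("m4id", "aeid")), mc3, by = "m3id")

# All level 3 values behind the level 4 entry with m4id 101.
used <- linked[linked$m4id == 101L, c("m4id", "spid", "m3id", "logc", "resp")]
```

The same pattern applies to "sc2_agg", which links "sc2" entries back to the "sc1" values they summarize.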

When using tcpl_v3 with invitrodb schemas v2.0-v3.5, tcplFit model data are stored in the mc4 and mc5 tables in wide format, with a fixed number of columns based on three curve-fitting models (see the documentation associated with tcpl_v2.1). When using tcpl_v3 with invitrodb schemas v4.0 or later, the mc4 and mc5 tables are accompanied by the mc4_param and mc5_param tables, and each pair should be reviewed together: mc4 captures summary values calculated for each concentration series, whereas mc4_param stores the parameters for all models in long format; mc5 stores the winning model and activity hit call, whereas mc5_param stores the parameters of the selected winning (hit) model in long format. These schema changes provide a way to continually expand the modeling capabilities in tcpl.
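To make the long format concrete, the base R sketch below builds a toy table with the mc4_param fields (m4id, aeid, model, model_param, model_val) and pivots it to one row per concentration series. The model names, parameter names, and values are illustrative assumptions, not output from an actual fit.

```r
# Toy long-format parameter table: one row per (series, model, parameter).
mc4_param <- data.frame(
  m4id        = c(1, 1, 1, 1, 2, 2, 2, 2),
  aeid        = 10,
  model       = rep(c("hill", "hill", "hill", "cnst"), 2),
  model_param = rep(c("tp", "ga", "p", "er"), 2),
  model_val   = c(80, 1.5, 1.2, 0.3, 40, 0.7, 2.0, 0.5),
  stringsAsFactors = FALSE
)

# Combine model and parameter into a single key, e.g. "hill_tp".
mc4_param$key <- paste(mc4_param$model, mc4_param$model_param, sep = "_")

# Pivot to wide format: one row per m4id, one column per model parameter.
wide <- reshape(mc4_param[, c("m4id", "aeid", "key", "model_val")],
                idvar = c("m4id", "aeid"), timevar = "key",
                direction = "wide")
names(wide) <- sub("^model_val\\.", "", names(wide))
```

In the long format, adding a new model only adds rows, whereas in the wide format it would require new columns; this is the reason the long layout scales with the number of models.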

Each of the methods tables has fields analogous to $\mathit{mc5\_mthd\_id}$, $\mathit{mc5\_mthd}$, and $\mathit{desc}$. These fields represent the unique key for the method, the abbreviated method name (used to call the method from the corresponding mc5_mthds function), and a brief description of the method, respectively. The method assignment tables have fields analogous to $\mathit{mc5\_mthd\_id}$ matching the method ID from the methods tables, an assay component or assay endpoint ID, and possibly an $\mathit{exec\_ordr}$ field indicating the order in which to execute the methods. The method and method assignment tables are not listed in the tables below to reduce redundancy.
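The interplay between a methods table, an assignment table, and the execution order can be sketched in base R. Everything in this example is hypothetical: the two method names, the functions behind them, and the helper `apply_methods` are stand-ins chosen for illustration and do not correspond to actual tcpl methods.

```r
# Hypothetical method library: abbreviated method name -> function.
mthds <- list(
  shift_baseline = function(x) x - min(x),
  to_percent     = function(x) 100 * x / max(x)
)

# Hypothetical assignment table: methods assigned to one assay endpoint,
# stored out of order on purpose.
assigned <- data.frame(aeid = 10L,
                       mthd = c("to_percent", "shift_baseline"),
                       exec_ordr = c(2L, 1L),
                       stringsAsFactors = FALSE)

# Apply the assigned methods sequentially, in the order given by exec_ordr.
apply_methods <- function(x, assigned, mthds) {
  ordered <- assigned[order(assigned$exec_ordr), ]
  for (m in ordered$mthd) x <- mthds[[m]](x)
  x
}

vals <- c(5, 10, 25)
processed <- apply_methods(vals, assigned, mthds)
```

Because the methods do not commute (shifting then scaling differs from scaling then shifting), an explicit ordering field is needed wherever more than one method can be assigned.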

Many of the tables also include the $\mathit{created\_date}$, $\mathit{modified\_date}$, and $\mathit{modified\_by}$ fields that store helpful information for tracking changes to the data. These fields will not be discussed further or included in the tables below.

Many of the tables specific to the assay annotation are populated semi-manually based on expert curation of information on assay design; these tables of assay annotation are not currently utilized by the tcpl package, but instead act as meta-data for users. The full complexity of the assay annotation used by the ToxCast program is beyond the scope of this vignette and the tcpl package. Additionally, assay description documents for ToxCast assays can be found at: https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data.

Single-concentration Data-containing Tables

Field <- c("s0id", "acid", "spid", "apid", "rowi", "coli", "wllt", "wllq", "conc", "rval", "srcf")
Description <- c("Level 0 ID",
                 "Assay component ID",
                 "Sample ID",
                 "Assay plate ID",
                 "Assay plate row index",
                 "Assay plate column index",
                 "Well type&dagger;",
                 "1 if the well quality was good, else 0",
                 "Concentration in micromolar",
                 "Raw assay component value/readout from vendor",
                 "Filename of the source file containing the data"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 1: Fields in sc0 table.",
        tfoot="&dagger;Information about the different well types is available in Appendix B.")
Field <- c("s1id", "s0id", "acid", "aeid", "logc", "bval", "pval", "resp")
Description <- c("Level 1 ID",
                 "Level 0 ID",
                 "Assay component ID",
                 "Assay component endpoint ID",
                 "Log base 10 concentration",
                 "Baseline value",
                 "Positive control value",
                 "Normalized response value"

                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 2: Fields in sc1 table."
)
Field <- c("aeid", "s0id", "s1id", "s2id")
Description <- c("Assay component endpoint ID",
                 "Level 0 ID",
                 "Level 1 ID",
                 "Level 2 ID"

                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 3: Fields in sc2_agg table."
)
Field <- c("s2id", "aeid", "spid", "bmad", "max_med", "hitc", "coff", "tmpi")
Description <- c("Level 2 ID",
                 "Assay component endpoint ID",
                 "Sample ID",
                 "Baseline median absolute deviation",
                 "Maximum median response value",
                 "Hit-/activity-call: 1 if active, 0 if inactive&dagger;",
                 "Efficacy cutoff value",
                 "Ignore, temporary index used for uploading purposes"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 4: Fields in sc2 table.",
        tfoot = "&dagger; As sc data are not curve-fit, the hitcalling procedure performed at sc2 remains binary (hitc=1 or hitc=0)."
)
Field <- c("s2id", "chid_rep")

Description <- c("Level 2 ID",
                 "Representative sample designation for a tested chemical: 1 if representative sample, else 0"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 5: Fields in sc2_chid."
)
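As a minimal sketch of how the sc2 fields relate to one another, the base R example below derives max_med and a binary hitc from toy sc1-style responses. The cutoff value is invented, and the decision rule (active when max_med is at least coff) is an assumption made for illustration rather than a statement of the exact tcpl logic.

```r
# Toy normalized responses for two samples tested at a single concentration.
sc1 <- data.frame(spid = rep(c("TX0001", "TX0002"), each = 3),
                  resp = c(35, 42, 38, 4, -2, 6),
                  stringsAsFactors = FALSE)

coff <- 20  # hypothetical efficacy cutoff

# With one concentration per sample, the maximum median response is simply
# the median of the replicate responses for that sample.
max_med <- tapply(sc1$resp, sc1$spid, median)

# Assumed rule: a sample is active (hitc = 1) when max_med reaches the cutoff.
sc2 <- data.frame(spid    = names(max_med),
                  max_med = as.numeric(max_med),
                  coff    = coff,
                  hitc    = as.integer(max_med >= coff),
                  stringsAsFactors = FALSE)
```

Because no curve is fit at this level, the result is a yes/no call per sample rather than a continuous value.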

Multiple-concentration Data-containing Tables

The "mc0" table, other than containing $\mathit{m0id}$ rather than $\mathit{s0id}$, is identical to the "sc0" table described in the section above.

Field <- c("m1id", "m0id", "acid", "cndx", "repi")
Description <- c("Level 1 ID",
                 "Level 0 ID",
                 "Assay component ID",
                 "Concentration index",
                 "Replicate index"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 6: Fields in mc1 table."
)
Field <- c("m2id", "m0id", "acid", "m1id", "cval")
Description <- c("Level 2 ID",
                 "Level 0 ID",
                 "Assay component ID",
                 "Level 1 ID",
                 "Corrected value"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 7: Fields in mc2 table."
)
Field <- c("m3id", "aeid", "m0id", "acid", "m1id", "m2id", "bval", "pval", "logc", "resp")
Description <- c("Level 3 ID",
                 "Assay endpoint ID",
                 "Level 0 ID",
                 "Assay component ID",
                 "Level 1 ID",
                 "Level 2 ID",
                 "Baseline value",
                 "Positive control value",
                 "Log base 10 concentration",
                 "Normalized response value"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 8: Fields in mc3 table."
)
Field <- c("aeid", "m0id", "m1id", "m2id", "m3id", "m4id")
Description <- c("Assay endpoint ID",
                 "Level 0 ID",
                 "Level 1 ID",
                 "Level 2 ID",
                 "Level 3 ID",
                 "Level 4 ID"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 9: Fields in mc4_agg table."
)
Field <- c("m4id", "aeid", "spid", "bmad", "resp_max", "resp_min", "max_mean", "max_mean_conc", "max_med", "max_med_conc", "logc_max", "logc_min", 
           "nconc", "npts", "nrep", "nmed_gtbl", "tmpi")


Description <- c("Level 4 ID",
                 "Assay endpoint ID",
                 "Sample ID",
                 "Baseline median absolute deviation",
                 "Maximum response value",
                 "Minimum response value",
                 "Maximum mean response value",
                 "Log concentration at *max_mean*",
                 "Maximum median response value",
                 "Log concentration at *max_med*",
                 "Maximum log concentration tested",
                 "Minimum log concentration tested",
                 "Number of concentrations tested ",
                 "Number of points in the concentration series",
                 "Number of replicates in the concentration series",
                 "Number of median values greater than *3bmad*",
                 "Ignore, temporary index used for uploading purposes"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 10: Fields in mc4 table."
)
Field <- c("m4id", "aeid", "model", "model_param", "model_val")

Description <- c("Level 4 ID",
                 "Assay endpoint ID",
                 "Model that was fit",
                 "Key for the parameter that was fit with the corresponding model",
                 "Value for the associated key in the corresponding model"
                )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 11: Fields in mc4_param table."
)
Field <- c("m5id", "m4id", "aeid", "modl", "hitc", "fitc", "coff", "actp", "model_type")


Description <- c("Level 5 ID",
                 "Level 4 ID",
                 "Assay endpoint ID",
                 "Winning model",
                 "Hit-/activity-call, generally a continuous value from 0 to 1 if using *tcplFit2* fitting&dagger;",
                 "Fit category",
                 "Efficacy cutoff value",
                 "Activity probability (1 - *const_prob*); not used with *tcplFit2*",
                 "Model type placeholder for use when number of fitting methodologies increases"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 12: Fields in mc5 table.",
        tfoot = "&dagger; The continuous hitcalls produced by *tcplFit2* curve-fitting are described in more detail in the *tcplFit2* package documentation and in Sheffield et al. 2021 (https://doi.org/10.1093/bioinformatics/btab779)."
)
Field <- c("m5id", "aeid", "hit_param", "hit_val")

Description <- c("Level 5 ID",
                 "Assay endpoint ID",
                 "Key for the parameter that was fit with winning model",
                 "Value for the associated key in the winning model"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 13: Fields in mc5_param table."
)
Field <- c("m5id", "chid_rep")

Description <- c("Level 5 ID",
                 "Representative sample designation for a tested chemical: 1 if representative sample, else 0"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 14: Fields in mc5_chid." )
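Several of the mc4 summary fields can be reproduced from a toy concentration series, which may help when reading Table 10. The base R sketch below uses invented responses and an invented bmad; it computes only a subset of the fields, and the exact definitions used by tcpl may differ in detail.

```r
# Toy mc3-style series: four concentrations (log10 micromolar), two replicates each.
mc3 <- data.frame(logc = rep(c(-1, 0, 1, 2), each = 2),
                  resp = c(1, 3, 4, 8, 30, 34, 60, 70))

bmad <- 5  # hypothetical baseline median absolute deviation

# Per-concentration medians and means of the normalized response.
med_by_conc  <- tapply(mc3$resp, mc3$logc, median)
mean_by_conc <- tapply(mc3$resp, mc3$logc, mean)

# Assemble a subset of the mc4 summary fields for this series.
mc4_row <- data.frame(
  bmad          = bmad,
  resp_max      = max(mc3$resp),
  resp_min      = min(mc3$resp),
  max_mean      = max(mean_by_conc),
  max_mean_conc = as.numeric(names(mean_by_conc)[which.max(mean_by_conc)]),
  max_med       = max(med_by_conc),
  max_med_conc  = as.numeric(names(med_by_conc)[which.max(med_by_conc)]),
  logc_max      = max(mc3$logc),
  logc_min      = min(mc3$logc),
  nconc         = length(unique(mc3$logc)),
  npts          = nrow(mc3),
  nmed_gtbl     = sum(med_by_conc > 3 * bmad)
)
```

Here two of the four per-concentration medians exceed three times the assumed bmad, so nmed_gtbl is 2.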

Assay and Auxiliary Annotation Tables

The definition of an "assay" is, for the purposes of this package, broken into:

  1. assay_source: the vendor/origination of the data,
  2. assay: the procedure to generate the component data,
  3. assay_component: the raw data readout(s), and
  4. assay_component_endpoint: the normalized component data.

Each assay element is represented by a separate table in the tcpl database. In general, we refer to an "assay_component_endpoint" as an "assay endpoint." As we move down the hierarchy, each additional layer may have a one-to-many relationship with the previous layer. For example, an assay source can include multiple assays. Given bidirectional fitting in tcpl_v3, a single assay endpoint may in many cases be derived from each component, because the "up" and "dn" curve-fitting directions are no longer separated into different assay endpoints.

All processing occurs by assay component or assay endpoint, depending on the processing type (single-concentration or multiple-concentration) and level. No data are stored at the assay or assay source level. The "assay" and "assay_source" tables store annotations to help in the processing and downstream understanding of the data.

Throughout the package, the levels of assay hierarchy are defined and referenced by their primary keys (IDs) in the tcpl database: $\mathit{asid}$ (assay source ID), $\mathit{aid}$ (assay ID), $\mathit{acid}$ (assay component ID), and $\mathit{aeid}$ (assay endpoint ID). In addition, the package abbreviates the fields for the assay hierarchy names. The abbreviations mirror the abbreviations for the IDs with "nm" in place of "id" in the abbreviations, e.g. assay_component_name is abbreviated $\mathit{acnm}$.
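The hierarchy and its keys can be sketched with toy tables in base R. The IDs and names below are invented, and the tables are reduced to the key and abbreviated name columns only; the actual annotation tables store the full field names and many more fields.

```r
# Toy annotation tables, one per hierarchy level, linked by their ID fields.
assay_source <- data.frame(asid = 1L, asnm = "SRC", stringsAsFactors = FALSE)
assay <- data.frame(aid = 1:2, asid = 1L,
                    anm = c("SRC_A", "SRC_B"), stringsAsFactors = FALSE)
assay_component <- data.frame(acid = 1:3, aid = c(1L, 1L, 2L),
                              acnm = c("SRC_A_c1", "SRC_A_c2", "SRC_B_c1"),
                              stringsAsFactors = FALSE)
assay_component_endpoint <- data.frame(aeid = 1:3, acid = 1:3,
                                       aenm = c("SRC_A_c1_e", "SRC_A_c2_e", "SRC_B_c1_e"),
                                       stringsAsFactors = FALSE)

# Walk down the hierarchy: each merge joins on the shared ID column
# (asid, then aid, then acid).
hier <- Reduce(function(x, y) merge(x, y),
               list(assay_source, assay, assay_component, assay_component_endpoint))

# The name abbreviations mirror the ID abbreviations, with "nm" replacing "id".
ids <- c("asid", "aid", "acid", "aeid")
nms <- sub("id$", "nm", ids)
```

In this toy example, one assay source has two assays, the first assay has two components, and each component has a single endpoint, which gives three rows in the joined table.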

A full description of the assay annotation is beyond the scope of this vignette. The fields pertinent to the tcpl package are listed in the tables below.

Field <- c("assay", "assay_component", "assay_component_endpoint", "assay_component_map", "assay_reagent**", "assay_reference**", "assay_source", "chemical", "chemical_library", "citations**", "gene**", "intended_target**", "organism**", "sample")

Description <- c("Assay-level annotation",
                 "Assay component-level annotation",
                 "Assay endpoint-level annotation",
                 "Assay component source names and their corresponding assay component ids",
                 "Assay reagent information",
                 "Map of citations to assay",
                 "Assay source-level annotation",
                 "List of chemicals and associated identifiers",
                 "Map of chemicals to different chemical libraries",
                 "List of citations",
                 "Gene identifiers and descriptions",
                 "Intended assay target at the assay endpoint level",
                 "Organism identifiers and descriptions",
                 "Sample ID information and chemical ID mapping")

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        caption="Table 15: List of annotation tables.",
        tfoot = "** indicates tables not currently used by the *tcpl* package",
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em '
)
Field <- c("aid", "asid&dagger;", "assay_name&dagger;", "assay_desc", "timepoint_hr", 
            "organism_id", "organism",'tissue',"cell_format",
            'cell_free_component_source',
            'cell_short_name', 
            'cell_growth_mode',
            "assay_footprint&dagger;", 
            "assay_format_type" ,
            "assay_format_type_sub" ,
            "content_readout_type",  
            "dilution_solvent" , 
            "dilution_solvent_percent_max")

Description <- c("Assay ID",
                 "Assay source ID",
                 "Assay name (abbreviated \"anm\" within the package)",
                 "Assay description",
                 "Treatment duration in hours",
                 "NCBI taxonomic identifier, available here <https://www.ncbi.nlm.nih.gov/taxonomy>",
                "Organism of origin",
                "Tissue of origin", "Description of cell format",
                "Description of source for targeted cell-free components",
                "Abbreviation of cell line",
                "Cell growth modality", 
                "Microtiter plate size",
                "General description of assay format",
                "Specific description of assay format" ,
                "Description of well characteristics being measured", 
                "Solvent used in sample dilution",
                "Maximum percent of dilution solvent used, from 0 to 1.")

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        caption="Table 16: Fields in assay.",
        tfoot = "&dagger; Required fields for registration",
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em '
)
Field <- c("acid", "aid&dagger;", "assay_component_name&dagger;", "assay_component_desc", "assay_component_target_desc", "parameter_readout_type","assay_design_type", "assay_design_type_sub", "biological_process_target", "detection_technology_type", "detection_technology_type_sub", "detection_technology", "key_assay_reagent_type", "key_assay_reagent", "technological_target_type", "technological_target_type_sub")

Description <- c("Assay component ID",
                 "Assay ID",
                 "Assay component name (abbreviated \"acnm\" within the package)",
                 "Assay component description", 
                 "Assay component target description. Generally includes information about mechanism of action with assay target, how disruption is detected, or significance of target disruption.",
                 "Description of parameters measured", 
                 "General description of how a biological or physical process is translated into a detectable signal by the assay mechanism",
                 "Specific description of the method through which a biological or physical process is translated into a detectable signal",
                 "General biological process being chemically disrupted",
                 "General description of assay platform or detection signals measured",
                 "Description of signals measured in assay platform",
                 "Specific description of assay platform used",
                 "Type of critical reactant being measured",
                 "Critical reactant measured",
                 "General description of technological target measured in assay platform",
                 "Specific description of technological target measured in assay platform"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        tfoot = "&dagger; Required fields for registration",
        caption="Table 17: Fields in assay_component."
)
Field <- c("asid&dagger;", "assay_source_name&dagger;", "assay_source_long_name", "assay_source_desc")

Description <- c("Assay source ID",
                 "Assay source name (typically an abbreviation of the assay_source_long_name, abbreviated \"asnm\" within the package)",
                 "Full assay source name", 
                 "Assay source description"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        tfoot = "&dagger; Required fields for registration",
        caption="Table 18: Fields in assay_source."
)
Field <- c("aeid", "acid", "assay_component_endpoint_name", "assay_component_endpoint_desc", "assay_function_type", "normalized_data_type&dagger;", "burst_assay&dagger;", "key_positive_control", "signal_direction", "intended_target_type", "intended_target_type_sub", "intended_target_family", "intended_target_family_sub", "fit_all&dagger;", "cell_viability_assay")

Description <- c("Assay component endpoint ID",
                 "Assay component ID",
                 "Assay component endpoint name (abbreviated \"aenm\" within the package)", 
                 "Assay component endpoint description",
                 "Description of targeted mechanism and the purpose of the analyzed readout in relation to others from the same assay",
                 "Normalization approach with which the data are displayed",
                 "Indicator if endpoint is included in the burst distribution (1) or not (0); Burst phenomenon can describe confounding activity, such as cytotoxicity due to non-specific activation of many targets at certain concentrations", 
                 "Tested chemical sample expected to produce activity; Used to assess assay validity",
                 "Directionality of raw data signals from assay (gain or loss); Defines analysis direction",
                 "General group of intended targets measured",
                 "Specific subgroup of intended targets measured", 
                 "Family of intended target measured; Populated on ToxCast chemical activity plot within CompTox dashboard",
                 "Specific subfamily of intended target measured",
                 "Indicator if all results should be fit, regardless of whether max_med surpasses 3bmad cutoff (1) or not (0)",
                 "Indicator of whether cytotoxicity is a confounding factor (1) or has no impact (0)" )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 19: Fields in assay_component_endpoint.",
        tfoot = "&dagger; Required fields for registration"
)
Field <- c("acid", "acsn")

Description <- c("Assay component ID",
                 "Assay component source name"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 20: Fields in assay_component_map table."
)
Field <- c("chid", "casn", "chnm", "dsstox_substance_id")

Description <- c("Chemical ID&dagger;",
                 "CAS Registry Number",
                 "Chemical name",
                 "Unique identifier from U.S. EPA Distributed Structure-Searchable Toxicity (DSSTox) Database"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 21: Fields in chemical table." ,
        tfoot = "&dagger; This is the DSSTox GSID within the ToxCast data, but can be any integer and will be auto-generated (if not explicitly defined) for newly registered chemicals"
)
Field <- c("chid", "clib")

Description <- c("Chemical ID",
                 "Chemical library"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 22: Fields in chemical_library table." 

)
Field <- c("spid", "chid", "stkc", "stkc_unit", "tested_conc_unit")

Description <- c("Sample ID",
                 "Chemical ID",
                 "Stock concentration" ,
                 "Stock concentration unit",
                 "The concentration unit for the concentration values in the data-containing tables"
                 )

output <- 
  data.frame(Field, Description)

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 23: Fields in sample table." 
)

The stock concentration fields in the "sample" table allow the user to track the original concentration at which the neat sample was solubilized in vehicle, before any serial dilutions for testing. The US EPA's ChemTrack application and database support chemical procurement and sample management for ToxCast in vitro screening efforts.

Level 0 Pre-processing

Level 0 pre-processing can be applied to data from virtually any high-throughput/high-content screening application. In the ToxCast program, level 0 processing is done in R by vendor/dataset-specific scripts. The individual R scripts act as the "laboratory notebook" for the data, with all pre-processing decisions clearly commented and explained. The standard level 0 format for entering the pipeline is identical for both testing paradigms, single-concentration (sc) and multi-concentration (mc).

Level 0 pre-processing must reformat the raw data into the standard format for the pipeline, and may also apply manual transformations to the data as pre-normalization steps. All manual transformations should be thoroughly documented and justified. Common examples of manual transformations include fixing a sample ID typo, or changing well quality value(s) to 0 after finding obvious problems such as a plate row/column missing an assay reagent.

Each row in the level 0 pre-processing data represents one well-assay component combination and contains 11 fields. The only field in level 0 pre-processing not stored at level 0 is the assay component source name ($\mathit{acsn}$). The assay component source name should be some concatenation of data from the assay source file that uniquely identifies each assay component. When the data are loaded into the database, the assay component source name is mapped to an assay component ID through the assay_component_map table in the tcpl database. An assay component can have multiple assay component source names, but each assay component source name can map to only a single assay component.

Field <- c("acsn", "spid", "cpid", "apid", "rowi", 
           "coli", "wllt", "wllq", "conc", "rval", "srcf")

Description <- c("Assay component source name",
                 "Sample ID",
                 "Chemical plate ID" ,
                 "Assay plate ID",
                 "Assay plate row index, as an integer",
                 "Assay plate column index, as an integer",
                 "Well type",
                 "1 if the well quality was acceptable, else 0",
                 "Concentration in micromolar",
                 "Raw assay component value or readout from vendor",
                 "Filename of the source file containing the data"
                 )
`N/A` <- c("No", "No", "Yes", "Yes","Yes","Yes", "No", "No", "No&dagger;", "Yes&ddagger;", "No")

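# check.names = FALSE keeps the "N/A" header instead of converting it to "N.A"
output <- 
  data.frame(Field, Description, `N/A`, check.names = FALSE)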

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',

          caption="Table 24: Required fields in level 0 pre-processing." ,
          tfoot = "The N/A column indicates whether the field can be N/A in the pre-processed data. 
 &dagger; In past versions of *tcpl*, there were some exceptions where concentrations could be N/A. For *tcpl_v3*, conc values must be numeric for processing since N/A values will result in processing error.  
 &ddagger;If the raw value is N/A, well quality must be 0."
)
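
As a minimal sketch of the standard format, the chunk below builds a small level 0 table with the 11 required fields in base R. All values, sample IDs, and file names are invented for illustration, and the control concentration is a placeholder rather than a recommended value.

```r
# Hypothetical level 0 records for one assay plate; every value is made up.
lvl0 <- data.frame(
  acsn = "VENDOR_assay_readout1",            # assay component source name
  spid = c("SAMPLE_01", "SAMPLE_01", "CTRL_01", "SAMPLE_02"),
  cpid = "CP001",
  apid = "AP001",
  rowi = c(1L, 1L, 2L, 2L),                  # row index as integer
  coli = c(1L, 2L, 1L, 2L),                  # column index as integer
  wllt = c("t", "t", "p", "t"),              # test wells plus one control
  wllq = c(1L, 1L, 1L, 1L),
  conc = c(1, 10, 5, 10),                    # micromolar, numeric for every row
  rval = c(105.2, 340.8, 512.0, NA),         # last well has no raw value
  srcf = "plate_AP001.csv",
  stringsAsFactors = FALSE
)

# Rule from the table footer: a missing raw value requires well quality 0.
lvl0$wllq[is.na(lvl0$rval)] <- 0L

# Confirm the field set and that no concentration is missing.
required <- c("acsn", "spid", "cpid", "apid", "rowi", "coli",
              "wllt", "wllq", "conc", "rval", "srcf")
stopifnot(identical(names(lvl0), required), !anyNA(lvl0$conc))
```

Keeping these checks in the vendor-specific script records the pre-processing decisions alongside the data they affect.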

The well type field is used during processing to differentiate controls from test compounds in numerous applications, including normalization and definition of the assay noise level. Currently, the tcpl package includes the eight well types in Table 25. Package users are encouraged to suggest new well types and methods to better accommodate their data.

`Well Type` <- c("t", "c", "p", "n", "m",  "o", "b", "v")

Description <- c("Test compound",
                 "Gain-of-signal control in multiple concentrations",
                 "Gain-of-signal control in single concentration" ,
                 "Neutral/negative control",
                 "Loss-of-signal control in multiple concentrations",
                 "Loss-of-signal control in single concentration",
                 "Blank well",
                 "Viability control"
                 )


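# check.names = FALSE keeps the "Well Type" header instead of converting it to "Well.Type"
output <- 
  data.frame(`Well Type`, Description, check.names = FALSE)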

htmlTable(output,
        align = 'l',
        align.header = 'l',
        rnames = FALSE  ,
        css.cell =  ' padding-bottom: 5px;  vertical-align:top; padding-right: 10px;min-width: 5em ',
        caption="Table 25: Well types" 
)

The final step in level 0 pre-processing is loading the data into the tcpl database. The tcpl package includes the tcplWriteLvl0 function for this purpose. The tcplWriteLvl0 function maps the assay component source name to the appropriate assay component ID, checks each field for the correct class, and checks the database for the sample IDs with well type "t". Each test compound sample ID must be registered in the tcpl database before loading data. The tcplWriteLvl0 function also checks that each test compound has concentration values.
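
The checks described above can be approximated before attempting a load. The sketch below is not the tcplWriteLvl0 implementation: `check_lvl0`, `acid_map`, and `sample_tbl` are hypothetical names standing in for the function's checks and for the assay_component_map and sample tables. It only illustrates the mapping and validation logic on a reduced set of fields.

```r
# Hypothetical stand-ins for the assay_component_map and sample tables.
acid_map <- data.frame(acid = c(1L, 1L, 2L),
                       acsn = c("readout1_v1", "readout1_v2", "readout2"),
                       stringsAsFactors = FALSE)
sample_tbl <- data.frame(spid = c("SAMPLE_01", "SAMPLE_02"),
                         stringsAsFactors = FALSE)

# Reduced level 0 data (only the fields needed for these checks).
lvl0 <- data.frame(acsn = c("readout1_v1", "readout1_v2", "readout2"),
                   spid = c("SAMPLE_01", "SAMPLE_02", "SAMPLE_01"),
                   wllt = "t",
                   conc = c(1, 10, 100),
                   stringsAsFactors = FALSE)

# Hypothetical helper, not part of tcpl.
check_lvl0 <- function(dat, acid_map, sample_tbl) {
  # Each source name may map to only one assay component.
  stopifnot(!anyDuplicated(acid_map$acsn))
  # Map source names to assay component IDs.
  dat$acid <- acid_map$acid[match(dat$acsn, acid_map$acsn)]
  if (anyNA(dat$acid)) stop("acsn not found in the map")
  # Test compound sample IDs must already be registered.
  test <- dat$wllt == "t"
  if (!all(dat$spid[test] %in% sample_tbl$spid)) stop("unregistered spid")
  # Test compounds need numeric, non-missing concentrations.
  if (!is.numeric(dat$conc) || anyNA(dat$conc[test])) {
    stop("test wells need numeric conc values")
  }
  dat
}

checked <- check_lvl0(lvl0, acid_map, sample_tbl)
```

With a configured database connection, the actual load is performed by tcplWriteLvl0; see ?tcplWriteLvl0 for its arguments.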

Appendix

A: Cytotoxicity Distribution

Recognizing the substantial impact of cytotoxicity in confounding high-throughput and high-content screening results, the tcpl package includes methodology for defining chemical-specific cytotoxicity estimates. Our observations based on ToxCast data suggest a complex and not yet fully understood cellular biology that includes non-specific activation of many targets as cells approach death. For example, a chemical may induce activity in an estrogen-related assay, but if that chemical also causes activity in hundreds of other assays at or around the same concentration as cytotoxicity, should the chemical be called an estrogen agonist? The tcplCytoPt function provides an estimate of chemical-specific cytotoxicity points to provide some context to the "burst" phenomenon.

The cytotoxicity point is simply the median AC$_{50}$ for a set of assay endpoints, either given by the user or defined within the tcpl database. By default, the tcplCytoPt function uses the assay endpoints listed in the $\mathit{burst\_assay}$ field of the "assay_component_endpoint" table, where 1 indicates including the assay endpoint in the calculation. The "burst" assay endpoints can be identified by running tcplLoadAeid(fld = "burst_assay", val = 1) .

In addition to the cytotoxicity point, tcplCytoPt provides two additional estimates: (1) the median absolute deviation (MAD) of the AC$_{50}$ ($\mathit{modl\_ga}$) values used to calculate the cytotoxicity point, and (2) the global MAD. Note that only active assay endpoints (where the hit call, $\mathit{hitc}$, equals $1$) are included in the calculations. Once the burst distribution (cytotoxicity point and MAD) is defined for each chemical, the global burst MAD is defined as the median of the MAD values. Not every chemical is tested in every "burst" assay, so the user can set the minimum number of tested assays required for a chemical's MAD value to be included in the global MAD calculation. By default, if "aeid" is the vector of assay endpoints used in the calculation, tcplCytoPt requires the chemical to be tested in at least floor(0.8 * length(aeid)) assay endpoints to be included in the calculation. The user can choose to include all calculated MAD values (note that there must be at least two active assay endpoints to calculate the MAD) by setting 'min.test' to FALSE . The 'min.test' parameter also accepts a number, allowing the user to set the requirement explicitly.

The global MAD gives an estimate of the overall cytotoxicity window, and allows a cytotoxicity distribution to be determined for chemicals with fewer than two active "burst" assay endpoints. The cytotoxicity point for chemicals with fewer than two active "burst" endpoints is set to the value given to the 'default.pt' parameter. By default, tcplCytoPt assigns 'default.pt' to 3.^[$10^3 = 1000$, therefore, when using micromolar units, $3$ is equivalent to $1$ millimolar. $1$ millimolar was chosen as an arbitrary high concentration (outside the testing range for ToxCast data), based on the principle that all compounds are toxic if given in high enough concentration.]
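
The calculation described in this appendix can be illustrated with a small base R example. This is a sketch on invented data, not the tcplCytoPt source: the object names are hypothetical, and the use of stats::mad with its default scaling constant is an assumption about how the MAD is scaled.

```r
# Made-up log10(AC50) values (modl_ga, log10 micromolar) for ACTIVE burst
# endpoints only; chemical 3 has a single active endpoint.
burst <- data.frame(
  chid    = c(1, 1, 1, 1, 2, 2, 2, 3),
  modl_ga = c(1.0, 1.2, 1.4, 1.6, 0.5, 1.5, 2.5, 2.0)
)
n_tested   <- c("1" = 10, "2" = 9, "3" = 4)  # burst endpoints tested per chemical
n_burst    <- 10                             # plays the role of length(aeid)
min_test   <- floor(0.8 * n_burst)           # default requirement described above
default_pt <- 3                              # log10(1000 micromolar)

by_chem <- split(burst$modl_ga, burst$chid)
nhit    <- lengths(by_chem)
med     <- vapply(by_chem, median, numeric(1))
# Assumption: stats::mad() with its default scaling constant (1.4826).
mad_val <- vapply(by_chem, mad, numeric(1))
mad_val[nhit < 2] <- NA                      # MAD needs at least two actives

# Global MAD: median MAD over chemicals tested in enough burst endpoints.
enough     <- n_tested[names(mad_val)] >= min_test
global_mad <- median(mad_val[enough], na.rm = TRUE)

# Chemicals with fewer than two actives receive the default point.
cyto_pt <- ifelse(nhit >= 2, med, default_pt)
```

In this toy example, chemicals 1 and 2 contribute to the global MAD, while chemical 3 is assigned the default cytotoxicity point of 3.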

B: Build Variable Matrices

The tcplVarMat function creates chemical-by-assay matrices for the level 4 and level 5 data. When multiple sample-assay series exist for one chemical, a single series is selected by the tcplSubsetChid function. See ?tcplSubsetChid for more information. The standard matrices are:

  1. "modl_ga" -- The $\log_{10}\mathit{AC_{50}}$ (in the gain direction) for the winning model.
  2. "hitc" -- The hit call for the winning model.
  3. "m4id" -- The m4id, listing the concentration series selected by tcplSubsetChid .
  4. "zscore" -- The z-score (described below).
  5. "tested_sc" -- $1$ or $0$, $1$ indicating the chemical/assay pair was tested in the single-concentration format.
  6. "tested_mc" -- $1$ or $0$, $1$ indicating the chemical/assay pair was tested in the multiple-concentration format.
  7. "ac50" -- A modified AC$_{50}$ table (in non-log units) where assay/chemical pairs that were not tested, or were tested and had a hit call of $0$ or $-1$, have the value $1e6$.
  8. "neglogac50" -- $-\log_{10}\frac{\mathit{AC_{50}}}{1e6}$, where assay/chemical pairs that were not tested, or were tested and had a hit call of $0$ or $-1$, have the value $0$.
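
To make the last two matrices concrete, the sketch below derives "ac50" and "neglogac50" from toy "modl_ga" and "hitc" matrices in base R. The data and dimension names are invented, untested pairs are represented as NA, and hit calls are treated as binary (active when equal to 1) for simplicity; this is not the tcplVarMat implementation.

```r
# Toy chemical-by-assay matrices; NA marks a pair that was not tested.
modl_ga <- matrix(c(0, NA, 2, 1), nrow = 2,
                  dimnames = list(c("chem1", "chem2"), c("aeid1", "aeid2")))
hitc <- matrix(c(1, NA, 0, 1), nrow = 2, dimnames = dimnames(modl_ga))

# Active pairs are those that were tested and have a hit call of 1.
active <- !is.na(hitc) & hitc == 1

# Non-log AC50, with untested or inactive pairs set to 1e6.
ac50 <- ifelse(active, 10^modl_ga, 1e6)

# -log10(AC50 / 1e6): untested or inactive pairs become 0.
neglogac50 <- -log10(ac50 / 1e6)
```

Because inactive and untested pairs map to 0, larger "neglogac50" values correspond to more potent chemical/assay pairs.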

The z-score calculation is based on the output from tcplCytoPt (Appendix A), and is calculated for each AC$_{50}$ value as follows: $$ \mathit{z\text{-}score} = -\frac{\mathit{modl\_ga} - \mathit{cyto\_pt}}{\mathit{global\_mad}}\mathrm{.} $$ Note: the burst z-score values are multiplied by -1 so that values that are more potent relative to the burst distribution have a higher positive z-score.
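
As a short worked example of this formula, the chunk below computes z-scores for three invented $\log_{10}\mathit{AC_{50}}$ values against an invented cytotoxicity point and global MAD. None of the numbers come from ToxCast data.

```r
# Illustrative values only.
modl_ga    <- c(0.3, 1.3, 2.3)  # log10(AC50) for three endpoints
cyto_pt    <- 1.3               # cytotoxicity point for the chemical
global_mad <- 0.5               # global MAD

# Sign flip: potencies below the cytotoxicity point give positive z-scores.
zscore <- -(modl_ga - cyto_pt) / global_mad
```

Here the endpoint that is one log unit more potent than the cytotoxicity point has a z-score of about 2, the endpoint at the cytotoxicity point has a z-score of 0, and the less potent endpoint has a z-score of about -2.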

Additional matrices can be defined with the 'add.vars' parameter of the tcplVarMat function. The 'add.vars' parameter accepts any level 4 or level 5 field and creates the respective matrix.




tcpl documentation built on Oct. 7, 2023, 1:06 a.m.