Introduction to *In Vitro-In Vivo* Extrapolation (IVIVE) with R Package httk"

Please send questions to


Chemical risk assessment depends on knowledge of inherent chemical hazard, the dose-response relationship, and the extent of exposure that occurs (NASEM 2017). High throughput screening (HTS) for in vitro bioactivity allows characterization of potential hazard for thousands of chemicals for which no other testing has occurred (Judson et al., 2009).

Toxicokinetics (TK) describes the Absorption, Distribution, Metabolism, and Excretion (ADME) of a chemical by the body. TK relates external exposures to internal tissue concentrations of chemical and therefore informs the chemical dose-response relationship (Coecke et al., 2013). TK requires chemical-specific information that is traditionally obtained in vivo. However, most chemicals do not have TK data, so a new approach methodology (NAM) uses in vitro TK methods adapted from the pharmaceutical industry to provide the necessary chemical-specific information (Wambaugh et al., 2019, Chang et al., 2022). High(er) throughput toxicokinetics (HTTK) involves the combination of chemical-specific in vitro TK data and chemical-independent ("generic") TK models to predict TK (Breen et al., 2021).

The primary goal of HTTK is to provide a human dose context for bioactive in vitro concentrations from HTS (that is, in vitro-in vivo extrapolation, or IVIVE) (Wetmore, 2015). IVIVE is Utilization of in vitro experimental data to predict phenomena in vivo. In pharmaceutical development, HTTK methods allow IVIVE to estimate therapeutic doses for clinical studies – predicted concentrations are typically on the order of values measured in clinical trials (Wang, 2010). High throughput screening + IVIVE can predict a daily chemical dose rate (mg/kg bw/day) that might be adverse (Paul Friedman et al., 2019).

A secondary goal is to provide open source data and models for evaluation and use by the scientific community (Pearce et al., 2017).

The following material is from a lecture taught as part of the course "Computational Methods in Chemical Risk Assessment" at Rutgers University in September, 2019.

Prepare R to run the vignette

R package knitr generates html and PDF documents from this RMarkdown file, Each bit of code that follows is known as a "chunk". We start by telling knitr how we want our chunks to look.

knitr::opts_chunk$set(collapse = TRUE, comment = '#>')
options(rmarkdown.html_vignette.check_title = FALSE)

Clear the memory

It is a bad idea to let variables and other information from previous R sessions float around, so we first remove everything in the R memory.


HTTK Version

This vignette was created using httk v2.2.2 in 2022. Although we attempt to maintain backward compatibility, if you encounter issues with the latest release of httk and cannot easily address the changes, historical versions of httk are available from:


If you are familiar with computer files and R you might want to skip directly to "Step 1: Loading In Vitro Bioactivity Data".

Files and Folders

R (or any computer program) needs to know where to look to find particular information (such as data generated by a panel of in vitro assays). We describe a chunk of information as a "file". Files range in size and can be small (even 0 bytes) or very large (several GB). A file my physically be located on a disk drive inside your computing device or, increasingly, it might be located somewhere in a cloud storage network. Regardless, most computers will treat a file as if it has a specific location. Each file has a name, although multiple files can have the same name as long as those files are stored in different "folders".

Files for all the operating systems with R are organized using a folder (also called a "directory") and sub-folder structure. A sub-folder is a folder within another folder. They can be well organized:


Or not:


The sequence of folders and sub-folders leading to a particular file is referred to as a "path". For information on how folder structures work, please check out these sites:

Within the command shell you use the "cd" command to "change directories".

One complication to watch out for is the direction of the symbol that separates folder names in the path. On Windows the path is separated by "back-slashes": \


On Unix-based operating systems (including Linux and MacOS) the path is separated by "forward slashes": /


R always uses Unix forward slashes, even on a Windows machine, so if you know that the path is:


You need to tell R that you want:


Finally, the bit at the front of the path (such as "C:\" or "L:\" tells the computer which "drive" to look in to find your folders. Often "C drive" (c:\) is physically inside your machine while other letters might indicate network (or cloud) drives.

Home Directory

Whether you are on Linux or Windows you have a home directory. Often you will only be able to write and edit files within your home directory and its sub-directories. However, you may be able to read and copy from most other folders.

On Windows your home directory will be a sub-directory of "C:\users\". For example, "C:\users\jwambaug".

Common and Important Directories

You will want to create a directory for your R packages locally on your machine:


Download Folders

Many web browsers put their files in a default directory such as


Temporary (or "Swap") Space

On Windows the path "c:\" refers to the physical hard drive inside your computer. Therefore the only copy of "c:\users\USERNAME\" is on your computer and if that drive or the whole computer is damaged or destroyed you have lost those files. For this reason we call this "temporary" space. You don't want to keep the only copies of your files there.

Installing Necessary Software

Installing R package "httk"


Installing R package "readxl"


Installing R package "ggplot2"


Load the necessary packages:


Control run speed

Portions of this vignette use Monte Carlo sampling to simulate variability and propagate uncertainty. The more samples that are used, the more stable the results are (that is, the less likely they are to change if a different random sequence is used). However, the more samples that are used, the longer it takes to run. So, to speed up how fast these examples run, we specify here that we only want to use 25 samples, even though the actual httk default is 1000. Increase this number to get more stable (and accurate) results:

NSAMP <- 25

Step One: Loading In Vitro Screening Data

The ToxCast and Tox21 research programs employ batteries of high-throughput assays to assess chemical bioactivity in vitro. Not every chemical is tested through every assay (Richard et al., 2016). Most assays are conducted in concentration response, and each corresponding assay endpoint is analyzed statistically to determine if there is a concentration-dependent response or “hit” using the ToxCast Pipeline (Filer et al., 2017). Most assay endpoint-chemical combinations are nonresponsive. Here, only the hits are treated as potential indicators of bioactivity.

In vitro bioactivity does not necessarily indicate adversity or hazard. Biological models can be used to make predictions of toxicity based on in vitro data (for example, Sipes et al., 2011, Browne et al., 2015 and Kleinstreuer et al., 2017). However, here we make a more conservative, precautionary assumption that concentrations too low to cause in vitro bioactivity are more likely to be safe (Paul Friedman et al., 2019). Among all of the assay hits for each chemical, we choose to use the lower 10th percentile of the $\mu M$ potencies (50% active concentration or $AC_{50}$). We are assuming that the 10th percentile represents a low concentration for activating multiple assays (assuming 10 or more bioactive assays were observed for a given chemical). However, this bioactivity does not necessarily have a direct toxicological interpretation.

Downloading ToxCast Data

The main page for the data is here:

Most useful to us is a single file containing all the hits across all chemicals and assays:

As of November, 2022 the most recent version was 3.5 and was available as an .Rdata file (invitrodb_3_5_mc5.Rdata), which we can download and place in our working directory (FOLDERPATH).

Loading the Data into R

Don't forget to use "setwd(FOLDERPATH)" to change to your R working folder:


Then load the file:


We now have a file "mc5" in our woking environment. It's absolutely huge, so lets shrink it down to contain only the chemicals we care about.

This list could be chemicals with HTTK data (note, this step cuts things down to chemicals with human HTTK data, you could change species to rat and get something different).

toxcast.httk <- subset(mc5, dsstox_substance_id %in% get_cheminfo(

Critical Step: Select your chemicals of interest

We might have chemicals we are interested in for one reason or another -- you could type any chemical ID's you want into the following my.chems vector, or even load it from a file.

Here we'll pick 50 chemicals at random from among the ToxCast chemicals:

my.chems <- sample(mc5$dsstox_substance_id,50)
example.toxcast <-[mc5$dsstox_substance_id %in% my.chems,])

Unfortunately for this vignette there are too many ToxCast data to fit into a 5mb R package. So we will subset to just the selected chemicals and distribute only those data. In addition, out of 78 columns in the data, we will keep only eight. Download the full data following the instructions above.

example.toxcast <- example.toxcast[, c("chnm",
# reduce precision to decrease space:
cols <- c("modl_ga", "modl_ac10", "modl_acc")
for (this.col in cols)
  example.toxcast[, this.col] = signif(example.toxcast[, this.col], 3)

Interpreting the Assay Information

In the mc5 table, "aenm" gives the assay name while "dsstox_substance_id" is the chemical identity from the CompTox Chemicals Dashboard. Since both names and CAS-RN can be subject to variations, we track chemical identities with the DSSTox Substance ID’s or “DTXSIDs” (Grulke et al., 2019).

The column "hitc" is 1 (that is, "TRUE") if there was a statistically-signficant systematic concentration-response observed (this is a "hit"). "hitc" is 0 (FALSE) if there was no hit. The column "mc6_flags" indicates features of the hit that might be of interest for further examination but are not necessarily disqualifying.

The column "modl" indicates the type of concentration-response modl that was selected (based on Akaike Information Criterion) to best describe the data. The possibilities are constant ("cnst" -- no concentation response), Hill equation ("hill"), or gain-loss model ("gnls" -- the product of an increasing and decreasing Hill equations). "gnls" is often interpreted as being caused by cytotoxicity.

The column "modl_ga" indicates the log10 uM concentration at which the chemical caused 50% activity. This is also called the chemical-assay “potency” or “$AC_{50}$”.

Each chemical might have multiple assays for which there is a hit. Each chemical may have been run multiple times (experimental replicates), so that there may be multiple samples from the same chemical for the same assay. Different samples are indicated by the column “spid” (sample id).

example.toxcast <- httk::example.toxcast
knitr::kable(head(example.toxcast), caption = "ToxCast In Vitro Bioactivity Data",

Find the concentrations of interest

Then we can build a table summarizing the in vitro data. For a "hit", the "AC50" indicates the 50% activation concentration from the best curve-fit, "AC10" indicates the concentration estimated to cause 10%, and "ACC" is the "activity concentration at cutoff" are which is the concentration that first causes statistically significant activity. See Filer et al. (2017) figure three for additional discussion.

toxcast.table <- NULL
for ( in unique(example.toxcast$dsstox_substance_id))
  this.subset <- subset(example.toxcast, dsstox_substance_id ==
  these.hits <- subset(this.subset, hitc==1)
  if (dim(these.hits)[1]>0){
      this.row <- data.frame(Compound=as.character(this.subset[1,"chnm"]),
                         Total.Assays = dim(this.subset)[1],
                         Unique.Assays = length(unique(this.subset$aeid)),
                         Total.Hits = dim(these.hits)[1],
                         Unique.Hits = length(unique(these.hits$aeid)),
                         Low.AC50 = signif(min(these.hits$modl_ga),3),
                         Low.AC10 = signif(min(these.hits$modl_ac10),3),
                         Low.ACC = signif(min(these.hits$modl_acc),3),
                         Q10.AC50 = signif(quantile(these.hits$modl_ga,probs=0.1),3),
                         Q10.AC10 = signif(quantile(these.hits$modl_ac10,probs=0.1),3),
                         Q10.ACC = signif(quantile(these.hits$modl_acc,probs=0.1),3),
                         Med.AC50 = signif(median(these.hits$modl_ga),3),
                         Med.AC10 = signif(median(these.hits$modl_ac10),3),
                         Med.ACC = signif(median(these.hits$modl_acc),3),
                         Q90.AC50 = signif(quantile(these.hits$modl_ga,probs=0.9),3),
                         Q90.AC10 = signif(quantile(these.hits$modl_ac10,probs=0.9),3),
                         Q90.ACC = signif(quantile(these.hits$modl_acc,probs=0.9),3)
    toxcast.table <- rbind(toxcast.table, this.row)
rownames(toxcast.table) <- seq(1,dim(toxcast.table)[1])
knitr::kable(head(toxcast.table[,1:6]), caption = "Summarized ToxCast Data",

Step Two: High Throughput Toxicokinetics from In Vitro Data

For each chemical in your subset of ToxCast, add the HTTK plasma steady-state concentration if it is available. calc_mc_css() gives us the predicted steady-state plasma concentration resulting from an ongoing 1 mg/kg/day exposure.

Steady State Plasma Concentration $C_{ss}$

for ( in unique(toxcast.table$DTXSID))
# get_cheminfo() gives a list of all the CAS numbers for which HTTK will work:
  if ( %in% get_cheminfo(info="dtxsid", suppress.messages=TRUE))
# Set a random number generator seed so that the Monte Carlo will always give
# the same random sequence:
    toxcast.table[toxcast.table$,"Css"] <- 
    toxcast.table[toxcast.table$,"Css.Type"] <- "in vitro"

Let's look at just the relevant columns from our table. Any Css values of "NA" (not a number) indicate that we don't currently have in vitro TK data for those chemicals:

                                caption = "Summarized ToxCast Data",

Step Three: High Throughput Toxicokinetics with QSPR Predictions

For chemicals where no in vitro toxicokinetic data are available, we can sometimes load predictions from in silico models. We do not provide in silico models for $f_{up}$ and $Cl_{int}$ themselves (in many cases those models are much larger in terms of file size than the whole httk package). However we do have tables of pre-made predictions form a variety of models. For example, you can load the table of quantitative structure-property relationship (QSPR) model predictions from the Sipes et al. (2017) supplemental material:


Now go through the list of chemicals again but only calculate if we didn't already have a value (NOTE: load_sipes2017 will not overwrite the in vitro data by default, so you'll still get the same answer if you use the same random number seed, but you would also be wasting time):

for ( in unique(toxcast.table$DTXSID))
  if ( %in% get_cheminfo(info="dtxsid", suppress.messages=TRUE) &[toxcast.table$,"Css"]))
# Set a random number generator seed so that the Monte Carlo will always give
# the same random sequence:
    toxcast.table[toxcast.table$,"Css"] <- 
    toxcast.table[toxcast.table$,"Css.Type"] <- "in silico"

Take another look at our table -- many of the gaps are now filled in. Any Css values of "NA" (not a number) indicate that we don't currently have in vitro TK data or in silico predictions from Sipes et al. (2017) for those chemicals (note that there are also predictions available from Pradeep et al. (2020) (load_pradeep2020) and Dawson et al. (2021) (load_dawson2021):

                                caption = "Summarized ToxCast Data",

Step Four: Reverse Dosimetry In Vitro-In Vivo Extrapolation

Using toxicokinetics to predict tissue concentrations resulting from a known dose is known as "forward dosimetry". Using toxicokinetics to estimate the dose that might have led to a known tissue cocentration is known as "reverse dosimetry" (Clewell, et al., 2008). httk allows the use of reverse dosimetry to calculate the equivalent steady-state dose to produce a plasma concentration equal to a bioactive concentration identified in vitro (Breen et al., 2021). For example, we can start from the tenth percentile ToxCast $AC_{50}$ and find the administered equivalent dose that would cause that concentration in the plasma. We do this by dividing the in vitro concentration by the $C_{ss}$ from 1 mg/kg/day, ending up with a dose rate with units of mg/kd/day that will product the in vitro concentration: $$AED = \frac{AC_{50}}{C_{ss}}$$ Don't forget the ToxCast concentrations are given on the log-10 scale:

toxcast.table$EquivDose <- signif(10^toxcast.table$Q10.AC50 / toxcast.table$Css,

Look at the table again:

                                 caption = "Summarized ToxCast Data",

Step Five: Compare with Estimated Exposure Rates

The U.S. National Academy of Sciences, Engineering, and Medicine advises calculating chemical risk posed to the public health on the basis of hazard, the dose-response relationship, and exposure (NRC, 1983). We can use in vitro bioactivity as a surrogate for hazard and httk as a surrogate for the dose-reponse relationship. However, to prioritize chemicals on the basis of putative risk we need some estimate of human exposure. Because most chemicals lack measured data characterizing population intake rates (mg/kg body weight/ day) (Egeghy, 2012), a consensus prediction from EPA's Systematic Empirical Evaluation of Models (SEEM) is often the only option available for exposures. The SEEM consensus prediction is composed of multiple exposure predictors aggregated on the basis of likely exposure pathways (Ring, 2018). The individual exposure predictors are weighted based upon their ability to predict intake rates that could be inferred from biomonitoring data (Ring, 2017; Wambaugh, 2014).
The SEEM exposure forecasts are uncertain, and the upper boundary of the 95% credible interval is used here. Although a 95% credible interval is calculated, this reflects uncertainty but not variability -- the (Ring, 2018) model predicts population median exposure rates for the U.S. population.

There are three ways to access the SEEM predictions. If you can access the journal article online, we can load the SEEM predictions from Supplemental Table 1 of the Supporting Information for Ring et al. (2018). Don't forget that you have to "unzip" a file that ends in ".zip" and place the ".txt" file in your working directory with your other files.

SEEM <- read.csv("SupTable-all.chem.preds-2018-11-28.txt",stringsAsFactors=F)

If you want, you can also download the predictions for specific chemicals from the Comptox Chemicals Dashboard. Be sure to use Batch Search and select the option for "NHANES/Predicted Exposure". You can download the file in a "comma separated values" or ".CSV" format, in which case you use "read.csv". Or you can download an Excel ".xls" file which requires load the package readXL.

SEEM <- read_excel("CCD-Batch-Search_DATE.xlsx",sheet=2)

Perhaps most easily we can grab SEEM predictions already in RData format from Download the file Ring2018Preds.RData and load it:

SEEM <- Ring2018.preds

Again we do not have the space to distribute all the SEEM predictions within this R package, but we can give you our example chemicals:

example.seem <-, 
                              dsstox_substance_id %in% toxcast.table$DTXSID))
for (this.col in 4:36) 
  example.seem[,this.col] <- signif(example.seem[,this.col],4)

Merge the consensus model predictions with your table:

example.seem <- httk::example.seem
toxcast.table <- merge(toxcast.table,
                       all.x = TRUE)

Now we can calculate a bioactivity:exposure ratio (BER), which is a risk-like metric since it takes hazard, dose-response, and exposure into account (Wetmore, 2015):

toxcast.table$BER <- signif(toxcast.table$EquivDose/toxcast.table$seem3.u95, 3)

Reorder your table by BER:

toxcast.table <- toxcast.table[order(toxcast.table$BER),]
                                 caption = "Bioactivity:Exposure Ratios",

Make a BER plot

Convert the chemical names into factors for plotting:

toxcast.table$Compound <- factor(toxcast.table$Compound,

Then we can create a plot where the BER for each chemical is shown by the margin between putative hazard (in red) and estimated exposure (in blue). The wider the margin, the lesser the risk. Chemicals are arranged from left-to-right by increasing margin (that is, decreasing risk).

BER.plot <- ggplot(data=toxcast.table) +
               alpha=0.5) + 
               alpha=0.5) +   
  geom_point(aes(x=Compound,y=(10^Med.AC50)/Css),shape=3,color="red") +
  geom_point(aes(x=Compound,y=seem3),shape=3,color="blue") +
  theme_bw() +
  theme(axis.title.x = element_text(size=8)) + 
  theme(axis.title.y = element_text(size=8)) + 
  scale_y_log10() + 
  theme(axis.text.x = element_text(
    face = "bold", 
    hjust = 1, 
    vjust = 1, 
    size = 4, 
    angle = 45)) +
  ylab("Bioactive Dose & Exposure\nmg/kg BW/day") + 
  xlab("Chemicals Ranked By BER")

Try the httk package in your browser

Any scripts or data that you put into this service are public.

httk documentation built on March 7, 2023, 7:26 p.m.