Apparent Membrane Permeability (Caco-2)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = 'center'
)

Introduction

This vignette guides users on how to estimate apparent membrane permeability (P~app~) from mass spectrometry data. Apparent membrane permeability is a chemical-specific parameter that describes the general transport of the chemical through a membrane (@honda2025impact, @hubatsch2007determination).

The mass spectrometry data should be collected from a Caco-2 assay in which the test compound was added to either the apical or basolateral side of a confluent and differentiated Caco-2 cell monolayer as seen in Figure 1 (@honda2025impact, @hubatsch2007determination).

knitr::include_graphics("img/Caco2_diagram.png")

Suggested packages for use with this vignette

# Primary package 
library(invitroTKstats)
# Data manipulation package 
library(dplyr)
# Table formatting package
library(flextable)

Load Data

First, we load in the example dataset from invitroTKstats.

# Load example caco2 data 
data("Caco2-example")

Five datasets are loaded in: caco2_cheminfo, caco2_L0, caco2_L1, caco2_L2, and caco2_L3. The first, caco2_cheminfo, contains chemical information necessary for identification mapping; it is used to create Level 0 data. The latter datasets contain Caco-2 data at Levels 0, 1, 2, and 3, respectively. For the purpose of this vignette, we start with caco2_L0, the Level 0 data, to demonstrate the complete processing pipeline.
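
A quick way to confirm what was loaded and the size of the Level 0 data (a minimal sketch using base R):

# List the objects loaded by the example data call and check the Level 0 dimensions
ls(pattern = "caco2")
dim(caco2_L0)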

caco2_L0 is the output from the merge_level0 function, which compiles raw lab data from specified Excel files into a single data frame. The data frame contains exactly one row per sample with information obtained from the mass spectrometer. For more details on curating raw lab data into a single Level 0 data frame, see the "Data Guide Creation and Level-0 Data Compilation" vignette.

The following table displays the first three rows of caco2_L0, our Level 0 data.

head(caco2_L0, n = 3) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
    )
  ) %>% 
  set_caption(caption = "Table 1: Level 0 data",
              align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()

Level 1 processing

format_caco2 is the Level 1 function used to create a standardized data frame. This level of processing is necessary because naming conventions or formatting can differ across datasets.

If the Level 0 data already contains the required column, then the existing column name can be specified. For example, caco2_L0 already contains a column specifying the sample name called "Sample". However, the default column name for sample name is "Lab.Sample.Name". Therefore, we specify the correct column with sample.col = "Sample". In general, to specify an already existing column that differs from the default, the user must use the parameter with the .col suffix.

If the Level 0 data does not already contain the required column, then the entire column can be populated with a single value. For example, caco2_L0 does not contain a column specifying biological replicates. Therefore, we populate the required column with biological.replicates = 1. In general, to specify a single value for an entire column, the user must use the parameter without the .col suffix.
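
To make the distinction concrete, a quick check of the Level 0 column names shows which pattern applies; the default column name "Biological.Replicates" is only an assumption here, since the point is that no replicate column exists in caco2_L0.

# "Sample" already exists in the Level 0 data, so sample.col = "Sample" is used
"Sample" %in% colnames(caco2_L0)
# No biological replicate column exists (checked under the default name), so a
# single value is supplied instead with biological.replicates = 1
"Biological.Replicates" %in% colnames(caco2_L0)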

Users should be mindful when choosing to specify a single value for all of their samples and should verify that this is the intended behavior.

Some columns must be present in the Level 0 data while others can be filled with a single value. At minimum, the following columns must be present in the Level 0 data and specification with a single entry is not permitted: sample.col, lab.compound.col, dtxsid.col, compound.col, area.col, istd.col, type.col, direction.col, receiver.vol.col, and donor.vol.col.
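
As a sketch, the presence of these required columns can be confirmed before calling format_caco2. The non-default names below ("Sample", "Lab.Compound.ID", "Compound", "Peak.Area", "ISTD.Peak.Area") are the ones used by this vignette's dataset; the remaining names are assumed to be the package defaults listed in Table 2 below.

# Check that the columns required in the Level 0 data are present
required_l0 <- c("Sample", "Lab.Compound.ID", "DTXSID", "Compound",
                 "Peak.Area", "ISTD.Peak.Area", "Type", "Direction",
                 "Vol.Receiver", "Vol.Donor")
all(required_l0 %in% colnames(caco2_L0))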

If there is no additional note.col in the Level 0 data, users should use note.col = NULL to fill the column with "Note".

The rest of the following columns may either be specified from the Level 0 data or filled with a single value: date.col or date, membrane.area.col or membrane.area, test.conc.col or test.conc, cal.col or cal, dilution.col or dilution, time.col or time, istd.name.col or istd.name, istd.conc.col or istd.conc, test.nominal.conc.col or test.nominal.conc, biological.replicates.col or biological.replicates, technical.replicates.col or technical.replicates, analysis.method.col or analysis.method, analysis.instrument.col or analysis.instrument, analysis.parameters.col or analysis.parameters, level0.file.col or level0.file, and level0.sheet.col or level0.sheet.

# Create a table of required arguments for Level 1

req_cols <- data.frame(matrix(nrow = 34, ncol = 5))
vars <- c("Argument", "Default", "Required in L0?", "Corresp. single-entry Argument", "Descr.")
colnames(req_cols) <- vars

# Argument names 
arguments <- c("FILENAME", "data.in", "sample.col", "lab.compound.col", "dtxsid.col", "date.col", 
                "compound.col", "area.col", "istd.col", "type.col", "direction.col", "membrane.area.col",
                "receiver.vol.col", "donor.vol.col", "test.conc.col", "cal.col", "dilution.col",
                "time.col", "istd.name.col", "istd.conc.col", "test.nominal.conc.col", "biological.replicates.col",
                "technical.replicates.col", "analysis.method.col", "analysis.instrument.col", "analysis.parameters.col",
                "note.col", "level0.file.col", "level0.sheet.col", "output.res", "save.bad.types", "sig.figs", "INPUT.DIR", "OUTPUT.DIR")
req_cols[, "Argument"] <- arguments

# Default arguments 
defaults <- c("MYDATA", NA, "Lab.Sample.Name", "Lab.Compound.Name",
              "DTXSID", "Date", "Compound.Name", "Area", 
              "ISTD.Area", "Type", "Direction", "Membrane.Area",
              "Vol.Receiver", "Vol.Donor", "Test.Compound.Conc", 
              "Cal", "Dilution.Factor", "Time", "ISTD.Name", 
              "ISTD.Conc", "Test.Target.Conc",
              "Biological.Replicates", "Technical.Replicates",
              "Analysis.Method", "Analysis.Instrument", 
              "Analysis.Parameters", "Note", "Level0.File",
              "Level0.Sheet", "FALSE", "FALSE", "5", "NULL", "NULL")
req_cols[, "Default"] <- defaults 

# Argument required in L0?
req_cols <- req_cols %>% 
  mutate("Required in L0?" = case_when(
    Argument %in% c("sample.col", "lab.compound.col", "dtxsid.col", "compound.col", "area.col", "istd.col", 
                    "type.col", "direction.col", "receiver.vol.col", "donor.vol.col") ~ "Y",
    Argument %in% c("FILENAME", "data.in", "output.res", "save.bad.types", "sig.figs", "INPUT.DIR", "OUTPUT.DIR") ~ "N/A",
    .default = "N"
  ))

# Corresponding single-entry Argument 
req_cols <- req_cols %>% 
  mutate("Corresp. single-entry Argument" = ifelse(.data[[vars[[3]]]] == "N" & .data[[vars[[1]]]] != "note.col", 
                                                   sub("\\.col$", "", Argument), NA)) 

# Brief description 
description <- c("Output and input filename",
                 "Level 0 data frame", 
                 "Lab sample name", 
                 "Lab test compound name (abbr.)",
                 "EPA's DSSTox Structure ID",
                 "Lab measurement date",
                 "Formal test compound name",
                 "Target analyte peak area",
                 "Internal standard peak area",
                 "Sample type (Blank/D0/D2/R2)",
                 "Experiment direction",
                 "Membrane area",
                 "Receiver compartment volume",
                 "Donor compartment volume",
                 "Standard test chemical concentration",
                 "MS calibration",
                 "Number of times sample was diluted",
                 "Time before compartments were measured",
                 "Internal standard name",
                 "Internal standard concentration",
                 "Expected initial chemical concentration added to donor",
                 "Replicates with the same analyte",
                 "Repeated measurements from one sample",
                 "Analytical chemistry analysis method",
                 "Analytical chemistry analysis instrument",
                 "Analytical chemistry analysis parameters",
                 "Additional notes",
                 "Raw data filename",
                 "Raw data sheet name",
                 "Export results (TSV)?",
                 "Export bad data (TSV)?",
                 "Number of significant figures",
                 "Input directory of Level 0 file",
                 "Export directory to save Level 1 files")
req_cols[, "Descr."] <- description
req_cols %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list(height = 200)
      )
  ) %>% 
  set_caption("Table 2: Level 1 'format_caco2' function arguments", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()

A TSV file containing the Level 1 data can be exported to the user's per-session temporary directory, whose path can be found with tempdir(). For more details, see [https://www.collinberke.com/til/posts/2023-10-24-temp-directories/].

To avoid exporting to this temporary directory, an OUTPUT.DIR must be specified. We have omitted this export entirely with output.res = FALSE (the default). The option to omit exporting a TSV file is also available at Levels 2 and 3 and will be used from this point forward.
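
For reference, the default export location can be printed directly, and a hedged sketch of exporting to a chosen directory is shown as a comment (the path is a placeholder):

# Default export location used when output.res = TRUE and no OUTPUT.DIR is given
tempdir()

# Not run: a sketch of exporting to a chosen directory instead; the path below
# is a placeholder
# format_caco2(..., output.res = TRUE, OUTPUT.DIR = "path/to/output")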

caco2_L1_curated <- format_caco2(FILENAME = "Caco2_vignette",
                                 data.in = caco2_L0,
                                 # columns present in L0 data 
                                 sample.col = "Sample",
                                 lab.compound.col = "Lab.Compound.ID",
                                 compound.col = "Compound",
                                 area.col = "Peak.Area",
                                 istd.col = "ISTD.Peak.Area",
                                 test.conc.col = "Compound.Conc",
                                 analysis.method.col = "Analysis.Params",
                                 # columns not present in L0 data
                                 biological.replicates = 1,
                                 technical.replicates = 1,
                                 membrane.area = 0.11,
                                 cal = 1,
                                 time = 2, 
                                 istd.conc = 1,
                                 test.nominal.conc = 10,
                                 analysis.instrument = "Agilent.GCMS",
                                 analysis.parameters = "Unknown",
                                 note.col = NULL,
                                 # don't export output TSV file
                                 output.res = FALSE
                                 )

We receive a message stating that responses of samples with a "Blank" sample type and an NA response have been reassigned to 0. Currently, when the ISTD.Area measurement for a sample is 0, the calculated Response is NaN (not a number) because of the division by 0. In this particular case, all of our samples with a NaN Response are "Blank" sample types, and these "Blank" sample types are needed for point estimation. (Note: NA responses of other sample types are appropriately removed.) Users can verify this with the following check.

all(caco2_L1_curated[caco2_L1_curated$ISTD.Area == 0, "Sample.Type"] == "Blank")
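
The NaN itself comes from R's arithmetic rules: a zero analyte area divided by a zero ISTD area is undefined.

# Division of zero by zero yields NaN, which is why Blank samples with
# ISTD.Area = 0 initially have a NaN Response
0 / 0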

Therefore, the NaNs are replaced with 0 so that the "Blank" sample types are kept in the dataset and our estimates are accurate. Users can verify that all "Blank" samples with ISTD.Area = 0 have their Response values reassigned to 0 with the following check.

# Verify Blank samples with ISTD.Area = 0 also have Response = 0
resp <- caco2_L1_curated %>% 
  dplyr::filter(Sample.Type == "Blank" & ISTD.Area == 0) %>% 
  dplyr::select(Response) %>% 
  unlist()

all(resp == 0)

Now, all of our samples are successfully formatted and returned in caco2_L1_curated, our Level 1 data produced from format_caco2. Each sample has one of the following sample types, with the code used in the Sample.Type column shown in parentheses.

  1. Blank with no chemical added (Blank)
  2. Target concentration added to donor compartment at time 0 (can be referred to as C0) (D0)
  3. Donor compartment at the end of the experiment (D2)
  4. Receiver compartment at the end of the experiment (R2)

If any samples had a different sample type, they would have been removed and reported to the user. If the user wants to export the removed samples as a TSV, the user should set the parameter save.bad.types = TRUE.

The following table displays the first three samples of caco2_L1_curated with a non-Blank sample type. In addition to the columns specified by the user, there is an additional column called Response. This column is the test compound concentration and is calculated as $\textrm{Response} = \frac{\textrm{Analyte Area}}{\textrm{ISTD Area}}*\textrm{ISTD Conc}$ where $\textrm{Analyte Area}$ is defined by the Area column, $\textrm{ISTD Area}$ is defined by the ISTD.Area column, and $\textrm{ISTD Conc}$ is defined by the ISTD.Conc column.

# Select non-Blank sample type to display Response col 
caco2_L1_curated %>% 
  filter(!Sample.Type %in% c("Blank", "CC")) %>% 
head(n = 3) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 3: Level 1 data", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()
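
As a quick sanity check (a sketch, not part of the pipeline), the Response column can be recomputed from its components for the non-Blank samples; agreement is expected only up to the rounding applied by the sig.figs argument.

# Recompute Response = Area / ISTD.Area * ISTD.Conc and compare with the
# column produced by format_caco2 (tolerance allows for rounding)
chk <- caco2_L1_curated %>% 
  dplyr::filter(Sample.Type != "Blank") %>% 
  dplyr::mutate(Response.manual = Area / ISTD.Area * ISTD.Conc)
all.equal(chk$Response, chk$Response.manual, tolerance = 1e-3)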

Level 2 processing

sample_verification is the Level 2 function used to add a verification column. The verification column indicates whether a sample should be included in the point estimation (Level 3) processing. This column allows users to keep all samples in their data but only utilize the reliable samples for P~app~ estimation. All of the data in Level 2 is identical to the data in Level 1 with the exception of the additional Verified column.

To determine whether a sample should be included, the user should consult the wet-lab scientists who generated the data or a chemist who can provide a reliable rationale for excluding samples. This level of processing allows the user to receive feedback, exclude erroneous or unreliable samples, and produce new P~app~ estimates. Thus, there is always an open channel of communication between the user and the wet-lab scientists or chemists.

We will use the already processed Level 2 data frame, caco2_L2, to regenerate our exclusion data. Note that all of our samples are verified; we create an exclusion list here purely for demonstration purposes. In general, the user would not have access to the exclusion information a priori.

The exclusion data frame must include the following columns: Variables, Values, and Message. The Variables column contains the variable names used to filter the excluded rows; here, we use Lab.Sample.Name and DTXSID to identify the excluded rows, separated by a "|". The Values column contains the values of those variables, as character strings, also separated by a "|". The Message column contains the reason for exclusion; here, we use the reasons listed in the Verified column of caco2_L2. The user should refrain from using "|" in any of their descriptions to avoid conflicts with the sample_verification function.
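
For illustration, a hand-built exclusion data frame might look like the following; the sample name, DTXSID, and message are hypothetical. A data frame like this would be supplied to sample_verification through its exclusion.info argument.

# Hypothetical exclusion data frame (values are made up for illustration)
exclusion_example <- data.frame(
  Variables = "Lab.Sample.Name|DTXSID",
  Values = "SampleA01|DTXSID0000000",
  Message = "Peak area below the limit of quantitation"
)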

# Use verification data from loaded in `caco2_L2` data frame 
exclusion <- caco2_L2 %>% 
  filter(Verified != "Y") %>% 
  mutate("Variables" = "Lab.Sample.Name|DTXSID") %>% 
  mutate("Values" = paste(Lab.Sample.Name, DTXSID, sep = "|")) %>% 
  mutate("Message" = Verified) %>% 
  select(Variables, Values, Message)
exclusion %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 4: Exclusion data frame", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()

As expected, our exclusion data frame is empty because all of our samples are verified. If all of the user's samples are verified, they simply do not provide an exclusion.info data frame in sample_verification.

caco2_L2_curated <- sample_verification(FILENAME = "Caco2_vignette",
                                        data.in = caco2_L1_curated,
                                        assay = "Caco-2",
                                        # don't export output TSV file
                                        output.res = FALSE)

Our Level 2 data now contains a Verified column. If the sample should be included, the column contains a "Y" for yes. If the sample should be excluded, the column contains the reason for exclusion.

The following table displays some rows of the Level 2 data.

caco2_L2_curated %>% 
  head(n = 3) %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 5: Level 2 data", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()

Level 3 processing

calc_caco2_point is the Level 3 function used to calculate the P~app~ point estimate for each test compound using a Frequentist framework.

Mathematically, P~app~ is the amount of compound transported per unit time and is defined as $$P_{app} = \frac{dQ/dt}{c_0*A}$$ where $dQ/dt$ is the rate of permeation, $c_0$ is the initial concentration in the donor compartment, and $A$ is the surface area of the cell monolayer.

First, we define the rate of permeation as the amount of compound passing through the monolayer per unit time. It is expressed as $$\frac{dQ}{dt} = \frac{c_{\textrm{receiver}}V_{\textrm{receiver}}}{\Delta t}$$ where $c_{\textrm{receiver}}$ is the concentration of the test compound in the receiver compartment, $V_{\textrm{receiver}}$ is the volume of the receiver compartment, and $\Delta t$ is the total elapsed time. This rate is a type of flux and therefore has units of $\mu \textrm{mol}/s$. The dimensional analysis is described below. $$\left[\frac{dQ}{dt}\right] = \frac{[c_{\textrm{receiver}}][V_{\textrm{receiver}}]}{[\Delta t]} = \frac{[\mu \textrm{mol/L}][\textrm{L}]}{[s]} = \frac{[\mu \textrm{mol}]}{[s]} $$

Next, to estimate $c_\textrm{receiver}$, we first multiply each R2 response by the corresponding R2 dilution factor and take the mean. To account for background noise, we follow the same procedure with the blank responses and dilution factors. We then subtract the two such that $$c_\textrm{receiver} = \textrm{mean}(\textrm{dilution factor}_{R2} * \textrm{Response}_{R2}) - \textrm{mean}(\textrm{dilution factor}_{Blank} * \textrm{Response}_{Blank})$$

The initial concentration in the donor compartment, $c_0$, is calculated similarly but with the D0 response instead of the R2 response.

The remaining quantities are taken directly from columns in our data: $V_\textrm{receiver}$ is defined by the Vol.Receiver column, $\Delta t$ is defined by the Time column, and $A$ is defined by the Membrane.Area column.
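
The calculation can be sketched directly from these formulas. The snippet below assumes a data frame d holding the verified Blank, D0, and R2 rows for a single chemical and a single direction, uses the default column names from Table 2, assumes a single receiver volume, time, and membrane area, and omits any unit conversions applied by calc_caco2_point; it is illustrative only, and calc_caco2_point is the supported way to obtain the estimates.

# Mean blank signal used for background correction
blank_mean <- with(subset(d, Sample.Type == "Blank"),
                   mean(Dilution.Factor * Response))
# Initial donor concentration (c0) and receiver concentration (c_receiver)
c0    <- with(subset(d, Sample.Type == "D0"),
              mean(Dilution.Factor * Response)) - blank_mean
c_rec <- with(subset(d, Sample.Type == "R2"),
              mean(Dilution.Factor * Response)) - blank_mean
# Rate of permeation and apparent permeability
dQdt <- c_rec * unique(d$Vol.Receiver) / unique(d$Time)
Papp <- dQdt / (c0 * unique(d$Membrane.Area))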

With this information, we calculate P~app~ in the donor-to-receiver direction. The estimate's direction is annotated according to Figure 1. For example, results annotated with the "A2B" suffix denote the apical side as the donor and the basolateral side as the receiver. Conversely, results annotated with the "B2A" suffix denote the basolateral side as the donor and the apical side as the receiver.

caco2_L3_curated <- calc_caco2_point(FILENAME = "Caco2_vignette",
                                     data.in = caco2_L2_curated, 
                                     # don't export output TSV file 
                                     output.res = FALSE
                                     )
caco2_L3_curated %>% 
  flextable() %>% 
  bg(bg = "#DDDDDD", part = "header") %>% 
  autofit() %>% 
  set_table_properties(
    opts_html = list(
      scroll = list()
      )
  ) %>% 
  set_caption("Table 6: Level 3 data", align_with_table = FALSE) %>% 
  fontsize(size = 10, part = "body") %>% 
  theme_vanilla()

In addition to returning an estimate for Papp and its components (C0, dQdt) in both directions, our Level 3 data contains estimates for the efflux ratio (Refflux), the fraction recovered (Frec) in both directions, and the qualitative category of each fraction recovered value (Recovery_Class).

The efflux ratio is the ratio between the apparent membrane permeabilities and is expressed as $$R_{\textrm{efflux}} = \frac{P_{app}^{B2A}}{P_{app}^{A2B}}$$

The fraction recovered is the fraction of the initial donor amount recovered in the donor and receiver compartments at the end of the experiment. It assesses the accuracy of the analytical chemistry method used to determine concentration values. It is expressed as $$F_{\textrm{rec}} = \frac{V_{D2}\textrm{DF}_{D2}(Responses_{D2} - \overline{Response}_{Blank}) + V_{R2}\textrm{DF}_{R2}(Responses_{R2} - \overline{Response}_{Blank})}{V_{D0}\textrm{DF}_{D0}(Responses_{D0} - \overline{Response}_{Blank})} $$ where $V$ is the corresponding volume, $\textrm{DF}$ is the corresponding dilution factor, $Responses$ are the corresponding responses, and $\overline{Response}_{Blank}$ is the mean blank response.

Fraction recovered values are then classified and provided as the Recovery_Class value. Values $<0.4$ are classified as "Low Recovery" and values $>2$ are classified as "High Recovery". Values within the range, $0.4 \leq \textrm{Frec} \leq 2$, are considered normal and are not given an explicit classification.
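
A small helper (a sketch, not an invitroTKstats function) makes the rule explicit; values in the normal range are left unlabeled here.

# Classify fraction recovered values following the thresholds described above
classify_recovery <- function(frec) {
  ifelse(frec < 0.4, "Low Recovery",
         ifelse(frec > 2, "High Recovery", ""))
}
classify_recovery(c(0.2, 1.1, 3.5))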

If there are multiple samples per sample type per chemical (i.e., biological replicates), a vector of fraction recovered values will be returned with a "|" separating each value. Similarly, a vector of recovery classifications will be returned, one for each value in the vector of fraction recovered values, again separated by "|".

The following columns are always returned involving the fraction recovered in the A to B direction:

  1. Frec_A2B.vec - vector of fraction recovered values
  2. Recovery_Class_A2B.vec - vector of recovery classifications
  3. Frec_A2B.mean - mean of fraction recovered values
  4. Recovery_Class_A2B.mean - recovery classification of mean fraction recovered value

Note that even if there are no biological replicates, the above four columns are still returned; all of the Frec estimates will just be equivalent. Estimates are also returned in the B to A direction.
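
If needed, a "|"-separated entry can be converted back into a numeric vector; the sketch below assumes the Frec_A2B.vec column name listed above.

# Split the first chemical's Frec_A2B.vec entry into numeric values
as.numeric(strsplit(as.character(caco2_L3_curated$Frec_A2B.vec[1]),
                    split = "|", fixed = TRUE)[[1]])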

Looking at our results, some of our Frec_A2B.vec and Frec_B2A.vec values are less than $0.4$ and are therefore classified as "Low Recovery". Others are within the $[0.4, 2]$ range and are therefore considered normal.

Best Practices: Food for thought

Generally, data processing pipelines should include minimal to no manual coding. It is best to keep clean code that is easily reproducible and transferable. The user should aim to have all the required data and meta-data files properly formatted to avoid further modifications throughout the pipeline.

References

\references{ \insertRef{honda2025impact}{invitroTKstats} \insertRef{hubatsch2007determination}{invitroTKstats} }


