In ronkeizer/vpc: Create Visual Predictive Checks

The VPC is the most widely used diagnostic tool in pharmacometrics (see e.g. here), and is commonly created using PsN and Xpose, using NONMEM as the simulation engine. The aim of the current R library is to provide an improved tool that is:

a single-step process for creating a VPC in R, without the need for PsN or Xpose.
allows changing of vpc parameters such as binning and stratification upon creation of the plot, not in a separate pre-processing step.
easier debugging than PsN+Xpose, all data parsing and plotting in one R package
more flexible regarding input (use simulated data from NONMEM, Monolix, Stan, R::PKPDsim, or any other simulation tool)
easier to customize, e.g. request any prediction / confidence interval or binning strategy upon plotting.
easier to extend: the output is a ggplot object which can be themed and extended

Installation

library(devtools)
install_github("ronkeizer/vpc")
library(vpc)

    eval_all <- TRUE
    # library("devtools")
    # install_github("ronkeizer/vpc")

How to use

Functions:

vpc: VPC for continuous data
vpc_cat: VPC for categorical data
vpc_cens: VPC for censored continuous data (e.g. data < LOQ)
vpc_tte: VPC for (repeated) time-to-event data

The main arguments to these function are the sim and obs arguments, which specify a simulation dataset and an observation dataset. All other arguments can be used to customize data parsing and visual appearance of the VPC such as stratification and binning. All four functions will return a ggplot2 object.

Examples

Load the library and get the observation data and simulation data. In the first example, we'll use a dataset that's available in R by default (Theophylline) and generate the simulation dataset in R.

    library(dplyr)

    ## Load the theophylline PK dataset
    obs <- Theoph
    colnames(obs) <- c("id", "wt", "dose", "time", "dv")
    obs <- obs %>%
            dplyr::group_by(id) %>%  
            mutate(sex = round(runif(1))) # randomly assign a "sex" covariate
    sim <- sim_data(obs, # the design of the dataset
                    model = function(x) { # the model
                      pk_oral_1cmt (t = x$time, dose=x$dose * x$wt, ka = x$ka, ke = x$ke, cl = x$cl * x$wt)
                    },
                    error = list(additive = 0.1),
                    theta = c(2.774, 0.0718, .0361),                 # parameter values
                    omega_mat = c(0.08854,                           # specified as lower triangle by default;
                                  0.02421, 0.02241,                  # note: assumed that every theta has iiv, set to 0 if no iiv.
                                  0.008069, 0.008639, 0.02862),      
                    par_names = c("ka", "ke", "cl"),                 # link the parameters in the model to the thetas/omegas
                    n = 500)

However, instead we could use observation and simulation data from NONMEM, e.g. (not run):

obs <- read_table_nm("sdtab1")   # an output table with at least ID, TIME, DV
sim <- read_table_nm("simtab1")  # a simulation file with at least ID, TIME, DV

The read_table_nm() function comes with the vpc library and is a fast way to read in output data created from the $TABLE record in NONMEM, including tables with multiple subproblems.

Next, the VPC can simply be created using:

vpc (sim = sim, obs = obs)

Stratified by SEX:

vpc (sim = sim, obs = obs, stratify = c("sex"), facet = "rows")

The vpc functions in this packages allow stratification by 2 variables (horizontal and vertical faceting). If more strata are desired, a new combined variable should be defined that defines the factors.

Note: The output from the vpc() functions are ggplot2-objects, and they can easily be extended and themed using the well-known functions from the ggplot2 package. One cautionary note however is that you should refrain from applying any additional faceting functions after the VPC has been created. In other words, faceting should be done using the vpc() functions (stratify=... argument). Otherwise you will end up with funky (in the best case) or misleading (in the worst case) plots.

Pred-correction, and plotting of data:

vpc (sim = sim, obs = obs, pred_corr = TRUE, show=list(obs_dv = TRUE, obs_ci = TRUE))

With more explicit use of options, and saving the object:

vpc(sim = sim,
           obs = obs,                                   # supply simulation and observation dataframes
           obs_cols = list(
             dv = "dv",                               # these column names are the default,
             idv = "time"),                            #   update these if different.
           sim_cols = list(
             dv = "sdv",
             idv = "time"),
           bins = c(0, 2, 4, 6, 8, 10, 16, 25),             # specify bin separators manually
           stratify = c("sex"),                         # multiple stratifications possible, just supply as vector
           pi = c(0.05, 0.95),                          # prediction interval simulated data to show
           ci = c(0.05, 0.95),                          # confidence intervals to show
           pred_corr = FALSE,                           # perform prediction-correction?
           show = list(obs_dv = TRUE),                              # plot observations?
           facet = "rows",                              # wrap stratifications, or as "row" or "column"
           ylab = "Concentration",
            xlab = "Time (hrs)")

Note: If you imported the data from NONMEM, the VPC function will automatically detect column names from NONMEM, such as ID, TIME, DV. If you simulated data in R or got the data from a different software, you will probably have to change the variable names for the dependent and independent variable, and the individual index.

Using the show argument, you can instruct the vpc function what plot elements to show. The defaults are:

  obs_dv = FALSE        # the observed data
  obs_ci = TRUE         # the confidence interval of the observed data
  obs_median = TRUE     # the median of the observed data
  sim_median = FALSE    # the median of the simulated data
  sim_median_ci = TRUE  # the confidence interval around the median of the simulated data
  pi = FALSE            # the prediction interval quantiles
  pi_ci = TRUE          # the confidence interval around the prediction interval quantiles
  pi_as_area = FALSE    # show the PI as area instead of two lines
  bin_sep = TRUE        # show the bin separaters (as geom_rug)

For more information about these elements of the VPC, and e.g. the difference between confidence intervals and prediction intervals, and why it is usually important to show both, please have a look at e.g. this tutorial about VPCs.

Additionally, colors, fills, transparency (alpha), and linetypes and sizes can be changed easily using the vpc_theme argument and function. More general plot theming can be accomplished by supplying a ggplot2 theme to the ggplot_theme argument.

vpc(sim, obs,
    vpc_theme = new_vpc_theme (list(
       sim_pi_fill = "#aa6666", sim_pi_alpha = 0.15,
       sim_median_fill = "#66aa66", sim_median_alpha = 0.3,
       obs_ci_color = "red", obs_ci_linetype = 'solid',
       bin_separators_color = NA)),
    ggplot_theme = theme_empty
)

Binning

The vpc functions currently provide three options for binning (bins= argument):

time: Divide bins equally over time (or whatever independent variable is used). Recommended only when there is no observable clustering in the independent variable.
data: Divide bins equally over the amount of data ordered by independent variable. Recommended only when data are for nominal timepoints and all datapoints are available.
density: Divide bins based on data-density, i.e. place the bin-separators at nadirs in the density function. An approximate number of bins can be specified, but it is not certain that the algorithm will strictly use the specified number. More info in ?auto_bin().
jenks: Default and recommended method. Jenk's natural breaks optimization, similar to K-means clustering.
kmeans: K-means clustering.
pretty, quantile, hclust, sd, bclust, fisher. Methods provided by the classInt package, see the package help for more information.

Automatic determination of optimal bin number (as described e.g. by Lavielle et al. and Sonehag et al.) will be provided soon.

Censored data

The vpc_cens() function can be used to create a VPC for the probability of left- or right-censoring such as in the case of data ULOQ. There is no need to add a variable to the dataset to flag the censored cases, the function only requires the lloq or uloq. The example below artificially induces an LLOQ of 5 for the above model / dataset, and generates a VPC for the probability of left-censoring.

vpc_cens(sim, obs, lloq = 5)

Categorical data

VPCs can also be made for categorical data. These show the probability of observing a specific discrete outcome (or count), and plot this versus the simulated probability, shown as a confidence interval.

if(file.exists(paste0(system.file(package="vpc"), "/extdata/sdtab45"))) {
  obs_cat <- read_table_nm(file=paste0(system.file(package="vpc"), "/extdata/sdtab45"))
  sim_cat <- read_table_nm(file=paste0(system.file(package="vpc"), "/extdata/simtab45"))
  vpc_cat (sim = sim_cat, obs = obs_cat,
           obs_cols = list(dv = "SMXH"), sim_cols = list(dv = "SMXH"))
}

Time-to-event data

Similar to the VPC for continuous data, the VPC for TTE data requires simulated data. In general, there are two distinct approach to simulate survival data:

Hazard integration: Integrate the hazard over time, and at any possible observation timepoint randomly draw a binary value based on the probability of observing the event. The disadvantage of this method is that it is slow due to the numerical solving of the ODEs. Also, a dataset with a dense design grid has to be used for simulation, i.e. one that has observation times at every possible timepoint that an event can occur for all individuals.
Direct sampling: Sample event times directly from the distribution used to model the data (e.g. Weibull, exponential, Gompertz). Advantages of this approach is that it is much faster, and it does not require a dense grid. The disadvantage with this approach is however that the hazard is assumed constant over time, so models with time-dependent hazards cannot easily be simulated with this approach. This approach is straightforward in R but cannot easily be implemented in NONMEM. Example will follow soon.

An example for time-to-event data is shown below. The datasets are supplied with the vpc library.

data(rtte_obs_nm)
data(rtte_sim_nm)

Treat RTTE as TTE, no stratification:

vpc_tte(sim = rtte_sim_nm,          
        obs = rtte_obs_nm,
        rtte = FALSE,
        sim_cols = list(dv = "dv", idv="t"),
        obs_cols = list(idv = "dt"))

Stratified for covariate and study arm, and binned and smooth:

vpc_tte(sim = rtte_sim_nm,
        obs = rtte_obs_nm,
        stratify = c("sex","drug"),
        rtte = FALSE,
        bins = "kmeans",
        n_bins = 20,
        smooth = TRUE,
        sim_cols = list(dv = "dv", idv="t"),
        obs_cols = list(idv = "dt"))

Stratified for event number (RTTE) and study arm:

vpc_tte(sim = rtte_sim_nm,
        obs = rtte_obs_nm,
#        show = list(obs_cens = FALSE),
        rtte = TRUE, rtte_calc_diff = TRUE, events = c(1:3),
        stratify=c("drug"),
        bins = "time",
        n_bins = 20,
        smooth = TRUE,
        sim_cols = list(dv = "dv", idv="t"),
        obs_cols = list(idv = "t"), verbose=TRUE)

Kaplan-Meier Mean Covariate plots (KMMC)

vpc_tte(sim = rtte_sim_nm,
        obs = rtte_obs_nm,
        rtte = FALSE,
        bins=c(10, 20, 40, 60, 80, 100), # binning is recommended with KMMC
        kmmc = "sex",
        sim_cols = list(dv = "dv", idv="t"),
        obs_cols = list(idv = "dt"))