vtreat grouping example

knitr::opts_chunk$set(fig.width = 7)
library(vtreat)
set.seed(23255)

have_rqdatatable = requireNamespace("rqdatatable", quietly=TRUE)
if(have_rqdatatable) {
  library(rqdatatable)
}
#
# takes the frame (d) and the outcome column (d$conc)
# from the global environment
#
showGroupingBehavior = function(groupcol, title) {
  print(title)

  # display means of each group
  print("Group means:")
  means = tapply(d$conc, d[[groupcol]], mean)
  print(means)
  print(paste("Standard deviation of group means:", sd(means)))
}

This vignette shows an example use of y-stratified sampling with a grouping restriction in vtreat.

For this example, we will use the Theosph dataset: data from an experiment on the pharmacokinetics of theophylline. We will demonstrate the desired effects of y-stratification while also respecting a grouping constraint.

The Data

First, let's look at the data.

# panel data for concentration in multiple subjects 
d <- datasets::Theoph
head(d)
summary(d)

We have twelve subjects, who each received a dose of the anti-asthma drug theophylline. The theophylline concentration in the patients' blood was then measured at eleven points during the next 25 hours. Most of the patients got about the same dose, although the dose information reported in the dataset is normalized by weight.

Partitioning the Data for Modeling

Suppose we wanted to fit a model to analyze how a patient's weight affects how theophylline is metabolized, and validate that model with three-fold cross-validation. It would be important that all readings from a given patient stay in the same fold. We might also want the population in each fold to have similar distributions of theophylline concentrations curves.

Recall that the goal of y-stratification is to insure that all samples from the data have as close to identical y distributions as possible. This becomes more difficult when we also have to obey a grouping constraint.

Let's look at three ways of splitting the data into folds. First, we will split the data arbitrarily into three groups, using the modulo of the Subject id to do the splitting.

# a somewhat arbitrary split of patients
subnum = as.numeric(as.character(d$Subject))
d$modSplit = as.factor(subnum %% 3)

We can verify that this split preserves groups, by looking at the table of subject observations in each fold. Each subject should only appear in a single fold.

print(table(Subject=d$Subject, groupid=d$modSplit))

Now let's try the standard y stratification in vtreat.

# stratify by outcome only
# forces concentration to be equivalent
pStrat <- kWayStratifiedY(nrow(d),3,d,d$conc)
attr(pStrat, "splitmethod")
d$stratSplit <- vtreat::getSplitPlanAppLabels(nrow(d),pStrat)

print(table(Subject=d$Subject, groupid=d$stratSplit))

We can see this partition didn't preserve the Subject grouping.

Finally, we can try vtreat's group-preserving split, which also tries to y-stratify as much as possible (by stratifying on the mean y observation from each group).

# stratify by patient and outcome
# allows concentration to vary amoung individual patients
splitter <- makekWayCrossValidationGroupedByColumn('Subject')
split <- splitter(nrow(d),3,d,d$conc)
attr(split, "splitmethod")
d$subjectSplit <- vtreat::getSplitPlanAppLabels(nrow(d),split)

print(table(Subject=d$Subject, groupid=d$subjectSplit))

This is again a subject-preserving partition.

We can compare the mean theophylline concentration and the average pharmacokinetic profile for each fold, for both of the subject-preserving partitions. We see that the stratification reduces some of the variation between folds.

Arbitrary Partition

showGroupingBehavior("modSplit", "Arbitrary grouping")

Group-preserving, y-stratified Partition

showGroupingBehavior("subjectSplit", "Group by patient, stratify on y")


Try the vtreat package in your browser

Any scripts or data that you put into this service are public.

vtreat documentation built on Aug. 20, 2023, 1:08 a.m.