Resampled Performance

Algorithms

Model performance can be estimated with resampling methods that simulate repeated training and test set fits and predictions. With these methods, performance metrics are computed on each resample to produce an empirical distribution for inference. Resampling is controlled in the MachineShop with the functions:

r rdoc_url("BootControl()", "MLControl") : simple bootstrap resampling [@efron:1993:IB] to repeatedly fit a model to bootstrap resampled training sets and predict the full dataset.

r rdoc_url("BootOptimismControl()", "MLControl") : optimism-corrected bootstrap resampling [@efron:1983:LLB; @harrell:1996:MPM].

r rdoc_url("CVControl()", "MLControl") : repeated K-fold cross-validation [@kohavi:1995:SCB] to repeatedly partition the full dataset into K folds. For a given partitioning, prediction is performed on each of the K folds with models fit on all remaining folds. The package default is 10-fold cross-validation.

r rdoc_url("CVOptimismControl()", "MLControl") : optimism-corrected cross-validation [@davison:1997:BMA, eq. 6.48].

r rdoc_url("OOBControl()", "MLControl") : out-of-bootstrap resampling to repeatedly fit a model to bootstrap resampled training sets and predict the unsampled cases.

r rdoc_url("SplitControl()", "MLControl") : split training and test sets [@hastie:2009:ESL7] to randomly partition the full dataset for model fitting and prediction, respectively.

r rdoc_url("TrainControl()", "MLControl") : training resubstitution for both model fitting and prediction on the full dataset in order to estimate training, or apparent, error [@efron:1986:HBA].

For the survival example, repeated cross-validation control structures are defined for the estimation of model performance in predicting survival means and 5 and 10-year survival probabilities. In addition to arguments controlling the resampling methods, a seed can be set to ensure reproducibility of resampling results obtained with the structures. The default control structure can also be set globally by users and subsequently changed back to its package default as desired.

## Control structures for K-fold cross-validation

## Prediction of survival means
surv_means_control <- CVControl(folds = 5, repeats = 3, seed = 123)

## Prediction of survival probabilities
surv_probs_control <- CVControl(folds = 5, repeats = 3, seed = 123) %>%
  set_predict(times = surv_times)

## User-specification of the default control structure
MachineShop::settings(control = CVControl(folds = 5, seed = 123))

## Package default
# MachineShop::settings(reset = "control")

Parallel Processing

Resampling and permutation-based variable importance are implemented with the foreach package [@microsoft:2019:FPF] and will run in parallel if a compatible backend is loaded, such as that provided by the doParallel [@microsoft:2019:DFP] or doSNOW package [@microsoft:2019:DFS].

## Register multiple cores for parallel computations
library(doParallel)
registerDoParallel(cores = 2)

Resample Function

Resampling is performed by calling the r rdoc_url("resample()") function with a variable specification, model, and control structure. Like the r rdoc_url("fit()") function, variables may be specified in terms of a traditional formula, design matrix, model frame, or recipe.

## Resample estimation for survival means and probabilities
(res_means <- resample(surv_fo, data = surv_train, model = GBMModel,
                       control = surv_means_control))

(res_probs <- resample(surv_fo, data = surv_train, model = GBMModel,
                       control = surv_probs_control))

Summary Statistics

The r rdoc_url("summary()") function when applied directly to output from r rdoc_url("resample()") computes summary statistics for the default performance metrics described in the Performance Function section.

## Summary of survival means metric
summary(res_means)

## Summary of survival probability metrics
summary(res_probs)

Other relevant metrics can be identified with r rdoc_url("metricinfo()") and summarized with r rdoc_url("performance()").

## Resample-specific metrics
metricinfo(res_means) %>% names

## User-specified survival means metrics
summary(performance(res_means, metrics = c(cindex, rmse)))

Furthermore, summaries can be customized with a user-defined statistics function or list of statistics functions passed to the stats argument of r rdoc_url("summary()").

## User-defined statistics function
percentiles <- function(x) quantile(x, probs = c(0.25, 0.50, 0.75))
summary(res_means, stats = percentiles)

## User-defined list of statistics functions
summary(res_means, stats = c(Mean = mean, Percentile = percentiles))

Plots

Summary plots of resample output can be obtained with the r rdoc_url("plot()") function. Boxplots are the default plot type; but density, errorbar, and violin plots are also available. Plots are generated with the ggplot2 package [@wickham:2016:GEG] and returned as ggplot objects. As such, annotation and formatting defined for ggplots can be applied to the returned plots.

## Libraries for plot annotation and fomatting
library(ggplot2)
library(gridExtra)

## Individual ggplots
p1 <- plot(res_means)
p2 <- plot(res_means, type = "density")
p3 <- plot(res_means, type = "errorbar")
p4 <- plot(res_means, type = "violin")

## Grid of plots
grid.arrange(p1, p2, p3, p4, nrow = 2)

Stratified Resampling

Stratification of cases for the construction of resampled training and test sets can be employed to help achieve balance across the sets. Stratified resampling is automatically performed if variable specification is in terms of a traditional formula or design matrix and will be done according to the response variable. For model frames and recipes, stratification variables must be defined explicitly with the strata argument to the r rdoc_url("ModelFrame()") constructor or with the r rdoc_url("role_case()", "recipe_roles") function. In general, strata are constructed from numeric proportions for BinomialVariate; original values for character, factor, logical, and ordered; first columns of values for matrix; original values for numeric; and numeric times within event statuses for Surv. Numeric values are stratified into quantile bins and categorical values into factor levels defined by the resampling control functions. Missing values are replaced with non-missing values sampled at random with replacement.

## Model frame with response variable stratification
mf <- ModelFrame(surv_fo, data = surv_train, strata = surv_train$y)
resample(mf, model = GBMModel)

## Recipe with response variable stratification
rec <- recipe(y ~ ., data = surv_train) %>%
  role_case(stratum = y)
resample(rec, model = GBMModel)

Dynamic Model Parameters

As discussed previously in the Model Fit and Prediction section, dynamic model parameters are evaluated at the time of model fitting and can depend on the number of observations in the fitted dataset. In the context of resampling, dynamic parameters are repeatedly evaluated at each fit of the resampled datasets. As such, their values can change based on the observations selected for training at each iteration of the resampling algorithm.

## Dynamic model parameter k = log number of training set observations
resample(surv_fo, data = surv_train, model = CoxStepAICModel(k = .(log(nobs))))

Model Comparisons

Resampled metrics from different models can be combined for comparison with the r rdoc_url("c()", "combine") function. Optional names given on the left hand side of equal operators within r rdoc_url("c()", "combine") calls will be used as labels in output from the r rdoc_url("summary()") and r rdoc_url("plot()") functions. For comparisons of resampled output, the same control structure must be used in all associated calls to r rdoc_url("resample()") to ensure that resulting model metrics are computed on the same resampled training and test sets. The combined resample output can be summarized and plotted as usual.

## Resample estimation
res1 <- resample(surv_fo, data = surv_train, model = GBMModel(n.trees = 25),
                 control = surv_means_control)
res2 <- resample(surv_fo, data = surv_train, model = GBMModel(n.trees = 50),
                 control = surv_means_control)
res3 <- resample(surv_fo, data = surv_train, model = GBMModel(n.trees = 100),
                 control = surv_means_control)

## Combine resample output for comparison
(res <- c(GBM1 = res1, GBM2 = res2, GBM3 = res3))

summary(res)

plot(res)

Pairwise model differences for each metric can be calculated with the r rdoc_url("diff()") function applied to results from a call to r rdoc_url("c()", "combine"). Resulting differences can be summarized descriptively with the r rdoc_url("summary()") and r rdoc_url("plot()") functions and assessed for statistical significance with pairwise t-tests performed by the r rdoc_url("t.test()") function. The t-test statistic for a given set of $R$ resampled differences is calculated as $$ t = \frac{\bar{x}R}{\sqrt{F s^2_R / R}}, $$ where $\bar{x}_R$ and $s^2_R$ are the sample mean and variance. Statistical testing for a mean difference is then performed by comparing $t$ to a $t{R-1}$ null distribution. The sample variance in the t statistic is known to underestimate the true variances of cross-validation mean estimators. Underestimation of these variances will lead to increased probabilities of false-positive statistical conclusions. Thus, an additional factor $F$ is included in the t statistic to allow for variance corrections. A correction of $F = 1 + K / (K - 1)$ was found by Nadeau and Bengio [-@nadeau:2003:IGE] to be a good choice for cross-validation with $K$ folds and is thus used for that resampling method. The extension of this correction by Bouchaert and Frank [-@bouckaert:2004:ERS] to $F = 1 + T K / (K - 1)$ is used for cross-validation with $K$ folds repeated $T$ times. For other resampling methods $F = 1$. Below are t-test results based on the extended correction factor for 3 repeats of 5-fold cross-validation.

## Pairwise model comparisons
(res_diff <- diff(res))

summary(res_diff)

plot(res_diff)
t.test(res_diff)


brian-j-smith/MachineShop documentation built on Sept. 22, 2023, 10:01 p.m.