Model development often involves the comparison of multiple models from a candidate set for the purpose of selecting a final one. Models in the set may differ with respect to their predictor variables, preprocessing steps and parameters, and model types and parameters. Complex model selection strategies for sets that involve one or more of these differences can be implemented with the MachineShop package. Implementation is achieved with a straightforward syntax based on the meta-input and meta-model functions listed in the table below and with resampling, including nested resampling, conducted automatically for model selection and predictive performance evaluation.
| Parameter Grid Tuning | Candidate Set Selection | Ensemble Learning
|:-----------------------------|:--------------------------------|:------------------------------
| r rdoc_url("TunedInput()")
| r rdoc_url("SelectedInput()")
| r rdoc_url("StackedModel()")
| r rdoc_url("TunedModel()")
| r rdoc_url("SelectedModel()")
| r rdoc_url("SuperModel()")
These meta-functions fall into three main categories: 1) tuning of a given input or model over a grid of parameter values, 2) selection from an arbitrary set of different inputs or models, or 3) combining multiple models into an ensemble learner. In the context of these strategies, an input may be a formula, design matrix, model frame, or preprocessing recipe. The meta-input and meta-model functions themselves return input and model class objects, respectively. Combinations and multiple levels of nesting of meta-functions, inputs, and models are allowed. For example, r rdoc_url("StackedModel()")
and r rdoc_url("SuperModel()")
may consist of r rdoc_url("TunedModel")
and other model objects. r rdoc_url("SelectedModel()")
can select among mixes of r rdoc_url("TunedModel")
, ensemble model, and other model objects. Likewise, r rdoc_url("TunedInput")
objects, along with other inputs, may be nested within r rdoc_url("SelectedInput()")
. Furthermore, selection and tuning of both inputs and models can be performed simultaneously. These and other possibilities are illustrated in the following sections.
Inputs to model fitting functions define the predictor and response variables and the dataset containing their values. These can be specified with traditional formula and dataset pairs, design matrix and response variable pairs, model frames, and preprocessing recipes. The package supports (1) tuning of an input over a grid of parameter values and (2) selection of inputs from candidate sets that differ with respect to their predictors or their preprocessing steps and parameters.
Preprocessing recipes may have step with parameters that affect predictive performance. Steps can be tuned over a grid of parameter values with r rdoc_url("TunedInput()")
to select the best performing values. Calls to r rdoc_url("TunedInput()")
return an input object that may be trained on data with the r rdoc_url("fit()")
function or evaluated for predictive performance with r rdoc_url("resample()")
. As an example, a principal components analysis (PCA) step could be included in a preprocessing recipe for tuning over the number of components to retain in the final model. Such a recipe is shown below accompanied by a call to r rdoc_url("expand_steps()")
to construct a tuning grid. The grid parameter num_comp
and name PCA
correspond to the argument and id of the r rdoc_url("step_pca()")
function to which the values 1:3
apply. The recipe and grid may then be passed to r rdoc_url("TunedInput()")
for model fitting.
## Preprocessing recipe with PCA steps pca_rec <- recipe(y ~ ., data = surv_train) %>% role_case(stratum = y) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) %>% step_pca(all_predictors(), id = "PCA") ## Tuning grid of number of PCA components pca_grid <- expand_steps( PCA = list(num_comp = 1:3) ) ## Tuning specification tun_rec <- TunedInput(pca_rec, grid = pca_grid)
From the fit, the resulting model can be extracted with r rdoc_url("as.MLModel()")
. The output shows that one principal component was selected. Resample estimation of predictive performance is applied to a r rdoc_url("TunedInput")
specification for the selection. The default resampling method is cross-validation. Other methods, performance metrics, and selection statistics can be supplied to the r rdoc_url("TunedInput()")
arguments.
## Input-tuned model fit and final trained model model_fit <- fit(tun_rec, model = GBMModel) as.MLModel(model_fit) #> --- MLModel object ---------------------------------------------------------------------- #> #> Model name: GBMModel #> Label: Trained Generalized Boosted Regression #> Package: gbm #> Response types: factor, numeric, PoissonVariate, Surv #> Case weights support: TRUE #> Missing case removal: response #> Tuning grid: TRUE #> Variable importance: TRUE #> #> Parameters: #> List of 5 #> $ n.trees : num 100 #> $ interaction.depth: num 1 #> $ n.minobsinnode : num 10 #> $ shrinkage : num 0.1 #> $ bag.fraction : num 0.5 #> #> === $TrainingStep1 ====================================================================== #> === TrainingStep object === #> #> Optimization method: Grid Search #> TunedModelRecipe log: #> # A tibble: 3 × 4 #> name selected params$PCA$num_comp metrics$`C-Index` #> <chr> <lgl> <int> <dbl> #> 1 ModelRecipe.1 TRUE 1 0.740 #> 2 ModelRecipe.2 FALSE 2 0.720 #> 3 ModelRecipe.3 FALSE 3 0.725 #> #> Selected row: 1 #> Metric: C-Index = 0.7399641
Selection of recipes with different steps or predictors can be conducted with r rdoc_url("SelectedInput()")
.
## Preprocessing recipe without PCA steps rec1 <- recipe(y ~ sex + age + year + thickness + ulcer, data = surv_train) %>% role_case(stratum = y) rec2 <- recipe(y ~ sex + age + year, data = surv_train) %>% role_case(stratum = y) ## Selection among recipes with and without PCA steps sel_rec <- SelectedInput( rec1, rec2, TunedInput(pca_rec, grid = pca_grid) )
In this case, the third recipe with PCA steps is selected.
## Input-selected model fit and model model_fit <- fit(sel_rec, model = GBMModel) as.MLModel(model_fit) #> --- MLModel object ---------------------------------------------------------------------- #> #> Model name: GBMModel #> Label: Trained Generalized Boosted Regression #> Package: gbm #> Response types: factor, numeric, PoissonVariate, Surv #> Case weights support: TRUE #> Missing case removal: response #> Tuning grid: TRUE #> Variable importance: TRUE #> #> Parameters: #> List of 5 #> $ n.trees : num 100 #> $ interaction.depth: num 1 #> $ n.minobsinnode : num 10 #> $ shrinkage : num 0.1 #> $ bag.fraction : num 0.5 #> #> === $TrainingStep1 ====================================================================== #> === TrainingStep object === #> #> Optimization method: Grid Search #> SelectedModelRecipe log: #> # A tibble: 3 × 4 #> name selected params$id metrics$`C-Index` #> <chr> <lgl> <chr> <dbl> #> 1 ModelRecipe.1 FALSE input.azpi 0.761 #> 2 ModelRecipe.2 FALSE input.aLHo 0.643 #> 3 TunedModelRecipe TRUE input.W4KN 0.796 #> #> Selected row: 3 #> Metric: C-Index = 0.7960841 #> #> === $TrainingStep2 ====================================================================== #> === TrainingStep object === #> #> Optimization method: Grid Search #> TunedModelRecipe log: #> # A tibble: 3 × 4 #> name selected params$PCA$num_comp metrics$`C-Index` #> <chr> <lgl> <int> <dbl> #> 1 ModelRecipe.1 TRUE 1 0.740 #> 2 ModelRecipe.2 FALSE 2 0.720 #> 3 ModelRecipe.3 FALSE 3 0.725 #> #> Selected row: 1 #> Metric: C-Index = 0.7399641
Selection can also be performed among traditional formulas, design matrices, or model frames.
## Traditional formulas fo1 <- y ~ sex + age + year + thickness + ulcer fo2 <- y ~ sex + age + year ## Selection among formulas sel_fo <- SelectedInput(fo1, fo2, data = surv_train) ## Input-selected model fit and final trained model model_fit <- fit(sel_fo, model = GBMModel) as.MLModel(model_fit)
In the previous examples, selection of different inputs was performed with the same model (r rdoc_url("GBMModel")
). Selection among different combinations of inputs and models is supported with the r rdoc_url("ModelSpecification()")
constructor.
## Different combinations of inputs and models sel_mfo <- SelectedInput( ModelSpecification(fo1, data = surv_train, model = CoxModel), ModelSpecification(fo2, data = surv_train, model = GBMModel) ) ## Input-selected model fit and final trained model model_fit <- fit(sel_mfo) as.MLModel(model_fit)
Models define the functional relationships between predictor and response variables from a given set of inputs.
Many of the package-supplied modeling functions have arguments, or tuning parameters, that control aspects of their model fitting algorithms. For example, r rdoc_url("GBMModel")
parameters n.trees
and interaction.depth
control the number of decision trees to fit and the maximum tree depths. When called with a r rdoc_url("TunedModel")
, the r rdoc_url("fit()")
function performs model fitting over a grid of parameter values and returns the model with the optimal values. Optimality is determined based on the first performance metric of the metrics
argument to r rdoc_url("TunedModel()")
if given or the first default metric of the r rdoc_url("performance()")
function otherwise. Argument grid
additionally controls the construction of grid values and can be a single integer or vector of integers whose positions or names match the parameters in a model's pre-defined tuning grid if one exists and which specify the number of values used to construct the grid. Pre-defined r rdoc_url("TunedModel")
grids can be extract and viewed apart from model fitting with r rdoc_url("expand_modelgrid()")
. As shown in the output below, r rdoc_url("as.MLModel()")
will extract a tuned model from fit results for viewing of the tuning parameter grid values, the names of models fit to each, all calculated metrics, the final model selected, the metric upon which its selection was based, and its parameters.
## Tune over automatic grid of model parameters model_fit <- fit(surv_fo, data = surv_train, model = TunedModel( GBMModel, grid = 3, control = surv_means_control, metrics = c("CIndex" = cindex, "RMSE" = rmse) )) (trained_model <- as.MLModel(model_fit)) #> --- MLModel object ---------------------------------------------------------------------- #> #> Model name: GBMModel #> Label: Trained Generalized Boosted Regression #> Package: gbm #> Response types: factor, numeric, PoissonVariate, Surv #> Case weights support: TRUE #> Missing case removal: response #> Tuning grid: TRUE #> Variable importance: TRUE #> #> Parameters: #> List of 5 #> $ n.trees : int 50 #> $ interaction.depth: int 1 #> $ n.minobsinnode : num 10 #> $ shrinkage : num 0.1 #> $ bag.fraction : num 0.5 #> #> === $TrainingStep1 ====================================================================== #> === TrainingStep object === #> #> Optimization method: Grid Search #> TunedModel log: #> # A tibble: 9 × 4 #> name selected params$n.trees $interaction.depth metrics$CIndex $RMSE #> <chr> <lgl> <int> <int> <dbl> <dbl> #> 1 GBMModel.1 TRUE 50 1 0.765 3869. #> 2 GBMModel.2 FALSE 50 2 0.750 5450. #> 3 GBMModel.3 FALSE 50 3 0.739 7877. #> 4 GBMModel.4 FALSE 100 1 0.746 3919. #> 5 GBMModel.5 FALSE 100 2 0.730 5262. #> 6 GBMModel.6 FALSE 100 3 0.720 10596. #> 7 GBMModel.7 FALSE 150 1 0.726 4167. #> 8 GBMModel.8 FALSE 150 2 0.712 5428. #> 9 GBMModel.9 FALSE 150 3 0.699 16942. #> #> Selected row: 1 #> Metric: CIndex = 0.7647633
Grid values may also be given as a r rdoc_url("TuningGrid")
function, function name, or object; r rdoc_url("ParameterGrid")
object; or data frame containing parameter values at which to evaluate the model, such as that returned by r rdoc_url("expand_params()")
.
## Tune over randomly sampled grid points fit(surv_fo, data = surv_train, model = TunedModel( GBMModel, grid = TuningGrid(size = 100, random = 10), control = surv_means_control )) ## Tune over user-specified grid points fit(surv_fo, data = surv_train, model = TunedModel( GBMModel, grid = expand_params(n.trees = c(25, 50, 100), interaction.depth = 1:3), control = surv_means_control ))
Statistics summarizing the resampled performance metrics across all tuning parameter combinations can be obtained with the r rdoc_url("summary()")
and r rdoc_url("performance()")
functions.
summary(trained_model) #> --- $TrainingStep1 ---------------------------------------------------------------------- #> # A tibble: 9 × 4 #> name selected params$n.trees $interaction.depth metrics$CIndex $RMSE #> <chr> <lgl> <int> <int> <dbl> <dbl> #> 1 GBMModel.1 TRUE 50 1 0.765 3869. #> 2 GBMModel.2 FALSE 50 2 0.750 5450. #> 3 GBMModel.3 FALSE 50 3 0.739 7877. #> 4 GBMModel.4 FALSE 100 1 0.746 3919. #> 5 GBMModel.5 FALSE 100 2 0.730 5262. #> 6 GBMModel.6 FALSE 100 3 0.720 10596. #> 7 GBMModel.7 FALSE 150 1 0.726 4167. #> 8 GBMModel.8 FALSE 150 2 0.712 5428. #> 9 GBMModel.9 FALSE 150 3 0.699 16942. performance(trained_model) %>% lapply(summary) #> $TrainingStep1 #> , , Metric = CIndex #> #> Statistic #> Model Mean Median SD Min Max NA #> GBMModel.1 0.7647633 0.7630332 0.05151443 0.6497797 0.8632075 0 #> GBMModel.2 0.7502765 0.7549020 0.04636802 0.6688742 0.8349057 0 #> GBMModel.3 0.7388766 0.7500000 0.04594197 0.6621622 0.8197425 0 #> GBMModel.4 0.7463392 0.7601810 0.05087334 0.6497797 0.8160377 0 #> GBMModel.5 0.7301734 0.7345133 0.04361416 0.6621622 0.7847222 0 #> GBMModel.6 0.7199483 0.7122642 0.05168649 0.6216216 0.7889908 0 #> GBMModel.7 0.7263693 0.7385321 0.05069831 0.6351351 0.7896996 0 #> GBMModel.8 0.7121314 0.7169811 0.05331152 0.6199095 0.7847222 0 #> GBMModel.9 0.6994678 0.7156863 0.05314909 0.6081081 0.7639485 0 #> #> , , Metric = RMSE #> #> Statistic #> Model Mean Median SD Min Max NA #> GBMModel.1 3868.629 3321.249 1431.839 2123.875 7401.789 0 #> GBMModel.2 5450.093 4741.129 2438.173 2385.426 11695.273 0 #> GBMModel.3 7877.487 5827.739 5036.837 4192.463 20009.854 0 #> GBMModel.4 3919.126 3535.514 1454.999 2373.413 7558.280 0 #> GBMModel.5 5262.122 5052.969 1900.353 2280.193 8212.727 0 #> GBMModel.6 10595.888 8308.733 7057.327 4996.420 28610.367 0 #> GBMModel.7 4167.128 3789.596 1115.371 2587.440 6701.087 0 #> GBMModel.8 5427.518 6129.309 2313.742 2039.894 9212.369 0 #> GBMModel.9 16941.696 10865.160 16841.774 4914.743 71150.177 0
Line plots of tuning results display the resampled metric means, or another statistic specified with the stat
argument, versus the first tuning parameter values and with lines grouped according to the remaining parameters, if any.
plot(trained_model, type = "line") #> $TrainStep1
knitr::include_graphics("img/using_strategies_tune_plot-1.png")
Model selection can be conducted by calling r rdoc_url("fit()")
with a r rdoc_url("SelectedModel")
to automatically choose from any combination of models and model parameters. Selection has as a special case the just-discussed tuning of a single model over a grid of parameter values. Combinations of model functions, function names, or objects can be supplied to r rdoc_url("SelectedModel()")
in order to define sets of candidate models from which to select. An r rdoc_url("expand_model()")
helper function is additionally available to expand a model over a grid of tuning parameters for inclusion in the candidate set if so desired.
## Model interface for model selection sel_model <- SelectedModel( expand_model(GBMModel, n.trees = c(50, 100), interaction.depth = 1:2), GLMNetModel(lambda = 0.01), CoxModel, SurvRegModel ) ## Fit the selected model fit(surv_fo, data = surv_train, model = sel_model)
Selection may also be performed over candidate sets that include tuned models. For instance, the r rdoc_url("SelectedModel()")
function is applicable to sets containing different classes of models each individually tuned over a grid of parameters.
## Model interface for selection among tuned models sel_tun_model <- SelectedModel( TunedModel(GBMModel, control = surv_means_control), TunedModel(GLMNetModel, control = surv_means_control), TunedModel(CoxModel, control = surv_means_control) ) ## Fit the selected tuned model fit(surv_fo, data = surv_train, model = sel_tun_model)
Ensemble learning models combine $m = 1, \ldots, M$ base models as a strategy to improve predictive performance. Two methods implemented in MachineShop are stacked regression [@breiman:1996:SR] and super learners [@vanderLaan:2007:SL]. Stacked regression fits a linear combination of predictions from specified base learners to produce a prediction function of the form $$ \hat{f}(x) = \sum_{m=1}^M \hat{w}m \hat{f}_m(x). $$ Stacking weights $w$ are estimated by (constrained) least squares regression of case responses $y_i$ on predictions $\hat{f}_m^{-\kappa(i)}(x_i)$ from learners fit to data subsamples $-\kappa(i)$ not containing the corresponding cases. In particular, they are obtained as the solution $$ \hat{w} = \underset{w}{\operatorname{argmin}} \sum{i=1}^{N}\left(y_i - \sum_{m=1}^{M} w_m \hat{f}_m^{-\kappa(i)}(x_i) \right)^2 $$ subject to the constraints that all $w_m \ge 0$ and $\sum_m w_m = 1$. K-fold cross-validation is the default subsampling method employed in the estimation, with the other resampling methods provided by the package available as options. Survival outcomes are handled with a modified version of the stacked regression algorithm in which
Super learners are a generalization of stacked regression that fit a specified model, such as r rdoc_url("GBMModel")
, to case responses $y_i$, base learner predictions $\hat{f}_m^{-\kappa(i)}(x_i)$, and optionally also to the original predictor variables $x_i$. Given below are examples of a stacked regression and super learner each fit with gradient boosted, random forest, and Cox regression base learners. A separate gradient boosted model is used as the super learner in the latter.
## Stacked regression stackedmodel <- StackedModel(CoxModel, CForestModel, GLMBoostModel) res_stacked <- resample(surv_fo, data = surv_train, model = stackedmodel) summary(res_stacked) #> Statistic #> Metric Mean Median SD Min Max NA #> C-Index 0.7294869 0.762963 0.1275484 0.5194805 0.8556701 0 ## Super learner supermodel <- SuperModel(CoxModel, CForestModel, GLMBoostModel, model = GBMModel) res_super <- resample(surv_fo, data = surv_train, model = supermodel) summary(res_super) #> Statistic #> Metric Mean Median SD Min Max NA #> C-Index 0.7534803 0.8325472 0.1243104 0.5748899 0.8454545 0
Combinations and multiple levels of nested meta-functions, inputs, and models are possible. If model fitting involves a single meta-function, performances of the inputs or models under consideration are estimated with standard resampling, and the best performing model is returned. Nestings of meta-functions are trained with nested resampling. Consider the example below in which training involves input tuning and model selection. In particular, a preprocessing recipe is tuned over the number of predictor-derived principal components and model selection is of an untuned r rdoc_url("GBMModel")
, a tuned r rdoc_url("GBMModel")
, and a r rdoc_url("StackedModel")
.
## Preprocessing recipe with PCA steps pca_rec <- recipe(y ~ ., data = surv_train) %>% role_case(stratum = y) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) %>% step_pca(all_predictors(), id = "PCA") ## Tuning grid of number of PCA components pca_grid <- expand_steps( PCA = list(num_comp = 1:3) ) ## Input specification tun_rec <- TunedInput(pca_rec, grid = pca_grid) ## Model specification sel_model <- SelectedModel( GBMModel, TunedModel(GBMModel), StackedModel(CoxModel, TunedModel(CForestModel), TunedModel(GBMModel)) ) ## Model fit and final trained model model_fit <- fit(tun_rec, model = sel_model) as.MLModel(model_fit)
Model fitting proceeds with instances of the specified model selection nested within each of the input tuning grid parameter values. Tuning of r rdoc_url("GBMModel")
and construction of r rdoc_url("StackedModel")
are further nested within the model selection, with tuning of r rdoc_url("CForestModel")
and r rdoc_url("GBMModel")
nested within r rdoc_url("StackedModel")
. Altogether, there are four levels of meta-input and meta-model functions in the hierarchy.
knitr::include_graphics("img/FigModelDAG.png")
Each meta-function is fit based on resample estimation (default: cross-validation) of predictive performance. When one meta-function is nested within another, nested resampling is employed, as illustrated in the figure below.
knitr::include_graphics("img/FigNestedCV.png")
Nesting of resampling routines is repeated recursively when a fit involves multiple levels of nested meta-functions. For example, predictive performance estimation for the training of r rdoc_url("TunedInput(pca_rec, grid = pca_grid)")
involves up to three nested meta functions: r rdoc_url("SelectedModel(...)")
→ r rdoc_url("StackedModel(...)")
→ r rdoc_url("TunedModel(CForestModel)")
. For this relationship, an outer and three nested inner resampling loops are executed as follows. First, r rdoc_url("CForestModel")
is tuned at the third inner resampling loop. Second, the tuned model is passed to the second inner loop for construction of r rdoc_url("StackedModel")
. Third, the constructed model is passed to the first inner loop for model selection from the candidate set. Finally, the selected model is passed to the outer loop for tuning of the preprocessing recipe. Based on resample performance estimation of the entire input/model specification, one principal component is selected for the GBMModel
.
#> === $TrainingStep1 ====================================================================== #> === TrainingStep object === #> #> Optimization method: Grid Search #> TunedModelRecipe log: #> # A tibble: 3 × 4 #> name selected params$PCA$num_comp metrics$`C-Index` #> <chr> <lgl> <int> <dbl> #> 1 ModelRecipe.1 TRUE 1 0.741 #> 2 ModelRecipe.2 FALSE 2 0.730 #> 3 ModelRecipe.3 FALSE 3 0.702 #> #> Selected row: 1 #> Metric: C-Index = 0.7405534
In order to identify and return a final model fitted to the entire input data, the hierarchy is traversed from top to bottom along the path determined by the choice at each node. Steps along the path are labeled TrainStep1
and TrainStep2
in the output. As seen above in TrainStep1
, one principal component is first selected for the tuned input. Using an input recipe with one principal component, the entire dataset is refit at TrainStep2
to finally select r rdoc_url("GBMModel")
.
#> === $TrainingStep2 ====================================================================== #> === TrainingStep object === #> #> Optimization method: Grid Search #> SelectedModel log: #> # A tibble: 3 × 4 #> name selected params$id metrics$`C-Index` #> <chr> <lgl> <chr> <dbl> #> 1 GBMModel TRUE model.1mQk 0.740 #> 2 TunedModel FALSE model.hRF5 0.735 #> 3 StackedModel FALSE model.CYj0 0.653 #> #> Selected row: 1 #> Metric: C-Index = 0.7399641
After the series of training steps reaches the bottom of its hierarchy, the final model is fitted to the entire dataset and returned.
#> --- MLModel object ---------------------------------------------------------------------- #> #> Model name: GBMModel #> Label: Trained Generalized Boosted Regression #> Package: gbm #> Response types: factor, numeric, PoissonVariate, Surv #> Case weights support: TRUE #> Missing case removal: response #> Tuning grid: TRUE #> Variable importance: TRUE #> #> Parameters: #> List of 5 #> $ n.trees : num 100 #> $ interaction.depth: num 1 #> $ n.minobsinnode : num 10 #> $ shrinkage : num 0.1 #> $ bag.fraction : num 0.5
Generalization performance of the entire process can be estimated with a call to r rdoc_url("resample()")
.
## Generalization performance of the modeling strategy resample(tun_rec, model = sel_model)
There is no conceptual limit to the number of nested inputs and models that can be specified with the package. However, there are some practical issues to consider.
Computational Expense : Computational expense of nested resampling increases exponentially. For instance, execution of r levels of a nested 10-fold cross-validation algorithm is an O(10^r^) operation. Runtimes can be decreased by registering multiple cores to run the resampling algorithms in parallel. However, the exponential increase in computational complexity quickly outpaces the number of available cores.
Data Reduction : Training data is reduced at each subsequent resampling level. For 10-fold cross-validation and a training set of N total cases, there will be 0.9^r^ cases available at each fold of the r^th^ resampling algorithm. Bootstrapping could be used, as an alternative to cross-validation, to ensure N cases at each resampling level. However, the number of unique cases at level r will be decreased to approximately N(2/3)^r^.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.