knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options("digits"=3)
library(doBy)
library(boot)
#devtools::load_all()

Stability selection: differences in models due to different data sets

Suppose we have a collection of regression models $M_1, M_2, \ldots, M_p$. By a regression model we here mean a set of predictors together with a model for the response. To assess the predictive ability of the models we perform cross validation: repeatedly fit each model to one part of the data and predict another part, and compute a prediction error for each model. The model with the smallest prediction error is taken to be the best model.
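
As a minimal sketch of this idea (not part of the analysis below), we can compare two hand-picked candidate models for the mtcars data using cv.glm from the boot package loaded above; the two models and the choice of 5 folds are arbitrary:

## Two candidate models for fuel consumption in mtcars (illustration only)
m1 <- glm(mpg ~ wt, data = mtcars)
m2 <- glm(mpg ~ wt + hp, data = mtcars)
## Estimated prediction error (first component of delta) from 5-fold cross validation
set.seed(1)
c(m1 = cv.glm(mtcars, m1, K = 5)$delta[1],
  m2 = cv.glm(mtcars, m2, K = 5)$delta[1])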

But before we get that far, we need a way of generating the collection of candidate models. One approach is the following:

A model selection method consists of two parts:

  1. A criterion for selecting a model and
  2. A strategy for generating candidate models, that is, a strategy for navigating the model space (a standard combination of criterion and strategy is sketched below).
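
As a standard illustration of such a combination (not necessarily the method used internally by the functions shown later), stats::step uses AIC as the criterion and stepwise elimination as the strategy for navigating the model space:

## Criterion: AIC; strategy: backward elimination from a larger model
big <- glm(mpg ~ wt + hp + disp + drat, data = mtcars)
sel <- step(big, direction = "backward", trace = 0)
formula(sel)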

If we have $K$ data sets we can perform $K$ model selections, and this gives us a collection of candidate models on which we can then perform cross validation. The $K$ data sets can be generated by resampling or by forming subgroups. Resampling means drawing $n$ observations with replacement from the original data set. Subgroups are formed by dividing the data set into $K$ (roughly) equal parts; for each $k=1,\ldots,K$ the $k$th part is left out, and the $k$th data set consists of the remaining observations. The subgroup method is the default below.
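
For illustration only, the two schemes can be sketched in base R as follows; the actual work below is done by generate_data_list, so the code here is only meant to convey the idea (mtcars and $K=4$ are arbitrary choices):

n <- nrow(mtcars); K <- 4
## Resampling: each data set consists of n rows drawn with replacement
resamp <- lapply(1:K, function(k) mtcars[sample(n, replace=TRUE), ])
## Subgroups: split the rows into K parts; the kth data set omits the kth part
grp   <- split(sample(n), rep(1:K, length.out=n))
subgr <- lapply(1:K, function(k) mtcars[-grp[[k]], ])
sapply(resamp, nrow)
sapply(subgr, nrow)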

Generating data sets

Consider this subset of the mtcars dataset:

dat0 <- mtcars[1:6,]
dat0

Generate a list of data sets by resampling or by forming subgroups with generate_data_list():

generate_data_list(dat0, K=3, method="subgroups")
generate_data_list(dat0, K=3, method="resample")
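
As the printed output suggests, the result is a list of data frames. A quick check of the sizes shows that with the subgroup method each data set omits one of the $K$ subgroups:

dat_list <- generate_data_list(dat0, K=3, method="subgroups")
sapply(dat_list, nrow)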

Model selection

We now consider the personality data from doBy. To evaluate the predictive ability of the selected models we set 40% of the data aside for testing and select models based on the remaining 60% (the training data). Fitting a model to each generated data set and selecting a model by stepwise regression gives an indication of the stability of the model selection.

set.seed(1411)
dat <- personality
train <- sample(1:nrow(dat), 0.6*nrow(dat))
dat.training <- dat[train, ]
dat.testing <- dat[-train, ]
mod1 <- glm(agreebl ~ ., data = dat.training)

For each of the generated subsets of the training data we can select a model:

set.seed(1411)
n.searches <- 12
mod_stab <- model_stability_glm(dat.training, mod1,
                           n.searches=n.searches, method="subgroups", trace=0)
mod_stab
summary(mod_stab)

Specifically, the matrix shows which predictors are selected in each model, and the frequencies show how often each distinct model is selected.

The outcome of the steps above is a list of models. Since the same model selection method is applied to each group of data, any differences among the selected models are due to differences in the data in the groups.

We can extract the formulas of the selected models as follows:

formula(mod_stab, fit=FALSE)[1:2]

This gives us a set of candidate models.

Predictive performance

We have selected a set of candidate models based on subgroups of the training data. We can now evaluate their predictive performance, both by cross validation on the training data and by prediction of the test data.

fit_list <- formula(mod_stab, fit=TRUE)

set.seed(1411)
cv.error <- cv_glm_fitlist(dat.training, fit_list, K=4)
cv.error


## Test-set RMSE: refit each candidate model on the full training data
## and compute the root mean squared prediction error on the test data
cv3 <- sapply(fit_list,
       function(fit) {
         x <- update(fit, data=dat.training)
         modelr::rmse(x, dat.testing)
       }
)
cv3

plot(cv.error, cv3)

We can display the two error measures for the candidate models side by side:

par(mfrow=c(1,2))
plot(sqrt(cv.error), main="Subgroups")
plot(cv3, main="Test data")

The plot on the left shows the cross validation error for the models selected in the subgroups. The plot on the right shows the prediction error for the models on the test data.
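
If a single model is wanted, a natural (if greedy) choice is the candidate with the smallest error, for instance based on the test-set RMSE computed above; note that this ignores the uncertainty of the error estimates:

best <- which.min(cv3)     ## candidate with the smallest test-set RMSE
formula(fit_list[[best]])  ## formula of that candidate model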

