This vignette is a short example of cross-validating a model. Begin by creating some artificial data:

set.seed(1)  # make the artificial data reproducible
DF = data.frame(Grp = rep(LETTERS[24:25], c(70, 30)),  # groups X and Y, 70/30 split
                a = round(rnorm(100, 50, 5), 1),
                b = round(runif(100, 10, 90)),
                c = round(runif(100, 50, 100)))
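
A quick check confirms the 70/30 group split, which the stratified sampling below should preserve within each fold:

table(DF$Grp)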

We will create a linear model using 'a' as the response variable:

$$ a = \alpha + \beta_1 b + \beta_2 c + \beta_3 \mathrm{Grp} $$

We will carry out 5-fold cross-validation on this model. To do that, we split the data into five training/test pairs, where each observation appears in exactly one test set. We will also preserve the ratio of Grp membership in each fold; that is, the sampling will be stratified.

We begin by using Kcross to generate the fold indices:

library(ColsTools)
DFindexes = Kcross(DF = DF, K = 5, Strat = 'Grp')
str(DFindexes)

Now we use ModCV within lapply to carry out cross-validation of the model. Currently, this only works for a regression model, i.e., a continuous response variable. The results are stored in a new list: the mean squared error (computed with the mse function), the actual errors (the differences between predictions and actual values in each test set), and the model coefficients.

CV = lapply(1:5, FUN = ModCV, indexlist = DFindexes, Formula = formula(a~.),
            DFrame = DF, Func = lm, Resp = 'a')

The first argument is the vector of fold indices (1:5); then we pass ModCV and its arguments. We specify a linear model (lm), but equally we can use another type of model with the same formula structure, e.g., a random forest.
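
To make the mechanics concrete, here is a minimal hand-rolled sketch of what a single fold involves, written in base R and ignoring stratification for brevity; it is an illustration, not ModCV's actual internals:

testIdx  = sample(nrow(DF), 20)                 # hold out 20 rows as the test set
trainIdx = setdiff(seq_len(nrow(DF)), testIdx)  # remaining rows form the training set
fit  = lm(a ~ ., data = DF[trainIdx, ])         # fit on the training set only
pred = predict(fit, newdata = DF[testIdx, ])    # predict the held-out rows
err  = pred - DF$a[testIdx]                     # errors: predictions minus actual values
mean(err^2)                                     # mean squared error for this fold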

We can look at the five mean squared error values:

unlist(lapply(CV, FUN = function(x) c(x$MSE)))
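
A single cross-validated error estimate is simply the mean of these fold-wise values:

mean(unlist(lapply(CV, FUN = function(x) x$MSE)))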

We can also, for example, produce a histogram of all the error values:

Allerrors = unlist(lapply(CV, FUN = function(x) c(x$Error)))
hist(Allerrors, main="", breaks=12)
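
A numerical summary of the same pooled errors:

summary(Allerrors)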

The code follows the same format as for extracting the MSE values. Getting the coefficients is similar but uses sapply:

Tab = t(sapply(CV, FUN = function(x) x$Coef))  # one row per fold
colnames(Tab) = names(CV[[1]]$Coef)
Tab
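
Averaging down the columns gives a quick view of how stable the coefficient estimates are across the folds:

round(colMeans(Tab), 3)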

And by way of example, here's the same procedure using random forest regression.

library(randomForest)
CVRF = lapply(1:5, FUN = ModCV, indexlist = DFindexes, Formula = formula(a~.),
            DFrame = DF, Func = randomForest, Resp = 'a')
unlist(lapply(CVRF, FUN = function(x) c(x$MSE)))
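
The fold-wise values can be averaged to compare the two models directly (on this one artificial data set, of course):

c(lm = mean(unlist(lapply(CV,   FUN = function(x) x$MSE))),
  rf = mean(unlist(lapply(CVRF, FUN = function(x) x$MSE))))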

And a neural network. Here we write out the formula from the linear model earlier, because neuralnet cannot handle the a ~ . shorthand for all the other variables:

library(neuralnet)
CVNN = lapply(1:5, FUN = ModCV, indexlist = DFindexes, Formula = formula(a ~ b + c + Grp),
            DFrame = DF, Func = neuralnet, Resp = 'a')
unlist(lapply(CVNN, FUN = function(x) c(x$MSE)))

Of course, there are other tuning parameters to vary in these last two models.
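
For instance, ntree and mtry are tuning arguments of randomForest, and hidden sets the layer sizes in neuralnet. The fits below are made directly on the full data purely to illustrate the arguments; whether ModCV forwards such extra arguments to Func is not shown here:

rf = randomForest(a ~ ., data = DF, ntree = 1000, mtry = 2)   # more trees, 2 candidate variables per split
nn = neuralnet(a ~ b + c, data = DF, hidden = c(4, 2))        # two hidden layers of 4 and 2 units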


