The caret package~\cite{caret} is a great little package that creates an interface to a myriad of model building tools included in various packages in the R universe. It is particularly nice as it provides a single set of functions that can be used for fitting, prediction and assessment of models, removing the need to remember the syntactic differences between the underlying packages. It is essentially a fancy wrapper for all of the modelling tools that we might want to use. To install this package:
set.seed(1)
install.packages("caret")
The vignette for this package gives a little introduction which can be viewed using
vignette("caret", package = "caret")
There is also a website with lots of information about this package. Many models can be fitted using caret; a list of them can be found here and is a useful reference. We use this package throughout today because the mechanism for training and predicting from models is consistent.
The function train() is used to fit a model given a set of training data, and predict() provides an interface for prediction given new values of the predictors. As with lots of R functions there are a number of arguments to each of these functions, and we will use several of them throughout the course.
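As a minimal sketch of the train() and predict() pairing, the example below fits a linear regression to the built-in mtcars data. The data set, the choice of predictors and the object names (training, testing, fit, preds) are illustrative assumptions rather than part of the course material.

```r
library("caret")

set.seed(1)
# Hold out a few rows of mtcars to play the role of "new" data
test_rows = sample(nrow(mtcars), 8)
training = mtcars[-test_rows, ]
testing = mtcars[test_rows, ]

# Fit a linear regression through the caret interface
fit = train(mpg ~ wt + hp, data = training, method = "lm")

# Predict the response for the held-out rows
preds = predict(fit, newdata = testing)
head(preds)
```

Swapping method = "lm" for another model name should leave the rest of the code unchanged, which is the main attraction of the common interface.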
There are lots of nice features included with the caret package, some of which I will mention here.
caret has built into its train function the ability to use parallel computation for model fitting. The parallel computation typically applies to the resampling methods that are used to assess model performance, with a model being fit and tested on a different data partition on each thread.
Using this functionality is trivial on a Linux machine. In fact, by default the train function will use parallel computing if it is available, so all we need to do is set up a parallel environment and make R aware of it.
library("doMC")
registerDoMC(cores = 8)
Unfortunately this doesn't work on Windows; however, we can use the doParallel package to achieve the same thing:
library("doParallel")
cl = makeCluster(8)
registerDoParallel(cl)
And that's it. If there is a registered parallel back end the train function will use it.
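A slightly fuller sketch of the doParallel route is given below. It assumes a machine with at least two cores, and the worker count of 2 is chosen only for illustration. It also shows how to check that a back end has been registered and how to release the workers afterwards.

```r
library("doParallel")

# Start a small cluster; parallel::detectCores() reports how many
# cores the machine has if you want to scale this up
cl = makeCluster(2)
registerDoParallel(cl)

# foreach (attached with doParallel) reports the registered worker count
n_workers = getDoParWorkers()
n_workers

# ... calls to train() made here would use the registered back end ...

# Release the workers and return to sequential execution
stopCluster(cl)
registerDoSEQ()
```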
The help pages for caret are a little more difficult to follow than those for standard functions like plot, particularly when it comes to model types. The standard notation for help pages can be used
?train
?varImp
but for information about models and their parameters we need to look in the packages that the models come from. The website above gives information about where model functions come from. To look at a particular function in a particular package we can use
?ipred::bagging
which opens the help page for the bagging function in the ipred package. This is the function behind the treebag method that we used earlier for bagged trees.
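caret can also report where a model comes from directly, which saves a trip to the website. The sketch below uses modelLookup() and getModelInfo(), both exported by caret; the object name info is just illustrative.

```r
library("caret")

# Tuning parameters that caret exposes for a given method
modelLookup("treebag")
modelLookup("pls")

# Packages that caret loads when fitting a "treebag" model;
# regex = FALSE asks for an exact match on the method name
info = getModelInfo("treebag", regex = FALSE)[["treebag"]]
info$library
```

The package names returned in info$library are the places to look for the detailed help pages on model-specific arguments.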
There are lots of plot functions associated with different caret objects. They often work by just calling plot, which produces a different plot for each type of object. To see a plot method's help file you can use the ?plot.<<object type>> notation:
?plot.train
?plot.varImp.train
Try the following example:
library("caret")
library("pls")
data(diamonds, package = "ggplot2")
# some subset to help plotting
i = sample(nrow(diamonds), 1000)
diamonds = diamonds[i, ]
m = train(price~., method = "pls", data = diamonds, tuneLength = 10)
# a plot of model object gives us the resampling
# information across tuning parameters
plot(m)
## using the varImp function with plot we get
## variable importance scores
plot(varImp(m))
# a plot of the final model shows predicted against
# observed values
plot(m$finalModel)
## a plot of residuals against fitted values
plot(fitted.values(m), resid(m))
There are lots of components stored in a train object, which can be explored easily since the object is just a list
names(m)
In addition,
str(m)
gives lots of information about the structure of the object, the output of which has been omitted here.
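As a rough illustration of pulling individual components out of a train object, the sketch below fits a small k-nearest-neighbour model to mtcars. The data set, the method and the name fit are assumptions made so that the example is self-contained. The components results, bestTune and finalModel are standard elements of a train object.

```r
library("caret")

set.seed(1)
fit = train(mpg ~ ., data = mtcars, method = "knn", tuneLength = 3)

# Resampled performance for each candidate value of k
fit$results

# The tuning value that was selected
fit$bestTune

# The underlying fitted model, refit on the full training data
class(fit$finalModel)
```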
The trainControl function takes a number of arguments for helping to choose which model to keep. For full details see the help page, but here are a few useful ones:
method -- the method used for estimation of statistics, options include boot, cv and LOOCV
index -- for specifying a particular training set via indices, by default the rest will be used for validation
Some useful arguments for the train function:
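metric -- the performance measure used to select the best model, RMSE or Accuracy where appropriate
preProcess -- pre-processing applied to the predictors, for example center and scale
... -- further arguments passed on to the underlying model function, for example nbagg for the treebag method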
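To show how these pieces fit together, here is a minimal sketch that passes a trainControl object to train. It assumes that the RMSE and center/scale values above belong to the metric and preProcess arguments of train. The data set (mtcars), the knn method and the 5 folds are illustrative choices only.

```r
library("caret")

set.seed(1)
# 5-fold cross validation instead of the default bootstrap
ctrl = trainControl(method = "cv", number = 5)

fit = train(mpg ~ ., data = mtcars, method = "knn",
            trControl = ctrl,                   # resampling scheme
            metric = "RMSE",                    # measure used to pick the best k
            preProcess = c("center", "scale"),  # standardise the predictors
            tuneLength = 3)
fit
```

Printing fit summarises the resampling scheme, the pre-processing and the performance for each candidate tuning value.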