library(learnr)
library(caret)
library(xgboost)
library(mboost)
library(ranger)
library(mlforsocialscience)
In this notebook we (again) use the drug consumption data. The data contains records for 1885 respondents with personality measurements (e.g. Big-5), level of education, age, gender, country of residence and ethnicity as features. In addition, information on the usage of 18 drugs is included.
Source: https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29
data(drugs)
First, we build a dummy variable on LSD usage as our outcome of interest.
drugs$D_LSD <- "LSD"
drugs$D_LSD[drugs$LSD == "CL0"] <- "no_LSD"
drugs$D_LSD <- as.factor(drugs$D_LSD)
summary(drugs$D_LSD)
Then we split the data into a training and a test part, using createDataPartition from caret.
set.seed(9453)
inTrain <- createDataPartition(drugs$D_LSD, p = .8, list = FALSE, times = 1)
drugs_train <- drugs[inTrain, ]
drugs_test <- drugs[-inTrain, ]
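createDataPartition draws a stratified sample, so the class balance of the outcome should be (approximately) preserved in both parts. A quick check, assuming the drugs_train and drugs_test objects created above:

```r
# Compare the LSD class proportions in the two partitions
# (assumes drugs_train and drugs_test from the chunk above)
round(prop.table(table(drugs_train$D_LSD)), 3)
round(prop.table(table(drugs_test$D_LSD)), 3)
```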
We start with xgboost as our ML method, which is an efficient and competitive boosting implementation. We first specify the resampling method for tuning with trainControl().
ctrl <- trainControl(method = "cv",
                     number = 5,
                     summaryFunction = twoClassSummary,
                     verboseIter = TRUE,
                     classProbs = TRUE)
Using XGBoost requires taking care of many tuning parameters (see ?xgboost). For this example, we fix most of them.
grid <- expand.grid(max_depth = c(1, 3, 5),
                    nrounds = c(500, 1000, 1500),
                    eta = c(0.05, 0.01, 0.005),
                    min_child_weight = 10,
                    subsample = 0.7,
                    gamma = 0,
                    colsample_bytree = 1)
Print the tuning grid.
grid
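Since three parameters are varied over three values each while the rest are held fixed, the grid has 3 * 3 * 3 = 27 candidate settings; with 5-fold CV, train() will fit 27 * 5 = 135 boosting models. This can be verified with base R alone:

```r
# Rebuild the tuning grid: three varied parameters, four fixed ones
grid <- expand.grid(max_depth = c(1, 3, 5),
                    nrounds = c(500, 1000, 1500),
                    eta = c(0.05, 0.01, 0.005),
                    min_child_weight = 10,
                    subsample = 0.7,
                    gamma = 0,
                    colsample_bytree = 1)

# Number of parameter combinations to be evaluated by CV
nrow(grid)  # 27
```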
The grid object can now be used in the train function to guide the tuning process.
set.seed(8303)
xgb <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
               Openness + Agreeableness + Conscientiousness + Impulsive + SS,
             data = drugs_train,
             method = "xgbTree",
             trControl = ctrl,
             tuneGrid = grid,
             metric = "ROC")
Plot the tuning results.
plot(xgb)
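Besides the plot, caret stores the winning parameter combination in the bestTune element and the per-combination resampling results in the results element of the fitted object. A quick way to inspect them, assuming the xgb object from above:

```r
# Best parameter combination found by grid search
xgb$bestTune

# CV results for all combinations, sorted by ROC (best first)
head(xgb$results[order(-xgb$results$ROC), ])
```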
Here we use Model-Based Boosting as an alternative boosting approach. It has considerably fewer tuning parameters than XGBoost -- in fact, via caret we primarily have to take care of the number of boosting iterations.
grid <- expand.grid(mstop = c(50, 100, 150, 200, 250, 500),
                    prune = "no")
Now we run the tuning process with train(), using glmboost as the prediction method.
set.seed(8303)
mboost <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
                  Openness + Agreeableness + Conscientiousness + Impulsive + SS,
                data = drugs_train,
                method = "glmboost",
                trControl = ctrl,
                tuneGrid = grid,
                metric = "ROC")
Print the tuning results.
mboost
With mboost it is possible to access model "coefficients".
coef(mboost$finalModel)
This section exemplifies the usage of random search (as opposed to grid search). In caret, random search is available as an option of the trainControl() function.
ctrl <- trainControl(method = "cv",
                     number = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     search = "random")
Note that we now don't set up a tuning grid, but instead specify the number of random draws of tuning parameter settings via tuneLength in train(). We use ranger as our ML method, which implements random forests and extremely randomized trees.
set.seed(8303)
ranger <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
                  Openness + Agreeableness + Conscientiousness + Impulsive + SS,
                data = drugs_train,
                method = "ranger",
                trControl = ctrl,
                tuneLength = 10,
                metric = "ROC")
Print the tuning results (and show the randomly selected try-out values).
ranger
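For the ranger method, caret tunes mtry, splitrule, and min.node.size. The randomly drawn candidate settings and their CV performance can be pulled out of the fitted object directly, assuming the ranger object from above:

```r
# Randomly sampled tuning settings and their cross-validated ROC
ranger$results[, c("mtry", "splitrule", "min.node.size", "ROC")]

# The winning combination
ranger$bestTune
```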
This section shows some options for comparing the CV results of different methods. After running a bunch of models, we can use caret's resamples() function to gather the cross-validation results from all of them.
resamps <- resamples(list(XGBoost = xgb, mboost = mboost, RF = ranger))
This object can now be used to compare these models with respect to their performance, based on cross-validation in the training set.
summary(resamps)
We can also plot this information.
bwplot(resamps)
Another option is to compare the resampling distributions with scatterplots.
splom(resamps)
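Beyond visual inspection, caret can also test the pairwise performance differences between the models across resamples with diff(); summary() then reports the mean differences along with p-values. A short sketch, assuming the resamps object created above:

```r
# Pairwise differences in CV performance between the three models
difs <- diff(resamps)
summary(difs)

# The differences can be plotted as well
bwplot(difs)
```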