library(learnr)
library(caret)
library(gbm)
library(rpart)
library(pROC)
library(mlforsocialscience)
In this notebook we (again) use the drug consumption data. The data contains records for 1885 respondents with personality measurements (e.g. the Big Five), level of education, age, gender, country of residence, and ethnicity as features. In addition, information on the usage of 18 drugs is included.
Source: https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29
data(drugs)
First, we build a dummy variable for LSD usage as our outcome of interest.
drugs$D_LSD <- "LSD"
drugs$D_LSD[drugs$LSD == "CL0"] <- "no_LSD"
drugs$D_LSD <- as.factor(drugs$D_LSD)
drugs$D_LSD <- relevel(drugs$D_LSD, "no_LSD")
table(drugs$LSD, drugs$D_LSD)
summary(drugs$D_LSD)
Then we split the data into a training and a test part, using createDataPartition from caret.
set.seed(9453)
inTrain <- createDataPartition(drugs$D_LSD, p = .8, list = FALSE, times = 1)
drugs_train <- drugs[inTrain, ]
drugs_test <- drugs[-inTrain, ]
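Since createDataPartition samples within the levels of the outcome, the class balance should be roughly preserved in both parts. A quick sanity check:

# Outcome proportions should be similar in both parts, as the split
# is stratified by D_LSD.
prop.table(table(drugs_train$D_LSD))
prop.table(table(drugs_test$D_LSD))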
In order to build a set of prediction models it is helpful to follow the caret workflow and first decide how to conduct model tuning. Here we use 5-fold cross-validation, mainly to keep computation time to a minimum. caret offers many performance metrics; however, they are stored in different functions that need to be combined first.
evalStats <- function(...) c(twoClassSummary(...), defaultSummary(...), mnLogLoss(...))
?defaultSummary
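To see what this combined function returns, we can mimic the input that caret passes to a summary function: a data frame with obs and pred factors plus one probability column per class, along with the class levels. The small data frame below is made up purely for illustration.

# Toy input resembling what caret hands to a summary function
# (hypothetical values, for illustration only)
ex <- data.frame(
  obs    = factor(c("no_LSD", "LSD", "LSD"), levels = c("no_LSD", "LSD")),
  pred   = factor(c("no_LSD", "LSD", "no_LSD"), levels = c("no_LSD", "LSD")),
  no_LSD = c(0.8, 0.3, 0.6),
  LSD    = c(0.2, 0.7, 0.4)
)
evalStats(ex, lev = levels(ex$obs))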
Now we can specify the trainControl object.
ctrl <- trainControl(method = "cv",
                     number = 5,
                     summaryFunction = evalStats,
                     classProbs = TRUE)
As a first method we try out AdaBoost as implemented in the fastAdaboost package. Specifically, Adaboost.M1 will be used with three try-out values for the number of iterations.
grid <- expand.grid(nIter = c(50, 100, 150),
                    method = "Adaboost.M1")
Now we can pass this on to train, along with the specification of the model and the method, i.e. adaboost.
set.seed(744)
ada <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
               Openness + Agreeableness + Conscientiousness + Impulsive + SS,
             data = drugs_train,
             method = "adaboost",
             trControl = ctrl,
             tuneGrid = grid,
             metric = "logLoss")
List the results of the tuning process.
ada
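The selected number of iterations can also be accessed directly, since objects returned by train store the winning parameter combination in bestTune.

# Parameter combination chosen via cross-validated logLoss
ada$bestTune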
For Gradient Boosting as implemented by the gbm package, we have to take care of a number of tuning parameters. Here the function expand.grid is helpful, as it creates an object with all possible combinations of our try-out values.
grid <- expand.grid(interaction.depth = 1:3,
                    n.trees = c(500, 750, 1000),
                    shrinkage = c(0.05, 0.01),
                    n.minobsinnode = 10)
List the tuning grid...
grid
...and begin the tuning process.
set.seed(744)
gbm <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
               Openness + Agreeableness + Conscientiousness + Impulsive + SS,
             data = drugs_train,
             method = "gbm",
             trControl = ctrl,
             tuneGrid = grid,
             metric = "logLoss",
             distribution = "bernoulli",
             verbose = FALSE)
Instead of just printing the results from the tuning process, we can also plot them.
plot(gbm)
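If the plot is hard to read, the same information is available numerically: the results component of the train object holds the full resampling results, which we can sort by logLoss.

# Best rows of the tuning results, sorted by cross-validated logLoss
# (smaller is better), plus the selected combination
head(gbm$results[order(gbm$results$logLoss), ])
gbm$bestTune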
We can also extract single trees from the GBM ensemble; in the output of pretty.gbm.tree, SplitVar = -1 indicates a terminal node.
pretty.gbm.tree(gbm$finalModel, i.tree = 1)
pretty.gbm.tree(gbm$finalModel, i.tree = 2)
A quick look at feature importances.
plot(varImp(gbm), top = 15)
Here we add a single tree for comparison, using CART. With rpart2, maximum tree depth is the (only) tuning parameter.
grid <- expand.grid(maxdepth = 1:15)
Run the tuning process with train().
set.seed(744)
cart <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
                Openness + Agreeableness + Conscientiousness + Impulsive + SS,
              data = drugs_train,
              method = "rpart2",
              trControl = ctrl,
              tuneGrid = grid,
              metric = "logLoss")
Plot and print the results.
plot(cart)
cart
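The selected tree itself can be drawn with the base plotting methods from rpart, since the fitted rpart object is stored in finalModel. A minimal sketch:

# Draw the final CART model; plot() and text() are the base rpart methods
plot(cart$finalModel, uniform = TRUE, margin = 0.1)
text(cart$finalModel, use.n = TRUE, cex = 0.7)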
Finally, we also add a logistic regression model. Here we have no tuning parameters.
set.seed(744)
logit <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
                 Openness + Agreeableness + Conscientiousness + Impulsive + SS,
               data = drugs_train,
               method = "glm",
               trControl = ctrl)
We may want to take a glimpse at the regression results.
summary(logit)
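The coefficients of the underlying glm are on the log-odds scale; exponentiating them yields odds ratios, which are often easier to interpret.

# Odds ratios from the fitted glm stored in finalModel
exp(coef(logit$finalModel))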
For evaluating performance, we predict the outcome in the test set in two formats: we use predict to get predicted class membership and predicted class probabilities from each model.
c_ada <- predict(ada, newdata = drugs_test)
c_gbm <- predict(gbm, newdata = drugs_test)
c_cart <- predict(cart, newdata = drugs_test)
c_logit <- predict(logit, newdata = drugs_test)

p_ada <- predict(ada, newdata = drugs_test, type = "prob")
p_gbm <- predict(gbm, newdata = drugs_test, type = "prob")
p_cart <- predict(cart, newdata = drugs_test, type = "prob")
p_logit <- predict(logit, newdata = drugs_test, type = "prob")
Given predicted class membership, we can use the function postResample in order to get a short summary of each model's performance in the test set.
postResample(pred = c_ada, obs = drugs_test$D_LSD)
postResample(pred = c_gbm, obs = drugs_test$D_LSD)
postResample(pred = c_cart, obs = drugs_test$D_LSD)
postResample(pred = c_logit, obs = drugs_test$D_LSD)
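For a more detailed breakdown, including sensitivity and specificity, caret's confusionMatrix function can be applied to the same predictions, shown here for the GBM as an example.

# Detailed test set evaluation, treating "LSD" as the positive class
confusionMatrix(data = c_gbm, reference = drugs_test$D_LSD, positive = "LSD")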
Creating ROC objects based on predicted probabilities...
ada_roc <- roc(drugs_test$D_LSD, p_ada$LSD)
gbm_roc <- roc(drugs_test$D_LSD, p_gbm$LSD)
cart_roc <- roc(drugs_test$D_LSD, p_cart$LSD)
logit_roc <- roc(drugs_test$D_LSD, p_logit$LSD)
...and plotting the ROC curves.
ggroc(list(Adaboost = ada_roc, GBM = gbm_roc, CART = cart_roc, Logit = logit_roc)) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
               color = "darkgrey", linetype = "dashed")
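Beyond the visual comparison, the area under each curve can be extracted with auc from pROC.

# Test set AUC for each model
auc(ada_roc)
auc(gbm_roc)
auc(cart_roc)
auc(logit_roc)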