Setup

library(learnr)
library(caret)
library(xgboost)
library(mboost)
library(ranger)
library(mlforsocialscience)

Data

In this notebook we (again) use the drug consumption data. The data contain records for 1885 respondents, with personality measurements (e.g. Big Five), level of education, age, gender, country of residence and ethnicity as features. In addition, the data include information on the usage of 18 drugs.

Source: https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29

data(drugs)

First, we build a dummy variable on LSD usage as our outcome of interest. Respondents in usage class CL0 (never used) are coded as no_LSD, all others as LSD.

drugs$D_LSD <- "LSD"
drugs$D_LSD[drugs$LSD == "CL0"] <- "no_LSD"
drugs$D_LSD <- as.factor(drugs$D_LSD)
summary(drugs$D_LSD)
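
The same recoding can also be written in one step. The following is a minimal sketch on a made-up vector of usage classes; lsd_class is a hypothetical stand-in for drugs$LSD, and the sketch assumes that CL0 marks respondents who never used the drug.

```r
# Hypothetical stand-in for drugs$LSD; the values are made up for illustration.
lsd_class <- c("CL0", "CL2", "CL0", "CL5", "CL1")

# Everyone outside class CL0 is coded as a user.
d_lsd <- factor(ifelse(lsd_class == "CL0", "no_LSD", "LSD"),
                levels = c("LSD", "no_LSD"))

table(d_lsd)
```

Setting the levels explicitly keeps LSD as the first factor level. This matters for the tuning steps, because caret's twoClassSummary treats the first level as the event of interest.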

Then we split the data into a training and a test part, using createDataPartition from caret.

set.seed(9453)
inTrain <- createDataPartition(drugs$D_LSD, 
                               p = .8, 
                               list = FALSE, 
                               times = 1)

drugs_train <- drugs[inTrain,]
drugs_test <- drugs[-inTrain,]
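
createDataPartition samples within each class of the outcome, so the class shares should be similar in both parts. As a rough check, here is a small sketch on a made-up outcome vector (toy_y and the 40/60 class split are assumptions for illustration, not values from the drugs data).

```r
library(caret)

# Made-up outcome with a 40/60 class split, for illustration only.
toy_y <- factor(c(rep("LSD", 40), rep("no_LSD", 60)))

set.seed(9453)
toy_idx <- createDataPartition(toy_y, p = .8, list = FALSE, times = 1)

# Class shares in the training and the test part.
train_share <- prop.table(table(toy_y[toy_idx]))
test_share  <- prop.table(table(toy_y[-toy_idx]))

round(rbind(train = train_share, test = test_share), 2)
```

With the real data, the same check would compare prop.table(table(drugs_train$D_LSD)) with prop.table(table(drugs_test$D_LSD)).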

XGBoost

We start with xgboost as our ML method, which is an efficient and competitive boosting implementation. We first specify the resampling method for tuning with trainControl().

ctrl  <- trainControl(method = "cv",
                      number = 5,
                      summaryFunction = twoClassSummary,
                      verboseIter = TRUE,
                      classProbs = TRUE)
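
With twoClassSummary, caret reports the area under the ROC curve (labelled ROC) together with sensitivity and specificity, which is why classProbs = TRUE is needed. To make the ROC value more concrete, the sketch below computes the area under the curve by hand from ranks. The helper auc_rank and the toy scores are hypothetical and are not part of caret; this is an illustration of the quantity, not caret's implementation.

```r
# Rank-based AUC: the share of (event, non-event) pairs in which the
# event receives the higher predicted score (ties count as one half).
auc_rank <- function(scores, is_event) {
  r <- rank(scores)
  n_pos <- sum(is_event)
  n_neg <- sum(!is_event)
  (sum(r[is_event]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities for four respondents.
toy_scores <- c(0.9, 0.4, 0.6, 0.1)
toy_event  <- c(TRUE, TRUE, FALSE, FALSE)

auc_rank(toy_scores, toy_event)
```

In this toy example three of the four (event, non-event) pairs are ordered correctly, so the function returns 0.75.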

XGBoost has many tuning parameters that need to be set (see ?xgboost). For this example, we vary max_depth, nrounds and eta and fix the remaining ones.

grid <- expand.grid(max_depth = c(1, 3, 5),
                    nrounds = c(500, 1000, 1500),
                    eta = c(0.05, 0.01, 0.005),
                    min_child_weight = 10,
                    subsample = 0.7,
                    gamma = 0,
                    colsample_bytree = 1)

Print the tuning grid.

grid
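
Before starting the tuning process it can be useful to check how large the grid is. The sketch below rebuilds the same grid under the name xgb_grid (a name chosen here to keep the example self-contained) and counts the candidate settings.

```r
# Same values as in the grid above, rebuilt so the example runs on its own.
xgb_grid <- expand.grid(max_depth = c(1, 3, 5),
                        nrounds = c(500, 1000, 1500),
                        eta = c(0.05, 0.01, 0.005),
                        min_child_weight = 10,
                        subsample = 0.7,
                        gamma = 0,
                        colsample_bytree = 1)

# Number of candidate parameter combinations (3 * 3 * 3).
n_candidates <- nrow(xgb_grid)

# With 5-fold CV, each candidate is evaluated on 5 held-out folds.
n_evaluations <- n_candidates * 5

c(candidates = n_candidates, evaluations = n_evaluations)
```

Only the three varied parameters contribute to the size of the grid; the fixed parameters add columns but no rows.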

The grid object can now be used in the train function to guide the tuning process.

set.seed(8303)
xgb <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
             Openness + Agreeableness + Conscientiousness + Impulsive + SS,
             data = drugs_train,
             method = "xgbTree",
             trControl = ctrl,
             tuneGrid = grid,
             metric = "ROC")

Plot the tuning results.

plot(xgb)

mboost

Here we use model-based boosting as an alternative boosting approach. It has considerably fewer tuning parameters than XGBoost; in fact, via caret we primarily have to take care of the number of boosting iterations.

grid <- expand.grid(mstop = c(50, 100, 150, 200, 250, 500),
                    prune = 'no')

Now we run the tuning process with train(), using glmboost as the prediction method.

set.seed(8303)
mboost <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion +
                Openness + Agreeableness + Conscientiousness + Impulsive + SS,
                data = drugs_train,
                method = "glmboost",
                trControl = ctrl,
                tuneGrid = grid,
                metric = "ROC")

Print the tuning results.

mboost

With mboost it is possible to access model "coefficients".

coef(mboost$finalModel)

Random Forest (and random search)

This section exemplifies the usage of random search (as opposed to grid search). In caret, random search is available as an option of the trainControl() function.

ctrl  <- trainControl(method = "cv",
                      number = 5,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE,
                      search = "random")

Note that we now don't set up a tuning grid, but specify the number of random picks of tuning parameter settings via tuneLength in train(). We use ranger as our ML method, which implements random forests and extremely randomized trees.

set.seed(8303)
ranger <- train(D_LSD ~ Age + Gender + Education + Neuroticism + Extraversion + 
                Openness + Agreeableness + Conscientiousness + Impulsive + SS,
                data = drugs_train,
                method = "ranger",
                trControl = ctrl,
                tuneLength = 10,
                metric = "ROC")

Print the tuning results (and show the randomly selected try-out values).

ranger
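
To illustrate the idea behind random search, the sketch below draws ten candidate settings at random instead of crossing fixed values in a grid. This is a hypothetical illustration and not caret's internal sampling code: the ranges, the object names and the assumption of ten available predictors are chosen here for demonstration. The three columns correspond to the parameters caret tunes for method = "ranger".

```r
set.seed(8303)

n_predictors <- 10  # assumption: number of predictor columns
n_draws <- 10       # plays the role of tuneLength

# Each row is one randomly drawn candidate setting.
candidates <- data.frame(
  mtry          = sample(seq_len(n_predictors), n_draws, replace = TRUE),
  min.node.size = sample(1:20, n_draws, replace = TRUE),
  splitrule     = sample(c("gini", "extratrees"), n_draws, replace = TRUE)
)

candidates
```

In contrast to a grid, the drawn values are not evenly spaced, and the number of candidates is set directly rather than resulting from the number of values per parameter.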

Comparison

This section shows some options for comparing the CV results of different methods. After running several models, we can use caret's resamples() function to gather the cross-validation results from all of them.

resamps <- resamples(list(XGBoost = xgb,
                          mboost = mboost,
                          RF = ranger))

This object can now be used to compare the models with respect to their performance, based on cross-validation in the training set.

summary(resamps)

We can also plot this information.

bwplot(resamps)

Another option is to compare the resampling distributions with scatterplots.

splom(resamps)
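
Pairwise differences between models can also be summarized with diff(). The sketch below is a self-contained illustration on simulated data from caret's twoClassSim(); the two toy models (a logistic regression and a single tree via rpart) and all object names are assumptions for demonstration and are not the models fitted above. Resetting the same seed before each train() call is intended to give both models the same folds.

```r
library(caret)

set.seed(1)
toy <- twoClassSim(200)  # simulated two-class data, outcome column "Class"

ctrl_toy <- trainControl(method = "cv",
                         number = 5,
                         summaryFunction = twoClassSummary,
                         classProbs = TRUE)

set.seed(2)
fit_glm <- train(Class ~ ., data = toy, method = "glm",
                 trControl = ctrl_toy, metric = "ROC")

set.seed(2)
fit_rpart <- train(Class ~ ., data = toy, method = "rpart",
                   trControl = ctrl_toy, metric = "ROC")

toy_resamps <- resamples(list(glm = fit_glm, rpart = fit_rpart))

# Fold-wise differences in ROC, Sens and Spec between the two models.
toy_diffs <- diff(toy_resamps)
summary(toy_diffs)
```

Applied to the objects in this notebook, the corresponding call would be summary(diff(resamps)).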


kimbrianj/mlforsocialscience documentation built on March 12, 2024, 12:07 a.m.