library(learnr)
library(tidyverse)
library(magrittr)
library(titanic)
library(caret)
library(DMwR)
library(ranger)
library(party)
library(caretEnsemble)
library(SuperLearner)
library(pROC)
In this notebook we use the Titanic data. It includes information on a set of Titanic passengers, such as age, sex, ticket class and whether the passenger survived the Titanic tragedy. (Note that the titanic package also provides a separate test set that excludes the survival variable.)
Source: https://www.kaggle.com/c/titanic/data
titanic <- titanic_train
str(titanic)
We begin with some minor data preparations.
titanic$Survived <- as.factor(titanic$Survived)
levels(titanic$Survived) <- make.names(levels(factor(titanic$Survived)))
titanic %<>%
  select(Survived, Pclass, Sex, Age, Fare) %>%
  na.omit(.)
Next we split the data into a training and a test part. This can be done by stratified random sampling with createDataPartition().
set.seed(3225)
inTrain <- createDataPartition(titanic$Survived,
                               p = .8,
                               list = FALSE,
                               times = 1)
titanic_train <- titanic[inTrain,]
titanic_test <- titanic[-inTrain,]
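To confirm that the stratified split preserved the class distribution, we can compare the outcome proportions in both parts (a quick check added here, not part of the original code):

# Class proportions in the training and test part
prop.table(table(titanic_train$Survived))
prop.table(table(titanic_test$Survived))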
In this notebook, we consider a further refinement of the tuning process. Since we have imbalanced classes in our outcome, we use stratified cross-validation and therefore set up a cv-index using createFolds().
cvIndex <- createFolds(titanic_train$Survived, 5, returnTrain = TRUE)
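Since returnTrain = TRUE, each element of cvIndex holds the row indices used for training in one fold. As a quick check (added here), we can look at the class proportions within each fold; "X1" is the survivor level after make.names():

# Share of survivors among the training indices of each fold
sapply(cvIndex, function(idx) mean(titanic_train$Survived[idx] == "X1"))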
The cvIndex object can now be passed on to trainControl() to guide the tuning process in the next sections.
ctrl <- trainControl(method = "cv",
                     number = 5,
                     index = cvIndex,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     verboseIter = TRUE,
                     savePredictions = "final")
We start with random forests and extremely randomized trees. To specify reasonable values for mtry, model.matrix() comes in handy, as it creates dummy variables for all factors of a specified model.
cols <- ncol(model.matrix(Survived ~ ., data = titanic_train))
Here we only consider two try-out values to limit the time needed for model tuning. We specify the tree building methods via splitrule.
grid <- expand.grid(mtry = c(sqrt(cols), log(cols)),
                    splitrule = c("gini", "extratrees"),
                    min.node.size = 10)
grid
Start the tuning process. We are looking for the model with the maximum ROC-AUC.
rf <- train(Survived ~ .,
            data = titanic_train,
            method = "ranger",
            trControl = ctrl,
            tuneGrid = grid,
            metric = "ROC")
List the tuning results.
rf
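To see which hyperparameter combination was selected, we can additionally inspect the best tuning parameters and plot the tuning results; a minimal sketch using standard caret accessors:

# Best hyperparameter combination according to cross-validated ROC-AUC
rf$bestTune

# Plot ROC-AUC across the tuning grid
plot(rf)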
We may also want to consider a meta-model that is built on top of the predictions of lower level models. Here we consider the caretEnsemble package. For simplicity, we only use CTREE and a logistic regression as base methods. As a first step, the lower level models have to be built.
model_list <- caretList(Survived ~ .,
                        data = titanic_train,
                        trControl = ctrl,
                        metric = "ROC",
                        methodList = c("ctree", "glm"))
The resulting object includes the training and tuning results for both methods.
model_list
We can briefly check the similarity of the predictions across models.
as.data.frame(predict(model_list, newdata = head(titanic_train)))
modelCor(resamples(model_list))
Now we build a higher level model by using the predictions of the previous models as inputs in a logistic regression. This can be implemented with caretStack().
glm_ensemble <- caretStack(model_list,
                           method = "glm",
                           metric = "ROC",
                           trControl = trainControl(method = "cv",
                                                    number = 5,
                                                    savePredictions = "final",
                                                    classProbs = TRUE,
                                                    summaryFunction = twoClassSummary))
List the cross-validation results of the stacking ensemble.
glm_ensemble
We can also access the coefficients of the meta logit model via coef().
coef(glm_ensemble$ens_model$finalModel)
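Since the meta-model is a plain logistic regression, we can also exponentiate the coefficients to read them as odds ratios; a small additional sketch:

# Odds ratios of the meta logit model
exp(coef(glm_ensemble$ens_model$finalModel))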
As an alternative to caretEnsemble, the SuperLearner package can be used for building a meta-ensemble model. This package needs a slightly different data setup, i.e. X and y objects for the train and test data.
X_train <- titanic_train[which(names(titanic_train) != "Survived")]
y_train <- ifelse(titanic_train$Survived == "X1", 1, 0)
X_test <- titanic_test[which(names(titanic_test) != "Survived")]
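As a quick sanity check (added here), we can verify that the recoded 0/1 outcome lines up with the original factor levels:

# Cross-tabulate the original factor and the recoded outcome
table(titanic_train$Survived, y_train)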
The function listWrappers() lists the model types that we can use as individual learners.
listWrappers()
Here we choose a null model (SL.mean), elastic net (SL.glmnet) and random forests grown by ranger (SL.ranger) for building our Super Learner. Details on the construction of the higher level model can be found by calling ?method.template.
sl <- SuperLearner(Y = y_train,
                   X = X_train,
                   family = binomial(),
                   SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"))
sl
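If we want a base learner with non-default settings, the SuperLearner package also offers create.Learner() to generate a customized wrapper. A minimal sketch; the num.trees value is only an illustrative choice, not part of the workflow above:

# Generate a ranger wrapper with a custom number of trees (illustrative value)
ranger_custom <- create.Learner("SL.ranger", params = list(num.trees = 1000))

# The generated wrapper name could then be passed to SL.library
ranger_custom$names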
We can also run nested cross-validation to get CV estimates for the performance of the individual models and the Super Learner. Here we additionally include SL.xgboost as a candidate learner.
cv_sl <- CV.SuperLearner(Y = y_train,
                         X = X_train,
                         family = binomial(),
                         V = 5,
                         SL.library = c("SL.mean", "SL.glmnet", "SL.ranger", "SL.xgboost"))
cv_sl
summary(cv_sl)
Plot the nested CV results.
plot(cv_sl)
Finally, we create predicted risk scores for our outcome in the test set.
p_rf <- predict(rf, newdata = titanic_test, type = "prob")
p_ens <- predict(glm_ensemble, newdata = titanic_test, type = "prob")
p_sl <- predict(sl, X_test, onlySL = TRUE)
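Before building the ROC objects, it can help to glance at the prediction objects (a small check added here; the exact return format may differ across package versions):

# Predicted class probabilities from the random forest
head(p_rf)

# Predictions of the stacking ensemble and the Super Learner
head(p_ens)
head(p_sl$pred)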
Creating ROC objects based on predicted probabilities...
rf_roc <- roc(titanic_test$Survived, p_rf$X1)
ens_roc <- roc(titanic_test$Survived, p_ens)
sl_roc <- roc(titanic_test$Survived, p_sl$pred[, 1])
...and plotting the ROC curves.
ggroc(list(RF = rf_roc, Caret_Stack = ens_roc, SuperLearner = sl_roc)) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "darkgrey", linetype = "dashed")
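To complement the plot, we can also print the test set AUC values with pROC's auc():

# Test set AUC for each model
auc(rf_roc)
auc(ens_roc)
auc(sl_roc)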