library(knitr)
knitr::opts_chunk$set(
  comment = "#>",
  error = FALSE,
  tidy = FALSE
)
The EWStools package is designed to automate many of the model building and
model checking tasks that are common in predictive analytics, specifically in the
education domain. Here we focus on the part of this suite built around ROC-based
classification. To do this, let's work with a simple simulated dataset we can
construct using the twoClassSim function in the caret package:
set.seed(442)
library(caret); library(MASS); library(pROC)
trainD <- twoClassSim(n = 5000, intercept = -8, linearVars = 3,
                      noiseVars = 10, corrVars = 4, corrValue = 0.6)
testD <- twoClassSim(n = 1500, intercept = -7, linearVars = 3,
                     noiseVars = 10, corrVars = 4, corrValue = 0.6)
Let's see what this produces:
head(trainD[, c(1:5, 20:23)])
table(trainD$Class)
Our training data has an imbalanced class structure and 22 predictors that are scaled and centered.
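To express that imbalance as proportions rather than raw counts, a quick base R check (not an EWStools function) is:

# Class proportions in the training data
prop.table(table(trainD$Class))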
A key thing to note is that the training and test data have exactly the same
variable names and scales:
names(trainD)
names(testD)
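If we want a programmatic confirmation rather than eyeballing the output, a simple base R check is:

# Confirm the training and test data share the same column names
identical(names(trainD), names(testD))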
Now let's build a model:
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fullModel <- train(Class ~ ., data = trainD,
                   method = "knn",
                   preProc = c("center", "scale"),
                   tuneLength = 5,
                   metric = "ROC",
                   trControl = ctrl)

fullModel
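Before moving on, it can help to see how the resampled ROC varied across the tuning grid for k; caret's standard plot method for train objects shows this directly:

# Resampled ROC across the tuning grid (standard caret plot method)
plot(fullModel)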
The standard output from the caret object is useful, but we want a more flexible
and robust classification testing suite built specifically around the ROC curve.
A key piece of an analysis is seeing how a model type performs on both training
and test data. EWStools makes this easy:
library(EWStools)

mod.out <- modTest("lda2", datatype = c("train", "test"),
                   traindata = list(preds = trainD[, -23], class = trainD[, 23]),
                   testdata = list(preds = testD[, -23], class = testD[, 23]),
                   modelKeep = FALSE, length = 5, fitControl = ctrl,
                   metric = "ROC")
The modTest function allows the user to evaluate any method available to caret's
train function on both a training and a test set of data, and it reports
a number of summary statistics for the model on each. Future extensions will
include support for validation data as well.
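The exact contents of the returned object depend on the options used; a quick way to inspect what modTest stored is:

# Inspect the structure of the modTest results
str(mod.out)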
While caret has made it very easy to build models, EWStools seeks to improve
on this by allowing multiple models to be built, evaluated, and compared with
a few simple function calls. For example, to compare multiple models (without
storing them), the user may run:
mod.out <- modSearch(methods = c("glm", "lda2"), datatype = c("train", "test"),
                     traindata = list(preds = trainD[, -23], class = trainD[, 23]),
                     testdata = list(preds = testD[, -23], class = testD[, 23]),
                     modelKeep = FALSE, length = 5, fitControl = ctrl,
                     metric = "ROC")
This produces a data frame of ROC statistics for each model, which allows us to construct ROC curves.
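A quick look at the first few rows shows how the results are organized:

# Peek at the modSearch output used for plotting below
head(mod.out)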
library(ggplot2)
ggplot(mod.out[mod.out$grp == "test", ],
       aes(x = 1 - spec, y = sens, group = method, color = method)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = c(.8, .2))
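If we want a single-number summary for each method, one option is a rough trapezoidal approximation of the area under each test-set curve, computed directly from the sens and spec columns used in the plot above (a base R sketch, assuming the rows for each method trace out a full curve):

# Approximate test-set AUC per method via the trapezoid rule (illustrative only)
testCurves <- mod.out[mod.out$grp == "test", ]
sapply(split(testCurves, testCurves$method), function(d) {
  d <- d[order(1 - d$spec), ]
  x <- 1 - d$spec
  y <- d$sens
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
})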
We can do even more with model fit evaluation.
test1 <- ROCtest(fullModel)
print(test1)
This tells us how the model fits the training data in a convenient, easy-to-interpret fashion, including the area under the curve. We can also use the same function to evaluate performance on the test data by supplying a named list of new data:
test2 <- ROCtest(fullModel, testdata = list(preds = testD[, -23],
                                            class = testD[, "Class"]))
print(test2)
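As a sanity check (not required by EWStools), we could also compute the test-set ROC directly with pROC, which is already loaded, using caret's predicted class probabilities:

# Manual cross-check of test-set performance with pROC
testProbs <- predict(fullModel, newdata = testD[, -23], type = "prob")
roc(testD$Class, testProbs[, "Class1"])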
As we can see, the model performs less effectively on the test data. But, what if we want to compare many models?
test3 <- modAcc(fullModel, datatype = c("train", "test"),
                testdata = list(preds = testD[, 1:22], class = testD[, 23]))
summary(dfExtract(test3))
Sometimes we want to compare more theoretically driven generalized linear models to
the machine learning algorithms fit with the caret package. While the caret package
can certainly produce models using a glm, we can also compare glm objects directly.
glmModel <- glm(Class ~ ., data = trainD, family = binomial)
testGLM <- ROCtest(glmModel)
print(testGLM)
And, we can use this to automate our process of inspecting the performance on test data as well:
testGLM2 <- ROCtest(glmModel, testdata = list(preds = testD[, -23],
                                              class = testD[, "Class"]))
print(testGLM2)
And we can extract results and generate data frames of the performance profiles of
glm objects just as we can with train objects:
testGLM3 <- modAcc(glmModel, datatype = c("train", "test"),
                   testdata = list(preds = testD[, 1:22], class = testD[, 23]))
summary(dfExtract(testGLM3))
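Because both profiles come from modAcc, one way to put the knn and glm results side by side is simply to bind the extracted data frames together (a small base R sketch, assuming the columns returned by dfExtract match across models):

# Combine the performance profiles of the caret knn model and the glm
knnProfile <- dfExtract(test3)
glmProfile <- dfExtract(testGLM3)
knnProfile$model <- "knn"
glmProfile$model <- "glm"
rbind(knnProfile, glmProfile)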