```r
library(knitr)
knitr::opts_chunk$set(
  comment = "#>",
  error = FALSE,
  tidy = FALSE
)
```
The `EWStools` package is designed to automate many of the model building and checking tests that are common in predictive analytics, specifically in the education domain. We will focus on the part of this toolkit built around the ROC classification procedure. To do this, let's work with a simple simulated dataset we can construct using the `twoClassSim` function in the `caret` package:
```r
set.seed(442)
library(caret); library(MASS); library(pROC)
trainD <- twoClassSim(n = 5000, intercept = -8, linearVars = 3,
                      noiseVars = 10, corrVars = 4, corrValue = 0.6)
testD <- twoClassSim(n = 1500, intercept = -7, linearVars = 3,
                     noiseVars = 10, corrVars = 4, corrValue = 0.6)
```
Let's see what this produces:
```r
head(trainD[, c(1:5, 20:23)])
table(trainD$Class)
```
Our training data has an imbalanced class structure and 22 predictors that are scaled and centered.
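To see just how imbalanced the classes are, we can look at the class proportions directly. This is a quick check, not part of the original workflow:

```r
# Proportion of each class in the training data; with intercept = -8,
# the minority class should make up only a small fraction of the sample
prop.table(table(trainD$Class))
```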
A key thing to note is that the train and test data have exactly the same variable names and scales:
```r
names(trainD)
names(testD)
```
Now let's build a model:
```r
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fullModel <- train(Class ~ ., data = trainD,
                   method = "knn",
                   preProc = c("center", "scale"),
                   tuneLength = 5,
                   metric = "ROC",
                   trControl = ctrl)
fullModel
```
The standard output from the `caret` object is nice, but we want a more flexible and robust classification testing suite built specifically around the ROC curve.
A key piece of an analysis is seeing how a model type performs on both training and test data. `EWStools` makes this easy:
```r
library(EWStools)
mod.out <- modTest("lda2", datatype = c("train", "test"),
                   traindata = list(preds = trainD[, -23], class = trainD[, 23]),
                   testdata = list(preds = testD[, -23], class = testD[, 23]),
                   modelKeep = FALSE, length = 5,
                   fitControl = ctrl, metric = "ROC")
```
The `modTest` function allows the user to evaluate a method defined for the `train` function in `caret` on both a training and a test set of data, and it reports a number of model summary statistics for the model on both datasets. Future extensions will include support for validation data as well.
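A quick way to see what `modTest` reports is to inspect the structure of the returned object; the exact components may vary by `EWStools` version, so this is purely exploratory:

```r
# Inspect the object returned by modTest; it carries summary
# statistics for both the training and the test data
str(mod.out)
```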
While `caret` has made it very easy to build models, `EWStools` seeks to improve on this by allowing multiple models to be built, evaluated, and compared with a few easy functions. For example, to compare multiple models (without storing them), the user may:
```r
mod.out <- modSearch(methods = c("glm", "lda2"), datatype = c("train", "test"),
                     traindata = list(preds = trainD[, -23], class = trainD[, 23]),
                     testdata = list(preds = testD[, -23], class = testD[, 23]),
                     modelKeep = FALSE, length = 5,
                     fitControl = ctrl, metric = "ROC")
```
This produces a data frame of ROC statistics for each model, which allows the construction of ROC curves:
```r
library(ggplot2)
ggplot(mod.out[mod.out$grp == "test", ],
       aes(x = 1 - spec, y = sens, group = method, color = method)) +
  geom_line() + theme_bw() +
  theme(legend.position = c(.8, .2))
```
We can do even more with model fit evaluation.
```r
test1 <- ROCtest(fullModel)
print(test1)
```
This tells us how the model fits the training data in a convenient and easy-to-interpret fashion, giving us a sense of the area under the curve. We can also use the same function to evaluate performance on the test data by supplying a named list of new data:
```r
test2 <- ROCtest(fullModel,
                 testdata = list(preds = testD[, -23], class = testD[, "Class"]))
print(test2)
```
As we can see, the model performs less effectively on the test data. But, what if we want to compare many models?
```r
test3 <- modAcc(fullModel, datatype = c("train", "test"),
                testdata = list(preds = testD[, 1:22], class = testD[, 23]))
summary(dfExtract(test3))
```
Sometimes we want to compare more theoretically driven generalized linear models to the machine learning algorithms fit with the `caret` package. While `caret` can certainly produce models using a glm, we can also compare `glm` objects directly.
```r
glmModel <- glm(Class ~ ., data = trainD, family = binomial)
testGLM <- ROCtest(glmModel)
print(testGLM)
```
And, we can use this to automate our process of inspecting the performance on test data as well:
```r
testGLM2 <- ROCtest(glmModel,
                    testdata = list(preds = testD[, -23], class = testD[, "Class"]))
print(testGLM2)
```
And we can extract results and generate data frames of the performance profiles of `glm` objects just as we can with `train` objects:
```r
testGLM3 <- modAcc(glmModel, datatype = c("train", "test"),
                   testdata = list(preds = testD[, 1:22], class = testD[, 23]))
summary(dfExtract(testGLM3))
```
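Since `dfExtract` turns each `modAcc` result into a data frame, a natural next step is to stack the knn and glm profiles for a side-by-side comparison. This is a sketch, and it assumes the two extracted data frames share the same columns:

```r
# Stack the performance profiles of the knn (test3) and glm (testGLM3)
# models; assumes dfExtract returns column-compatible data frames
compare <- rbind(dfExtract(test3), dfExtract(testGLM3))
compare
```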