Predictive Analytics: practical 3

The OJ data set

The OJ data set from the ISLR package contains information on which of two brands of orange juice customers purchased^[The response variable is Purchase.] and can be loaded using

data(OJ, package = "ISLR")
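\noindent Before fitting any models it is worth taking a quick look at the data; a minimal first look, using only base R, might be

# variable types and a preview of each column
str(OJ)
# how balanced are the two brands?
table(OJ$Purchase)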

\noindent After loading the caret and jrPred packages

library("caret")
library("jrPred")

\noindent make an initial examination of the relationship between each predictor and the response^[Use the plot function with a model formula or the pairs function.]

par(mfrow = c(4, 5), mar = c(4, 0.5, 0.5, 0.5))
plot(Purchase ~ ., data = OJ)

Initial model building using logistic regression

m1 = train(Purchase ~ PriceCH + PriceMM,
    data = OJ, method = "glm")
getTrainPerf(m1)
# with no model we essentially predict according to
# proportion of observations in data

# work out proportions
probs = table(OJ$Purchase)/nrow(OJ)
# sample a prediction for every observation using those proportions
preds = sample(levels(OJ$Purchase), size = nrow(OJ),
    replace = TRUE, prob = probs)
# proportion of predictions that are incorrect
mean(preds != OJ$Purchase)
predict(m1, newdata = data.frame(PriceCH = 2.3, PriceMM = 2.4))
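\noindent As well as a class label, the fitted model can return class probabilities; with caret this is requested through the type argument of predict()

predict(m1, newdata = data.frame(PriceCH = 2.3, PriceMM = 2.4),
    type = "prob")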

Visualising the boundary

The jrPred package contains a boundary_plot() function; the following code produces a plot of the decision boundary, as seen in figure 1.

boundary_plot(m1, OJ$PriceCH, OJ$PriceMM, OJ$Purchase,
              xlab = "Price CH", ylab = "Price MM")

\noindent Run the boundary code above, and make sure you get a similar plot.

# With two linear predictors the logistic regression decision boundary
# is a straight line in the (PriceCH, PriceMM) plane: we predict MM on
# one side of the line and CH on the other.

Using all of the predictors

mLM = train(Purchase ~ ., data = OJ, method = "glm")

Take the predictor PriceDiff. It is impossible to estimate its coefficient because it is a linear combination of other predictors (in the ISLR documentation, PriceDiff is the sale price of MM less the sale price of CH, i.e. PriceDiff = SalePriceMM - SalePriceCH). In this particular data set there are quite a few such linear combinations. We can find them using the findLinearCombos() and model.matrix() functions

remove = findLinearCombos(model.matrix(Purchase ~ ., data = OJ))
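\noindent As a quick sanity check (an addition, not part of the practical's original code), the model matrix has more columns than its rank, so some columns must be linear combinations of others

X = model.matrix(Purchase ~ ., data = OJ)
# rank < number of columns confirms the linear dependencies
c(columns = ncol(X), rank = qr(X)$rank)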

\noindent The output is a list with a component called remove, suggesting which variables should be removed to get rid of the linear combinations

# note: these indices refer to columns of the model matrix; they line up
# with the columns of OJ here because the intercept column takes the
# place of the response
(badvar = colnames(OJ)[remove$remove])

We can then remove these variables from the data

OJsub = OJ[, -remove$remove]
mLM = train(Purchase ~ ., data = OJsub, method = "glm")
getTrainPerf(mLM)
## could use confusionMatrix
(cmLM = confusionMatrix(predict(mLM, OJsub), OJsub$Purchase))
# or sensitivity and specificity separately
sensitivity(predict(mLM, OJsub), OJsub$Purchase)
specificity(predict(mLM, OJsub), OJsub$Purchase)
# The model is fairly good at picking up both positive events (a customer
# buying CH) and negative events (buying MM).
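\noindent To see where these numbers come from, we can compute them by hand from a cross-tabulation; note that caret treats the first factor level (here CH) as the positive class by default

tab = table(predicted = predict(mLM, OJsub), actual = OJsub$Purchase)
# sensitivity: proportion of actual CH purchases predicted correctly
tab["CH", "CH"] / sum(tab[, "CH"])
# specificity: proportion of actual MM purchases predicted correctly
tab["MM", "MM"] / sum(tab[, "MM"])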

K nearest neighbours

mKNN = train(Purchase ~ ., data = OJsub, method = "knn")
cmKNN = confusionMatrix(predict(mKNN, OJsub), OJsub$Purchase)
(info = data.frame(Model = c("logistic", "knn"),
    Accuracy = c(cmLM$overall["Accuracy"],
        cmKNN$overall["Accuracy"]),
    Sensitivity = c(cmLM$byClass["Sensitivity"],
        cmKNN$byClass["Sensitivity"]),
    Specificity = c(cmLM$byClass["Specificity"],
        cmKNN$byClass["Specificity"])))
# Accuracy increases with k at first, peaking around k = 9,
# before getting worse for larger k.
(mKNN2 = train(Purchase ~ ., data = OJsub, method = "knn",
    tuneGrid = data.frame(k = 1:30)))
plot(mKNN2)
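\noindent The resampled accuracy for each candidate k, and the value finally chosen, are stored on the fitted object

# accuracy for each value of k tried
mKNN2$results
# the value of k selected by train()
mKNN2$bestTune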

\noindent The KNN algorithm described in the notes can also be used for regression problems. In this case the predicted response is the mean of the $k$ nearest neighbours.
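To make this concrete, here is a minimal hand-rolled sketch of a single KNN regression prediction on toy data (illustrative only: knn_predict() is a made-up helper, not a function from caret or jrPred)

# toy data: one predictor x and a response y
x = c(1, 2, 3, 5, 8)
y = c(2, 3, 4, 6, 9)
knn_predict = function(x0, x, y, k) {
    # indices of the k training points closest to x0
    nearest = order(abs(x - x0))[1:k]
    # the prediction is the mean response of those neighbours
    mean(y[nearest])
}
knn_predict(4, x, y, k = 2) # mean of y at x = 3 and x = 5, i.e. 5

\noindent caret's knn method applies the same neighbour averaging across all predictors and tunes $k$ for us: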

library("jrPred")
data(FuelEconomy, package = "AppliedPredictiveModeling")
regKNN = train(FE ~ ., data = cars2010, method = "knn",
    tuneGrid = data.frame(k = c(1:5, 10, 20, 50, 100)))
regLM = train(FE ~ ., data = cars2010, method = "lm")
getTrainPerf(regKNN)
getTrainPerf(regLM)

An example with more than two classes

The Glass data set in the mlbench package is a data frame containing examples of the chemical analysis of $7$ different types of glass. The goal is to predict which of the $7$ types a glass sample falls into, based on the values of the $9$ predictors.

data(Glass, package = "mlbench")

\noindent A logistic regression model is typically not suitable for more than $2$ classes, so try fitting a k nearest neighbour model instead. Use k-fold cross-validation if you want to. What proportion of predictions does your model get correct?

tc = trainControl(method = "cv", number = 10)
model = train(Type ~ ., data = Glass, trControl = tc, method = "knn")
getTrainPerf(model)
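\noindent getTrainPerf() reports the cross-validated accuracy, which is exactly the proportion of predictions the model gets correct. For a per-class view, calling confusionMatrix() on the train object returns the resampled confusion matrix, with cells given as percentages of all held-out predictions

confusionMatrix(model)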

