Predictive Analytics: practical 3

The OJ data set

The OJ data set from the ISLR package contains information on which of two brands of orange juice customers purchased^[The response variable is Purchase.] and can be loaded using

data(OJ, package = "ISLR")

\noindent After loading the caret and jrPredictive packages

library("caret")
library("jrPredictive")

\noindent make an initial examination of the relationships between each of the predictors and the response^[Use the plot function with a model formula or the pairs function.]

op = par(mfrow = c(4, 5), mar = c(4, 0.5, 0.5, 0.5))
plot(Purchase ~ ., data = OJ)
par(op)
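
\noindent Alternatively, the pairs function mentioned in the footnote can be used. A minimal sketch, restricted to a handful of the numeric predictors (the choice here is arbitrary), is

pairs(OJ[, c("PriceCH", "PriceMM", "DiscCH", "DiscMM")],
      col = as.numeric(OJ$Purchase))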

Initial model building

m1 = train(Purchase ~ PriceCH + PriceMM,
    data = OJ, method = "glm")
mean(predict(m1) == OJ$Purchase)
# with no model we essentially predict according to the
# proportions of the two classes in the data
probs = table(OJ$Purchase) / nrow(OJ)
preds = sample(levels(OJ$Purchase), size = nrow(OJ),
    replace = TRUE, prob = probs)
mean(preds == OJ$Purchase)

Visualising the boundary

The jrPredictive package contains the boundary_plot function. The following code produces a plot of the decision boundary, as seen in figure 1.

boundary_plot(m1, OJ$PriceCH, OJ$PriceMM, OJ$Purchase,
              xlab = "Price CH", ylab = "Price MM")

\noindent Run the boundary code above, and make sure you get a similar plot.

# We now have a curved decision boundary.
# There are two regions where we would predict MM: one in the
# bottom left and a tiny one in the top right.

Using all of the predictors

m_log = train(Purchase ~ ., data = OJ, method = "glm")
## Yes! Fitting this model generates warning messages.

\noindent We can view the most recent warning messages by using the warnings function

warnings()

\noindent This suggests some rank-deficient fit problems. We can inspect the final model to see which coefficients could not be estimated

m_log$finalModel

\noindent The help page

?ISLR::OJ

\noindent gives further insight: the PriceDiff variable is a linear combination of SalePriceMM and SalePriceCH, so we should remove it. In addition, the StoreID and STORE variables are different encodings of the same information, so we should remove one of these too. We also have DiscCH and DiscMM, which are the differences between PriceCH and SalePriceCH and between PriceMM and SalePriceMM respectively, and ListPriceDiff is a linear combination of these prices. Removing all of these variables allows the model to be fitted and all parameters to be estimated.^[This is to highlight that we need to understand what we have in our data.]

OJsub = OJ[, !(colnames(OJ) %in% c("STORE", "SalePriceCH",
    "SalePriceMM", "PriceDiff", "ListPriceDiff", "Store7"))]
m_log = train(Purchase ~ ., data = OJsub, method = "glm")

\noindent The problem of linear combinations of predictors can be shown with this simple theoretical example. Suppose we have a response $y$ and three predictors $x_1$, $x_2$ and the linear combination $x_3 = (x_1 + x_2)$. On fitting a linear model we try to find estimates of the parameters in the equation

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 + x_2).$$

\noindent However we could just as easily rewrite this as

\begin{align}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 + x_2) \\
  &= \beta_0 + (\beta_1 + \beta_3) x_1 + (\beta_2 + \beta_3) x_2 \\
  &= \beta_0 + \beta_1^{\ast} x_1 + \beta_2^{\ast} x_2.
\end{align}

\noindent This leads to a rank-deficient model matrix: we can never find the value of $\beta_3$, because we only ever observe the linear combination of the predictors.
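
We can see this in R with a small simulated example (a sketch; the data and variable names are made up for illustration). The coefficient of the redundant predictor is reported as NA.

set.seed(1)
x1 = rnorm(20)
x2 = rnorm(20)
x3 = x1 + x2 # a linear combination of x1 and x2
y = 1 + x1 + 2 * x2 + rnorm(20)
# lm() drops the aliased predictor; its coefficient is NA
coef(lm(y ~ x1 + x2 + x3))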

We could achieve the same result using the caret function findLinearCombos. The function takes a model matrix as its argument, which we can create using the model.matrix function with our formula object

remove = findLinearCombos(model.matrix(Purchase ~ ., data = OJ))

\noindent The output is a list with a component called remove, suggesting which columns should be removed to eliminate the linear combinations

(badvar = colnames(OJ)[remove$remove])
OJsub = OJ[, -remove$remove]
# the corrected model
mLM = train(Purchase ~ ., data = OJsub, method = "glm")
mean(predict(mLM, OJsub) == OJsub$Purchase)
## could use confusionMatrix
(cmLM = confusionMatrix(predict(mLM, OJsub), OJsub$Purchase))
# or
sensitivity(predict(mLM, OJsub), OJsub$Purchase)
specificity(predict(mLM, OJsub), OJsub$Purchase)
# The model is fairly good at picking up both positive events
# (a person buys CH) and negative events (MM).

ROC curves

If we were interested in the area under the ROC curve, we could retrain the model with the twoClassSummary function passed as the summaryFunction argument of a train control object.
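
A minimal sketch of this approach (reusing the OJsub data frame from above; metric = "ROC" tells train which statistic to optimise) is

tc = trainControl(method = "cv", classProbs = TRUE,
                  summaryFunction = twoClassSummary)
m_roc = train(Purchase ~ ., data = OJsub, method = "glm",
              trControl = tc, metric = "ROC")

\noindent Alternatively we can use the pROC package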

library("pROC")

\noindent This also allows us to view the ROC curve, via

curve = roc(response = OJsub$Purchase,
  predictor = predict(m_log, type = "prob")[, "CH"])
## this makes CH the event of interest
plot(curve, legacy.axes = TRUE)
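
\noindent If we only need the area under the curve, the pROC package also provides the auc function

auc(curve)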

Other classification models

mKNN = train(Purchase ~ ., data = OJsub, method = "knn")
mLDA = train(Purchase ~ ., data = OJsub, method = "lda")
mQDA = train(Purchase ~ ., data = OJsub, method = "qda")
cmKNN = confusionMatrix(predict(mKNN, OJsub), OJsub$Purchase)
cmLDA = confusionMatrix(predict(mLDA, OJsub), OJsub$Purchase)
cmQDA = confusionMatrix(predict(mQDA, OJsub), OJsub$Purchase)
(info = data.frame(Model = c("logistic", "knn", "lda", "qda"),
           Accuracy = c(cmLM$overall["Accuracy"],
               cmKNN$overall["Accuracy"],
               cmLDA$overall["Accuracy"],
               cmQDA$overall["Accuracy"]),
           Sensitivity = c(cmLM$byClass["Sensitivity"],
               cmKNN$byClass["Sensitivity"],
               cmLDA$byClass["Sensitivity"],
               cmQDA$byClass["Sensitivity"]),
           Specificity = c(cmLM$byClass["Specificity"],
               cmKNN$byClass["Specificity"],
               cmLDA$byClass["Specificity"],
               cmQDA$byClass["Specificity"])))
# Accuracy increases at first with k before getting worse
# after a peak at k = 9.
(mKNN2 = train(Purchase ~ ., data = OJsub, method = "knn",
    tuneGrid = data.frame(k = 1:30)))
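
\noindent Plotting the train object shows this accuracy profile over the values of $k$, giving a quick check of the claim above

plot(mKNN2)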

\noindent The KNN algorithm described in the notes can also be used for regression problems. In this case the predicted response is the mean of the $k$ nearest neighbours.

library("jrPredictive")
data(FuelEconomy, package = "AppliedPredictiveModeling")
regKNN = train(FE ~ ., data = cars2010, method = "knn")
regLM = train(FE ~ ., data = cars2010, method = "lm")
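
\noindent As a quick comparison (the exact numbers will vary with the random resamples), we can look at the best resampling estimate of RMSE stored in each train object

min(regKNN$results$RMSE)
min(regLM$results$RMSE)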

An example with more than two classes

The Glass data set in the mlbench package is a data frame containing examples of the chemical analysis of $7$ different types of glass. The goal is to be able to predict which category glass falls into based on the values of the $9$ predictors.

data(Glass, package = "mlbench")

\noindent A logistic regression model is typically not suitable for more than $2$ classes, so try fitting the other models using a training set that consists of 90\% of the available data.
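
One possible sketch (the seed and the choice of a KNN model are arbitrary) uses the createDataPartition function from caret to build the training set

set.seed(1)
inTrain = createDataPartition(Glass$Type, p = 0.9, list = FALSE)
training = Glass[inTrain, ]
testing = Glass[-inTrain, ]
# fit, for example, a KNN model and assess it on the held back 10%
mGlassKNN = train(Type ~ ., data = training, method = "knn")
mean(predict(mGlassKNN, testing) == testing$Type)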

Advanced

This section is intended for users with a more in-depth background in R programming. Attendance at the Programming in R course should be adequate background.

So far we have only used default functions and metrics to compare the performance of models; however, we are not restricted to these. For example, training a classification model is typically more difficult when there is an imbalance between the two classes in the training set. Models trained from such data typically have high specificity but poor sensitivity, or vice versa. Instead of training to maximise accuracy, we could train according to some other criterion, such as sensitivity and specificity being as close as possible to the perfect values of $(1, 1)$.

To add our own function we need to mirror the structure of those already included in caret (a function's code can be viewed by typing its name without parentheses). The following code creates a new function that could be used to summarise a model

fourStats = function(data, lev = NULL, model = NULL) {
    # This code will use the area under the ROC curve and the
    # sensitivity and specificity values from the built-in
    # twoClassSummary function
    out = twoClassSummary(data, lev = levels(data$obs), model = NULL)
    # The best possible model has sensitivity of 1 and
    # specificity of 1. How far are we from that value?
    coords = matrix(c(1, 1, out["Spec"], out["Sens"]), ncol = 2,
        byrow = TRUE)
    # return the distance measure together with the output
    # from two class summary
    c(Dist = dist(coords)[1], out)
}

\noindent We could then use this in the train function

data(Sonar, package = "mlbench")
mod = train(Class ~ ., data = Sonar,
              method = "knn",
              # Minimize the distance to the perfect model
              metric = "Dist",
              maximize = FALSE,
              tuneLength = 20,
              trControl = trainControl(method = "cv",
                  classProbs = TRUE,
                  summaryFunction = fourStats))

\noindent The plot function

plot(mod)

\noindent will then show the profile of the resampling estimates of our chosen statistic against the tuning parameters, see figure 2.

The same idea works for regression problems. As an example (not to suggest this is a good choice of metric), we could judge a model by its maximum absolute residual

maxabsres = function(data, lev = NULL, model = NULL) {
    # return the largest absolute residual as a named statistic
    m = max(abs(data$obs - data$pred))
    return(c("Max" = m))
}
# Test with pls regression on the cars2010 data from earlier
tccustom = trainControl(method = "cv",
                        summaryFunction = maxabsres)
mPLScustom = train(FE ~ ., data = cars2010,
                   method = "pls",
                   tuneGrid = data.frame(ncomp = 1:6),
                   trControl = tccustom,
                   metric = "Max", maximize = FALSE)
# The model trains successfully with our custom metric.

