Predictive Analytics: practical 3

The OJ data set

The OJ data set from the ISLR package contains information on which of two brands of orange juice customers purchased^[The response variable is Purchase.] and can be loaded using

data(OJ, package = "ISLR")
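\noindent Before fitting any models it is worth taking a quick look at the data; a minimal first look, using only base R, might be

# variable types and a preview of each column
str(OJ)
# how balanced are the two brands?
table(OJ$Purchase)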

\noindent After loading the caret and jrPred packages

library("caret")
library("jrPred")

\noindent make an initial examination of the relationship between each predictor and the response^[Use the plot function with a model formula or the pairs function.]

par(mfrow = c(4, 5), mar = c(4, 0.5, 0.5, 0.5))
plot(Purchase ~ ., data = OJ)

Initial model building using logistic regression

m1 = train(Purchase ~ PriceCH + PriceMM,
    data = OJ, method = "glm")
getTrainPerf(m1)
# with no model we essentially predict according to
# proportion of observations in data

# work out proportions
probs = table(OJ$Purchase)/nrow(OJ)
# sample a prediction for every observation using those proportions
preds = sample(levels(OJ$Purchase), size = nrow(OJ),
    replace = TRUE, prob = probs)
# proportion of predictions that are incorrect
mean(preds != OJ$Purchase)
predict(m1, newdata = data.frame(PriceCH = 2.3, PriceMM = 2.4))
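\noindent As well as a class label, the fitted model can return class probabilities; with caret this is requested through the type argument of predict()

predict(m1, newdata = data.frame(PriceCH = 2.3, PriceMM = 2.4),
    type = "prob")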

Visualising the boundary

The jrPred package contains a boundary_plot() function; the following code produces a plot of the decision boundary, as seen in figure 1.

boundary_plot(m1, OJ$PriceCH, OJ$PriceMM, OJ$Purchase,
              xlab = "Price CH", ylab = "Price MM")

\noindent Run the boundary code above, and make sure you get a similar plot.

# With two linear predictors the logistic regression decision boundary
# is a straight line in the (PriceCH, PriceMM) plane: we predict MM on
# one side of the line and CH on the other.

Using all of the predictors

mLM = train(Purchase ~ ., data = OJ, method = "glm")

Take the predictor PriceDiff. It is impossible to estimate its coefficient because it is a linear combination of other predictors (in the ISLR documentation, PriceDiff is the sale price of MM less the sale price of CH, i.e. PriceDiff = SalePriceMM - SalePriceCH). In this particular data set there are quite a few such linear combinations. We can find them using the findLinearCombos() and model.matrix() functions

remove = findLinearCombos(model.matrix(Purchase ~ ., data = OJ))
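\noindent As a quick sanity check (an addition, not part of the practical's original code), the model matrix has more columns than its rank, so some columns must be linear combinations of others

X = model.matrix(Purchase ~ ., data = OJ)
# rank < number of columns confirms the linear dependencies
c(columns = ncol(X), rank = qr(X)$rank)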

\noindent The output is a list with a component called remove, suggesting which variables should be removed to get rid of the linear combinations

# note: these indices refer to columns of the model matrix; they line up
# with the columns of OJ here because the intercept column takes the
# place of the response
(badvar = colnames(OJ)[remove$remove])

We can then remove these variables from the data

OJsub = OJ[, -remove$remove]
mLM = train(Purchase ~ ., data = OJsub, method = "glm")
getTrainPerf(mLM)
## could use confusionMatrix
(cmLM = confusionMatrix(predict(mLM, OJsub), OJsub$Purchase))
# or sensitivity and specificity separately
sensitivity(predict(mLM, OJsub), OJsub$Purchase)
specificity(predict(mLM, OJsub), OJsub$Purchase)
# The model is fairly good at picking up both positive events (a customer
# buying CH) and negative events (buying MM).
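\noindent To see where these numbers come from, we can compute them by hand from a cross-tabulation; note that caret treats the first factor level (here CH) as the positive class by default

tab = table(predicted = predict(mLM, OJsub), actual = OJsub$Purchase)
# sensitivity: proportion of actual CH purchases predicted correctly
tab["CH", "CH"] / sum(tab[, "CH"])
# specificity: proportion of actual MM purchases predicted correctly
tab["MM", "MM"] / sum(tab[, "MM"])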

K nearest neighbours

mKNN = train(Purchase ~ ., data = OJsub, method = "knn")
cmKNN = confusionMatrix(predict(mKNN, OJsub), OJsub$Purchase)
(info = data.frame(Model = c("logistic", "knn"),
    Accuracy = c(cmLM$overall["Accuracy"],
        cmKNN$overall["Accuracy"]),
    Sensitivity = c(cmLM$byClass["Sensitivity"],
        cmKNN$byClass["Sensitivity"]),
    Specificity = c(cmLM$byClass["Specificity"],
        cmKNN$byClass["Specificity"])))
# Accuracy increases with k at first, peaking around k = 9,
# before getting worse for larger k.
(mKNN2 = train(Purchase ~ ., data = OJsub, method = "knn",
    tuneGrid = data.frame(k = 1:30)))
plot(mKNN2)
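\noindent The resampled accuracy for each candidate k, and the value finally chosen, are stored on the fitted object

# accuracy for each value of k tried
mKNN2$results
# the value of k selected by train()
mKNN2$bestTune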

\noindent The KNN algorithm described in the notes can also be used for regression problems. In this case the predicted response is the mean of the $k$ nearest neighbours.
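To make this concrete, here is a minimal hand-rolled sketch of a single KNN regression prediction on toy data (illustrative only: knn_predict() is a made-up helper, not a function from caret or jrPred)

# toy data: one predictor x and a response y
x = c(1, 2, 3, 5, 8)
y = c(2, 3, 4, 6, 9)
knn_predict = function(x0, x, y, k) {
    # indices of the k training points closest to x0
    nearest = order(abs(x - x0))[1:k]
    # the prediction is the mean response of those neighbours
    mean(y[nearest])
}
knn_predict(4, x, y, k = 2) # mean of y at x = 3 and x = 5, i.e. 5

\noindent caret's knn method applies the same neighbour averaging across all predictors and tunes $k$ for us: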

library("jrPred")
data(FuelEconomy, package = "AppliedPredictiveModeling")
regKNN = train(FE ~ ., data = cars2010, method = "knn",
    tuneGrid = data.frame(k = c(1:5, 10, 20, 50, 100)))
regLM = train(FE ~ ., data = cars2010, method = "lm")
getTrainPerf(regKNN)
getTrainPerf(regLM)

An example with more than two classes

The Glass data set in the mlbench package is a data frame containing examples of the chemical analysis of $7$ different types of glass. The goal is to predict which of the $7$ types a glass sample falls into, based on the values of the $9$ predictors.

data(Glass, package = "mlbench")

\noindent A logistic regression model is typically not suitable for more than $2$ classes, so try fitting a k nearest neighbour model instead. Use k-fold cross-validation if you want to. What proportion of predictions does your model get correct?

tc = trainControl(method = "cv", number = 10)
model = train(Type ~ ., data = Glass, trControl = tc, method = "knn")
getTrainPerf(model)
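\noindent getTrainPerf() reports the cross-validated accuracy, which is exactly the proportion of predictions the model gets correct. For a per-class view, calling confusionMatrix() on the train object returns the resampled confusion matrix, with cells given as percentages of all held-out predictions

confusionMatrix(model)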

