OJ data set

The OJ data set from the ISLR package contains information on which of two brands of orange juice customers purchased^[The response variable is Purchase.] and can be loaded using
data(OJ, package = "ISLR")
\noindent After loading the caret and jrPred packages

library("caret")
library("jrPred")
\noindent make an initial examination of the relationships between each of the predictors and the response^[Use the plot function with a model formula or the pairs function.]

par(mfrow = c(4, 5), mar = c(4, 0.5, 0.5, 0.5))
plot(Purchase ~ ., data = OJ)
Fit a logistic regression model with Purchase as the response and PriceCH and PriceMM as the predictors. Hint: Use the train function, with method = 'glm'. Look at the help page for the data set to understand what these variables represent.

m1 = train(Purchase ~ PriceCH + PriceMM, data = OJ, method = "glm")
getTrainPerf(m1)
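\noindent To examine the estimated coefficients, note that caret stores the underlying glm fit in the finalModel component of the train object:

# inspect the underlying glm fit stored by caret
summary(m1$finalModel)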
# with no model we essentially predict according to
# proportion of observations in data
# work out proportions
probs = table(OJ$Purchase) / nrow(OJ)
# sample one prediction per observation, with replacement, using the proportions
preds = sample(levels(OJ$Purchase), size = nrow(OJ), replace = TRUE, prob = probs)
# work out proportion predicted incorrectly
mean(preds != OJ$Purchase)
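\noindent An even simpler baseline is to always predict the most common class; its accuracy is just the proportion of the majority class in the data:

# accuracy from always predicting the majority class
max(table(OJ$Purchase)) / nrow(OJ)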
predict(m1, newdata = data.frame(PriceCH = 2.3, PriceMM = 2.4))
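\noindent By default predict returns class labels. For classification models, caret's predict method also accepts type = "prob" if we want the estimated class probabilities instead:

# estimated probability of each class for the new observation
predict(m1, newdata = data.frame(PriceCH = 2.3, PriceMM = 2.4), type = "prob")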
The jrPred package contains the boundary_plot function. The following code produces a plot of the decision boundary as seen in figure 1.

boundary_plot(m1, OJ$PriceCH, OJ$PriceMM, OJ$Purchase, xlab = "Price CH", ylab = "Price MM")
\noindent Run the boundary plot code above, and make sure you get a similar plot.
# We now have a curved decision boundary.
# There are two regions where we would predict MM: one in the bottom left,
# and a tiny one up in the top right.
mLM = train(Purchase ~ ., data = OJ, method = "glm")
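\noindent With every predictor included, some coefficients cannot be estimated because those predictors are exact linear combinations of others; R reports them as NA. We can see this by inspecting the fitted model:

# aliased predictors show up as NA coefficients
coef(mLM$finalModel)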
Take the predictor PriceDiff. It is impossible to estimate its coefficient as it is a linear combination of other predictors: according to the help page, PriceDiff is the sale price of MM less the sale price of CH, i.e. PriceDiff = SalePriceMM - SalePriceCH. In this particular data set there are quite a few linear combinations. We can find them using the findLinearCombos() and model.matrix() functions
remove = findLinearCombos(model.matrix(Purchase ~ ., data = OJ))
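\noindent We can verify this relationship numerically; this check assumes the definition of PriceDiff given on the help page:

# confirm the linear dependence between the price variables
all.equal(OJ$PriceDiff, OJ$SalePriceMM - OJ$SalePriceCH)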
\noindent The output list has a component called remove suggesting which variables should be removed to get rid of the linear combinations
# remove$remove indexes columns of the model matrix; these line up with the
# columns of OJ here because the intercept takes the place of the response
# and the only factor, Store7, expands to a single dummy column
(badvar = colnames(OJ)[remove$remove])
We can then remove these variables from the data
OJsub = OJ[, -remove$remove]
Use the OJsub data set to model Purchase using all of the predictors. How accurate is the model?

mLM = train(Purchase ~ ., data = OJsub, method = "glm")
getTrainPerf(mLM)
## could use confusionMatrix
(cmLM = confusionMatrix(predict(mLM, OJsub), OJsub$Purchase))
# or
sensitivity(predict(mLM, OJsub), OJsub$Purchase)
specificity(predict(mLM, OJsub), OJsub$Purchase)
# The model is fairly good at picking up both positive events (person buys CH)
# and negative events (person buys MM).
Fit a k nearest neighbour model to the OJsub data. How does it compare with the logistic regression model? You could use the boundary_plot function to visualise the results.

mKNN = train(Purchase ~ ., data = OJsub, method = "knn")
cmKNN = confusionMatrix(predict(mKNN, OJsub), OJsub$Purchase)
(info = data.frame(Model = c("logistic", "knn"),
    Accuracy = c(cmLM$overall["Accuracy"], cmKNN$overall["Accuracy"]),
    Sensitivity = c(cmLM$byClass["Sensitivity"], cmKNN$byClass["Sensitivity"]),
    Specificity = c(cmLM$byClass["Specificity"], cmKNN$byClass["Specificity"])))
# Accuracy increases at first with k, before getting worse after a peak around k = 9.
(mKNN2 = train(Purchase ~ ., data = OJsub, method = "knn", tuneGrid = data.frame(k = 1:30)))
plot(mKNN2)
\noindent The KNN algorithm described in the notes can also be used for regression problems. In this case the predicted response is the mean of the $k$ nearest neighbours.
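\noindent As a minimal sketch of the idea (assuming a single predictor, Euclidean distance and no handling of ties), a KNN regression prediction can be written by hand:

# predict the response at x_new as the mean of the k nearest training responses
knn_reg = function(x_train, y_train, x_new, k = 5) {
  nearest = order(abs(x_train - x_new))[seq_len(k)]
  mean(y_train[nearest])
}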
Fit a KNN regression model to the cars2010 data set with FE as the response and compare it to a linear model. Tune the k parameter to try to find a better model.

library("jrPred")
data(FuelEconomy, package = "AppliedPredictiveModeling")
regKNN = train(FE ~ ., data = cars2010, method = "knn",
    tuneGrid = data.frame(k = c(1:5, 10, 20, 50, 100)))
regLM = train(FE ~ ., data = cars2010, method = "lm")
getTrainPerf(regKNN)
getTrainPerf(regLM)
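\noindent Because both models were trained with the same default resampling scheme, caret's resamples function can be used to compare their resampled performance side by side:

# collect and summarise resampling results from both models
res = resamples(list(knn = regKNN, lm = regLM))
summary(res)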
The Glass data set in the mlbench package is a data frame containing examples of the chemical analysis of $7$ different types of glass. The goal is to be able to predict which category glass falls into based on the values of the $9$ predictors.
data(Glass, package = "mlbench")
\noindent A logistic regression model is typically not suitable for more than $2$ classes, so try fitting a k nearest neighbour model. Use k-fold cross validation if you want to. What proportion of predictions does your model get correct?
tc = trainControl(method = "cv", number = 10)
model = train(Type ~ ., data = Glass, trControl = tc, method = "knn")
getTrainPerf(model)
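\noindent To see where the model goes wrong across the $7$ glass types, caret can also print the cross-validated confusion matrix, averaged over the folds:

# resampled confusion matrix (entries are percentages of held-out predictions)
confusionMatrix(model)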