The OJ data set

The `OJ` data set from the `ISLR` package contains information on which of two brands of orange juice customers purchased^[The response variable is `Purchase`.] and can be loaded using

```r
data(OJ, package = "ISLR")
```
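\noindent Before going further it can help to glance at the size and variable types of the data (a small sketch, not part of the original exercise):

```r
# quick overview of the data
dim(OJ)
str(OJ)
```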
\noindent After loading the `caret` and `jrPredictive` packages

```r
library("caret")
library("jrPredictive")
```
\noindent make an initial examination of the relationships between each of the predictors and the response^[Use the `plot` function with a model formula, or the `pairs` function.]

```r
op = par(mfrow = c(4, 5), mar = c(4, 0.5, 0.5, 0.5))
plot(Purchase ~ ., data = OJ)
par(op)
```
Try fitting a logistic regression model with two predictors, `PriceCH` and `PriceMM`. Hint: Use the `train` function, with `method = "glm"`. Look at the help page for the data set to understand what these variables represent.

```r
m1 = train(Purchase ~ PriceCH + PriceMM, data = OJ, method = "glm")
```
```r
mean(predict(m1) == OJ$Purchase)
```

```r
# with no model we essentially predict according to the
# proportion of observations in the data
probs = table(OJ$Purchase) / nrow(OJ)
preds = sample(levels(OJ$Purchase), size = nrow(OJ),
               replace = TRUE, prob = probs)
mean(preds == OJ$Purchase)
```
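\noindent A simpler deterministic baseline, not in the original solution, is to always predict the most common class; any useful model should beat this accuracy:

```r
# always predict the majority class
max(table(OJ$Purchase)) / nrow(OJ)
```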
The `jrPredictive` package contains a `boundary_plot` function; the following code produces a plot of the decision boundary, as seen in figure 1.

```r
boundary_plot(m1, OJ$PriceCH, OJ$PriceMM, OJ$Purchase,
              xlab = "Price CH", ylab = "Price MM")
```
\noindent Run the boundary code above, and make sure you get a similar plot.

```r
# We now have a curved decision boundary.
# There are two regions where we would predict MM: the bottom left,
# and a tiny one up in the top right.
```
Try fitting a logistic regression model using all of the predictors. Do you get any warning messages?

```r
m_log = train(Purchase ~ ., data = OJ, method = "glm")
```

```r
## YES!
```
\noindent We can view the most recent warning messages by using the `warnings` function

```r
warnings()
```
\noindent This suggests some rank-deficient fit problems; inspecting the fitted model shows which coefficients could not be estimated.

```r
m_log$finalModel
```
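\noindent To pick out just the offending terms, we can list the coefficients that came back as `NA` (a quick sketch using standard `glm` accessors):

```r
# coefficients that could not be estimated
names(which(is.na(coef(m_log$finalModel))))
```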
\noindent The help page

```r
?ISLR::OJ
```

\noindent gives further insight: the `PriceDiff` variable is a linear combination of `SalePriceMM` and `SalePriceCH`, so we should remove it. In addition, the `StoreID` and `STORE` variables are different encodings of the same information, so we should remove one of these too. We also have `DiscCH` and `DiscMM`, which are the differences between `PriceCH` and `SalePriceCH`, and between `PriceMM` and `SalePriceMM`, respectively, and `ListPriceDiff` is a linear combination of these prices. Removing all of these variables allows the model to be fit and all parameters to be estimated.^[This is to highlight that we need to understand what we have in our data.]
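We can check these relationships numerically before removing anything (a quick sketch, assuming the help-page definitions hold exactly):

```r
# verify the linear relationships described above
all.equal(OJ$PriceDiff, OJ$SalePriceMM - OJ$SalePriceCH)
all.equal(OJ$DiscCH, OJ$PriceCH - OJ$SalePriceCH)
all.equal(OJ$DiscMM, OJ$PriceMM - OJ$SalePriceMM)
```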
```r
OJsub = OJ[, !(colnames(OJ) %in% c("STORE", "SalePriceCH", "SalePriceMM",
                                   "PriceDiff", "ListPriceDiff", "Store7"))]
m_log = train(Purchase ~ ., data = OJsub, method = "glm")
```
\noindent The problem of linear combinations of predictors can be shown with this simple theoretical example. Suppose we have a response $y$ and three predictors $x_1$, $x_2$ and the linear combination $x_3 = (x_1 + x_2)$. On fitting a linear model we try to find estimates of the parameters in the equation
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 + x_2).$$
\noindent However we could just as easily rewrite this as
\begin{align}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 + x_2) \\
  &= \beta_0 + (\beta_1 + \beta_3) x_1 + (\beta_2 + \beta_3) x_2 \\
  &= \beta_0 + \beta_1^{\ast} x_1 + \beta_2^{\ast} x_2.
\end{align}
\noindent This leads to a rank-deficient model matrix: essentially, we can never find a unique value of $\beta_3$, because the predictors form a linear combination.
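A minimal simulated sketch of this (the variables here are invented for illustration): fitting a linear model with an exactly collinear predictor returns an `NA` coefficient, just as we saw with the `OJ` fit.

```r
set.seed(1)
x1 = rnorm(100)
x2 = rnorm(100)
x3 = x1 + x2                 # exact linear combination
y = 1 + 2 * x1 - x2 + rnorm(100)
coef(lm(y ~ x1 + x2 + x3))   # the x3 coefficient is NA
```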
We could achieve the same using the `caret` package function `findLinearCombos`. The function takes a model matrix as an argument. We can create such a matrix using the `model.matrix` function with our formula object

```r
remove = findLinearCombos(model.matrix(Purchase ~ ., data = OJ))
```
\noindent The output list has a component called `remove`, suggesting which variables should be removed to get rid of the linear combinations

```r
# the indices refer to columns of the model matrix, which here
# happen to line up with the columns of OJ
(badvar = colnames(OJ)[remove$remove])
OJsub = OJ[, -remove$remove]
```
```r
# the corrected model
mLM = train(Purchase ~ ., data = OJsub, method = "glm")
mean(predict(mLM, OJsub) == OJsub$Purchase)
```
```r
## could use confusionMatrix
(cmLM = confusionMatrix(predict(mLM, OJsub), OJsub$Purchase))
# or
sensitivity(predict(mLM, OJsub), OJsub$Purchase)
specificity(predict(mLM, OJsub), OJsub$Purchase)
# The model is fairly good at picking up both positive events,
# person buys CH, and negative events, MM.
```
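\noindent As a sanity check (a sketch, not part of the original solution), the same two quantities can be computed directly from a table of predictions, with `CH` treated as the positive class:

```r
tab = table(predicted = predict(mLM, OJsub), actual = OJsub$Purchase)
tab["CH", "CH"] / sum(tab[, "CH"])   # sensitivity: true CH called CH
tab["MM", "MM"] / sum(tab[, "MM"])   # specificity: true MM called MM
```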
If we were interested in the area under the ROC curve, we could retrain the model, passing the `twoClassSummary` function as the summary function of a train control object. Alternatively we can use the `pROC` package

```r
library("pROC")
```

\noindent This also allows us to view the ROC curve, via

```r
curve = roc(response = OJsub$Purchase,
            predictor = predict(m_log, type = "prob")[, "CH"])
## this makes CH the event of interest
plot(curve, legacy.axes = TRUE)
```
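\noindent The area under this curve can then be read off with pROC's `auc` function:

```r
auc(curve)
```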
Try fitting KNN, LDA and QDA models to the `OJsub` data and compare them with the logistic regression model. You could use the `boundary_plot` function to visualise the results.

```r
mKNN = train(Purchase ~ ., data = OJsub, method = "knn")
mLDA = train(Purchase ~ ., data = OJsub, method = "lda")
mQDA = train(Purchase ~ ., data = OJsub, method = "qda")
cmKNN = confusionMatrix(predict(mKNN, OJsub), OJsub$Purchase)
cmLDA = confusionMatrix(predict(mLDA, OJsub), OJsub$Purchase)
cmQDA = confusionMatrix(predict(mQDA, OJsub), OJsub$Purchase)
(info = data.frame(
  Model = c("logistic", "knn", "lda", "qda"),
  Accuracy = c(cmLM$overall["Accuracy"], cmKNN$overall["Accuracy"],
               cmLDA$overall["Accuracy"], cmQDA$overall["Accuracy"]),
  Sensitivity = c(cmLM$byClass["Sensitivity"], cmKNN$byClass["Sensitivity"],
                  cmLDA$byClass["Sensitivity"], cmQDA$byClass["Sensitivity"]),
  Specificity = c(cmLM$byClass["Specificity"], cmKNN$byClass["Specificity"],
                  cmLDA$byClass["Specificity"], cmQDA$byClass["Specificity"])))
```
How do they compare? How does varying the number of nearest neighbours in a KNN affect the model fit?

```r
# Accuracy increases at first with KNN, before getting worse
# after a peak value of k = 9.
(mKNN2 = train(Purchase ~ ., data = OJsub, method = "knn",
               tuneGrid = data.frame(k = 1:30)))
```
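\noindent The tuning profile is easier to see graphically; `train` objects have a `plot` method:

```r
plot(mKNN2)
```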
\noindent The KNN algorithm described in the notes can also be used for regression problems. In this case the predicted response is the mean of the $k$ nearest neighbours.
library("jrPredictive") data(FuelEconomy, package = "AppliedPredictiveModeling") regKNN = train(FE~., data = cars2010, method = "knn") regLM = train(FE~., data = cars2010, method = "lm")
The `Glass` data set in the `mlbench` package is a data frame containing examples of the chemical analysis of $7$ different types of glass. The goal is to be able to predict which category glass falls into based on the values of the $9$ predictors.

```r
data(Glass, package = "mlbench")
```
\noindent A logistic regression model is typically not suitable for more than $2$ classes, so try fitting the other models using a training set that consists of 90\% of the available data.
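One possible way to set this up (a sketch; `createDataPartition` is caret's stratified splitting helper, and the KNN model is just an example choice):

```r
set.seed(1)
# stratified 90/10 split on the response
trainIndex = createDataPartition(Glass$Type, p = 0.9, list = FALSE)
GlassTrain = Glass[trainIndex, ]
GlassTest = Glass[-trainIndex, ]
# e.g. a KNN model, assessed on the held-out 10%
mGlass = train(Type ~ ., data = GlassTrain, method = "knn")
mean(predict(mGlass, GlassTest) == GlassTest$Type)
```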
This section is intended for users who have a more in-depth background in R programming. Attendance at the Programming in R course should be adequate background.
So far we have only used default functions and metrics to compare the performance of models, but we are not restricted to these. For example, training a classification model is typically more difficult when there is an imbalance between the two classes in the training set. Models trained on such data typically have high specificity but poor sensitivity, or vice versa. Instead of training to maximise accuracy on the training set, we could optimise some other criterion, for example requiring sensitivity and specificity to be as close as possible to the perfect value $(1, 1)$.
To add our own function we need to mirror the structure of the summary functions already included in caret.^[We can view a function's code by typing its name without brackets.] The following code creates a new function that could be used to summarise a model
```r
fourStats = function(data, lev = NULL, model = NULL) {
  # This code will use the area under the ROC curve and the
  # sensitivity and specificity values from the built-in
  # twoClassSummary function
  out = twoClassSummary(data, lev = levels(data$obs), model = NULL)
  # The best possible model has a sensitivity of 1 and a
  # specificity of 1. How far are we from that value?
  coords = matrix(c(1, 1, out["Spec"], out["Sens"]),
                  ncol = 2, byrow = TRUE)
  # return the distance measure together with the output
  # from twoClassSummary
  c(Dist = dist(coords)[1], out)
}
```
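\noindent To see the structure such a function receives, we can call it on a small hand-made data frame (a sketch; the columns `obs`, `pred`, plus one probability column per class level, mimic what caret passes internally):

```r
toy = data.frame(
  obs  = factor(c("M", "R", "M", "R"), levels = c("M", "R")),
  pred = factor(c("M", "R", "R", "R"), levels = c("M", "R")),
  M = c(0.8, 0.3, 0.4, 0.1),
  R = c(0.2, 0.7, 0.6, 0.9))
fourStats(toy, lev = c("M", "R"))
```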
\noindent We could then use this in the `train` function

```r
data(Sonar, package = "mlbench")
mod = train(Class ~ ., data = Sonar, method = "knn",
            # Minimise the distance to the perfect model
            metric = "Dist", maximize = FALSE,
            tuneLength = 20,
            trControl = trainControl(method = "cv",
                                     classProbs = TRUE,
                                     summaryFunction = fourStats))
```
\noindent The `plot` function

```r
plot(mod)
```

\noindent will then show the profile of the resampling estimates of our chosen statistic against the tuning parameters, see figure 2.
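The chosen tuning parameter and its resampled statistics can also be inspected directly (a brief sketch using standard components of a `train` object):

```r
mod$bestTune
mod$results[, c("k", "Dist", "ROC", "Sens", "Spec")]
```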
The same idea works for regression; for example, here is a summary function returning the maximum absolute residual, tested with a PLS regression on the `cars2010` data.

```r
maxabsres = function(data, lev = NULL, model = NULL) {
  m = max(abs(data$obs - data$pred))
  return(c("Max" = m))
}
# Test with pls regression
tccustom = trainControl(method = "cv",
                        summaryFunction = maxabsres)
mPLScustom = train(FE ~ ., data = cars2010, method = "pls",
                   tuneGrid = data.frame(ncomp = 1:6),
                   trControl = tccustom,
                   metric = "Max", maximize = FALSE)
# success
# not to suggest this is a good choice of metric
```