Vignette written in 2017, small update in 2022.
This is an example of how we can use sparse Partial Least Squares for modelling. The model is implemented in the mixOmics
package (http://mixomics.org/). The package has some cross-validation functionality but this is not nearly as rich as what the caret
package offers. But luckily, it can be added to the caret
ecosystem to take advantage of all these nice features.
This small utility package does just that and can be installed from github using devtools::install_github("jonathanth/mixOmicsCaret")
Let's look at how to use it.
First, we load some required packages.
library(mlbench) data(Sonar) library(doMC) library(caret) library(pROC) library(mixOmics) library(tidyverse) library(mixOmicsCaret) # This package options(width = 90)
Then we can register doMC for the built-in multithreading in caret
. A common option is number of threads minus one. On a quad-core with hyperthreading (Macbook Pro i7), that's 7.
registerDoMC(cores = 7)
We start by examining the dummy data we'll be using. It's called Sonar and comes from the mlbench
package.
head(Sonar) Sonar$Class dim(Sonar)
Alright, 60 predictors, 208 observations, 2 classes. Let's partition the data into a training set and a test set.
set.seed(123) inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE) training <- Sonar[ inTraining,] testing <- Sonar[-inTraining,]
Now we will setup a cross validation to see how many predictors we should include. First, let's set up the options. It will run 5 repeats of 10-fold CV. We will keep the resamples and the predictions, and allow it to run in parallel. We will turn off verbose-mode for now.
We save this as a trainControl object, which is just a set of options for caret
.
repCV10 <- trainControl(method = "repeatedcv", number = 10, repeats = 5, returnResamp = "all", savePredictions = "all", allowParallel = T, verboseIter = F)
Now for the CV loop itself. This will take some time. We have asked it to model Class M as a numeric 0/1, which is fine for spls but caret
throws a warning we can safely ignore. We will try 1 to 5 components, with a range of keepX values. This is the number of predictors allowed per component. Note the fixX
argument, which is left empty for now. This allows us to manually specify number of predictors per component.
set.seed(123) SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training, method = get_mixOmics_spls(), preProc = c("center", "scale"), tuneGrid = expand.grid(ncomp = 1:5, keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60), keepY = 1), trControl = repCV10, fixX = c())
Let's inspect the results!
SonarPLS ggplot(SonarPLS, metric = "RMSE") ggplot(SonarPLS, metric = "Rsquared")
We can also manually extract the predictions, calculate AUC, and plot it:
SonarPLS$pred %>% separate(Resample, c("Fold", "Rep")) %>% group_by(ncomp, keepX, Rep) %>% summarize(auc = pROC::auc(predictor = pred, obs, direction = "<") %>% as.numeric) %>% ggplot(aes(x = factor(keepX), y = auc)) + geom_boxplot(outlier.shape = NA) + geom_jitter(aes(color = factor(ncomp)), alpha = 0.2, show.legend = FALSE) + geom_hline(yintercept = 0.5) + facet_grid(. ~ ncomp) + scale_color_brewer(palette = "Set1", name = NULL) + theme_bw() + theme(strip.background = element_blank()) + xlab("Number of variables")
Looks like the first component saturates at 25 predictors. Let's use the fixX
parameter to force comp 1 to have 25 predictors and let the following components try out all combinations. This allows us to flexibly set individual parameter numbers for each component. Lets re-run the cross-validation with this option.
set.seed(123) SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training, method = get_mixOmics_spls(), preProc = c("center", "scale"), tuneGrid = expand.grid(ncomp = 1:5, keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60), keepY = 1), trControl = repCV10, fixX = c(25)) ggplot(SonarPLS, metric = "Rsquared")
Alright, we can do a little bit better if we add another component with 60 variables. But after that, the model starts overfitting and the CV-error increases / Rsquared decreases.
Looks like we have found our final model! Lets retrieve the predicted values, that we can use for unbiased downstream analyses. We want an average across the repeats to ensure we didn't just get lucky with the fold divisions. Since we have a test set, we can also look at the test set performance of the final model, using the predict
function on the SonarPLS
object with the testing
dataframe as new data.
# Retrieve the predictions from the optimal model bestSet <- SonarPLS %>% get_best_predictions %>% group_by(Rep) head(bestSet) bestSet %>% summarize(auc = pROC::auc(obs, pred, direction = "<") %>% as.numeric) %>% arrange(auc) savedPreds <- SonarPLS %>% get_best_predictions %>% filter(Rep == "Rep3") head(savedPreds) # Best CV predictions cvauc <- auc(obs ~ pred, data = savedPreds, direction = "<") qplot(ifelse(obs == 1, "M", "R"), pred, data = savedPreds, geom = c("boxplot", "jitter")) + xlab("Training class") + ylab("Training predictions") + ggtitle(paste0("CV AUC = ", round(cvauc, 3))) # Test set predictions testPreds <- predict(SonarPLS, testing) cvauc <- auc(predictor = testPreds, testing$Class, direction = ">") qplot(testing$Class, testPreds, geom = c("boxplot", "jitter")) + xlab("Testing class") + ylab("Testing predictions") + ggtitle(paste0("Test AUC = ", round(cvauc, 3)))
We can also run e.g. a logistic regression with these predictions, or use them for other downstream stuff.
# Training (CV) glm(obs ~ pred, data = savedPreds, family = binomial) %>% summary # Test glm(testing$Class == "M" ~ testPreds, family = binomial) %>% summary
As you can see, the CV performance tends to be quite close to the test set performance in this setup.
Now we would like to get some information on the variables included in the model. Luckily we can easily extract the loadings from the model, either from the full training model or from our favourite CV run.
# Retrieve the predictions with the best SonarPLS %>% get_loadings %>% head # Returns a data.frame for easy plotting SonarPLS %>% get_loadings("CV", rep = "Rep3") %>% ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) + facet_wrap(~ comp, scales = "free_x") + geom_errorbar() + geom_bar(stat = "identity") + ggtitle("CV, median rep (Rep3)") SonarPLS %>% get_loadings("finalModel") %>% ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) + facet_wrap(~ comp, scales = "free_x") + geom_errorbar() + geom_bar(stat = "identity") + ggtitle("Full model")
Hope this proves useful for you! Read documentation on the methods with ?get_mixOmics_spls
, ?get_loadings
and ?get_best_predictions
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.