Vignette written in 2017, small update in 2022.

This is an example of how to use sparse Partial Least Squares (sPLS) for modelling. The model is implemented in the mixOmics package (http://mixomics.org/). That package ships with some cross-validation functionality, but it is not nearly as rich as what the caret package offers. Luckily, the model can be plugged into the caret ecosystem to take advantage of all those nice features.

This small utility package does just that. It can be installed directly from GitHub using devtools:
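
# One-time install from GitHub (requires the devtools package)
devtools::install_github("jonathanth/mixOmicsCaret")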

Let's look at how to use it.

Setup

First, we load some required packages.

library(mlbench)
data(Sonar)
library(doMC)
library(caret)
library(pROC)
library(mixOmics)
library(tidyverse)
library(mixOmicsCaret)  # This package
options(width = 90)

Then we register doMC for caret's built-in multithreading. A common choice is the number of threads minus one; on a quad-core machine with hyperthreading (a MacBook Pro i7), that's 7.

registerDoMC(cores = 7)
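
If you don't know the core count of your machine, you can also pick it programmatically with parallel::detectCores() (part of base R); this is just a convenience alternative to hard-coding 7:

# Use all but one core, leaving one free for the rest of the system
registerDoMC(cores = max(1, parallel::detectCores() - 1))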

Data

We start by examining the example data we'll be using. It's called Sonar and comes from the mlbench package.

head(Sonar)
Sonar$Class
dim(Sonar)

Alright, 60 predictors, 208 observations, 2 classes. Let's partition the data into a training set and a test set.

set.seed(123)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing  <- Sonar[-inTraining,]

Model and cross-validation

Now we will set up a cross-validation to see how many predictors we should include. First, let's set up the options: 5 repeats of 10-fold CV. We will keep the resamples and the predictions, and allow it to run in parallel. We will turn off verbose mode for now. We save this as a trainControl object, which is just a set of options for caret.

repCV10 <- trainControl(method = "repeatedcv", 
                        number = 10, 
                        repeats = 5, 
                        returnResamp = "all", 
                        savePredictions = "all", 
                        allowParallel = TRUE, 
                        verboseIter = FALSE)

Now for the CV loop itself. This will take some time. We have asked it to model Class M as a numeric 0/1, which is fine for spls, but caret throws a warning that we can safely ignore. We will try 1 to 5 components with a range of keepX values, i.e. the number of predictors allowed per component. Note the fixX argument, which is left empty for now. It allows us to manually fix the number of predictors for individual components.

set.seed(123)
SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training,
                  method = get_mixOmics_spls(),
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(ncomp = 1:5, 
                                         keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60), 
                                         keepY = 1),
                  trControl = repCV10,
                  fixX = c())

Let's inspect the results!

SonarPLS
ggplot(SonarPLS, metric = "RMSE")
ggplot(SonarPLS, metric = "Rsquared")

We can also manually extract the predictions, calculate AUCs per repeat, and plot them:

SonarPLS$pred %>%
  separate(Resample, c("Fold", "Rep")) %>%
  group_by(ncomp, keepX, Rep) %>%
  summarize(auc = pROC::auc(predictor = pred, obs, direction = "<") %>% as.numeric) %>%
  ggplot(aes(x = factor(keepX), y = auc)) +
    geom_boxplot(outlier.shape = NA) +
    geom_jitter(aes(color = factor(ncomp)), alpha = 0.2, show.legend = FALSE) +
    geom_hline(yintercept = 0.5) +
    facet_grid(. ~ ncomp) +
    scale_color_brewer(palette = "Set1", name = NULL) +
    theme_bw() + theme(strip.background = element_blank()) +
    xlab("Number of variables")

Looks like the first component saturates at 25 predictors. Let's use the fixX parameter to force component 1 to keep 25 predictors and let the following components try out all combinations. This lets us flexibly set the number of predictors for each component individually. Let's re-run the cross-validation with this option.

set.seed(123)
SonarPLS <- train(as.numeric(Class == "M") ~ ., data = training,
                  method = get_mixOmics_spls(),
                  preProc = c("center", "scale"),
                  tuneGrid = expand.grid(ncomp = 1:5, 
                                         keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60), 
                                         keepY = 1),
                  trControl = repCV10,
                  fixX = c(25))
ggplot(SonarPLS, metric = "Rsquared")

Alright, we can do a little better if we add a second component with 60 variables. But after that, the model starts overfitting: the CV error increases and the R-squared decreases.
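
If you want to lock in both components and double-check that further components only hurt, you can fix both values in fixX (assuming it accepts one entry per fixed component, as the single-value call above suggests); the object name SonarPLS2 below is just for illustration:

set.seed(123)
SonarPLS2 <- train(as.numeric(Class == "M") ~ ., data = training,
                   method = get_mixOmics_spls(),
                   preProc = c("center", "scale"),
                   tuneGrid = expand.grid(ncomp = 1:5, 
                                          keepX = c(1, 2, 4, 8, 16, 25, 40, 50, 60), 
                                          keepY = 1),
                   trControl = repCV10,
                   fixX = c(25, 60))  # component 1 fixed at 25 variables, component 2 at 60
ggplot(SonarPLS2, metric = "Rsquared")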

Evaluate the final model

Looks like we have found our final model! Let's retrieve the predicted values, which we can use for unbiased downstream analyses. We want an average across the repeats to make sure we didn't just get lucky with the fold divisions. Since we have a test set, we can also look at the test set performance of the final model, using the predict function on the SonarPLS object with the testing data frame as new data.

# Retrieve the predictions from the optimal model
bestSet <- SonarPLS %>% get_best_predictions %>% group_by(Rep) 
head(bestSet)
bestSet %>% 
  summarize(auc = pROC::auc(obs, pred, direction = "<") %>% as.numeric) %>% 
  arrange(auc)
savedPreds <- SonarPLS %>% get_best_predictions %>% filter(Rep == "Rep3")
head(savedPreds)

# Best CV predictions
cvauc <- auc(obs ~ pred, data = savedPreds, direction = "<")
qplot(ifelse(obs == 1, "M", "R"), pred, data = savedPreds, geom = c("boxplot", "jitter")) + 
  xlab("Training class") +
  ylab("Training predictions") +
  ggtitle(paste0("CV AUC = ", round(cvauc, 3)))

# Test set predictions
testPreds <- predict(SonarPLS, testing)
testauc <- auc(predictor = testPreds, testing$Class, direction = ">")
qplot(testing$Class, testPreds, geom = c("boxplot", "jitter")) + 
  xlab("Testing class") +
  ylab("Testing predictions") +
  ggtitle(paste0("Test AUC = ", round(cvauc, 3)))

We can also run e.g. a logistic regression on these predictions, or use them in other downstream analyses.

# Training (CV)
glm(obs ~ pred, data = savedPreds, family = binomial) %>% summary

# Test
glm(testing$Class == "M" ~ testPreds, family = binomial) %>% summary

As you can see, the CV performance tends to be quite close to the test set performance in this setup.
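
For a quick side-by-side of the two numbers, you can tabulate the AUCs computed above (just a convenience snippet reusing savedPreds and testPreds; tibble() comes from the tidyverse we loaded earlier):

# Cross-validated (median rep) AUC vs. held-out test set AUC
tibble(set = c("CV (Rep3)", "Test set"),
       AUC = c(as.numeric(auc(obs ~ pred, data = savedPreds, direction = "<")),
               as.numeric(auc(predictor = testPreds, testing$Class, direction = ">"))))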

Interpretation (Loadings)

Now we would like to get some information on the variables included in the model. Luckily we can easily extract the loadings from the model, either from the full training model or from our favourite CV run.

# Retrieve the loadings from the model
SonarPLS %>% get_loadings %>% head # Returns a data.frame for easy plotting

SonarPLS %>% get_loadings("CV", rep = "Rep3") %>% 
  ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) + 
  facet_wrap(~ comp, scales = "free_x") + 
  geom_errorbar() + 
  geom_bar(stat = "identity") + 
  ggtitle("CV, median rep (Rep3)")

SonarPLS %>% get_loadings("finalModel") %>% 
  ggplot(aes(var, loading, ymin = loading - sd, ymax = loading + sd)) + 
  facet_wrap(~ comp, scales = "free_x") + 
  geom_errorbar() + 
  geom_bar(stat = "identity") + 
  ggtitle("Full model")

Hope this proves useful for you! Read the documentation on these methods with ?get_mixOmics_spls, ?get_loadings and ?get_best_predictions.


