tune.spls: Tuning functions for sPLS method

View source: R/tune.spls.R

tune.splsR Documentation

Tuning functions for sPLS method

Description

Computes M-fold or Leave-One-Out Cross-Validation scores on a user-input grid to determine optimal values for the parameters in spls.

Usage

tune.spls(
  X,
  Y,
  test.keepX = NULL,
  test.keepY = NULL,
  ncomp,
  mode = c("regression", "canonical", "classic"),
  scale = TRUE,
  logratio = "none",
  tol = 1e-09,
  max.iter = 100,
  near.zero.var = FALSE,
  multilevel = NULL,
  validation = c("Mfold", "loo"),
  nrepeat = 1,
  folds,
  measure = NULL,
  BPPARAM = SerialParam(),
  seed = NULL,
  progressBar = FALSE,
  ...
)

Arguments

X

numeric matrix of predictors with the rows as individual observations.

Y

numeric matrix of response(s) with the rows as individual observations matching X.

test.keepX

numeric vector for the different number of variables to test from the X data set.

test.keepY

numeric vector for the different number of variables to test from the Y data set. Default to ncol(Y).

ncomp

Positive Integer. The number of components to include in the model. Default to 2.

mode

Character string indicating the type of PLS algorithm to use. One of "regression", "canonical", "invariant" or "classic". See Details.

scale

Logical. If scale = TRUE, each block is standardized to zero means and unit variances (default: TRUE

logratio

Character, one of ('none','CLR') specifies the log ratio transformation to deal with compositional values that may arise from specific normalisation in sequencing data. Default to 'none'. See ?logratio.transfo for details.

tol

Positive numeric used as convergence criteria/tolerance during the iterative process. Default to 1e-06.

max.iter

Integer, the maximum number of iterations. Default to 100.

near.zero.var

Logical, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Setting this argument to FALSE (when appropriate) will speed up the computations. Default value is FALSE.

multilevel

Numeric, design matrix for repeated measurement analysis, where multilevel decomposition is required. For a one factor decomposition, the repeated measures on each individual, i.e. the individuals ID is input as the first column. For a 2 level factor decomposition then 2nd AND 3rd columns indicate those factors. See examples.

validation

character. What kind of (internal) validation to use, matching one of "Mfold" or "loo" (Leave-One-out). Default is "Mfold".

nrepeat

Positive integer. Number of times the Cross-Validation process should be repeated. nrepeat > 2 is required for robust tuning. See details.

folds

Positive Integer, The folds in the Mfold cross-validation.

measure

The tuning measure to use. Cannot be NULL when applied to sPLS1 object. See details.

BPPARAM

A BiocParallelParam object indicating the type of parallelisation. See examples in ?tune.spca.

seed

set a number here if you want the function to give reproducible outputs. Not recommended during exploratory analysis. Note if RNGseed is set in 'BPPARAM', this will be overwritten by 'seed'.

progressBar

Logical. If TRUE a progress bar is shown as the computation completes. Default to FALSE.

...

Optional parameters passed to spls

Details

This tuning function should be used to tune the parameters in the spls function (number of components and number of variables to select).

Value

If test.keepX != NULL and test.keepY != NULL returns a list that contains:

cor.pred

The correlation of predicted vs actual components from X (t) and Y (u) for each component

RSS.pred

The Residual Sum of Squares of predicted vs actual components from X (t) and Y (u) for each component

choice.keepX

returns the number of variables selected for X (optimal keepX) on each component.

choice.keepY

returns the number of variables selected for Y (optimal keepY) on each component.

choice.ncomp

returns the optimal number of components for the model fitted with $choice.keepX and $choice.keepY

call

The functioncal call including the parameteres used.

If test.keepX = NULL and test.keepY = NULL returns a list with the following components for every repeat:

MSEP

Mean Square Error Prediction for each Y variable, only applies to object inherited from "pls", and "spls". Only available when in regression (s)PLS.

RMSEP

Root Mean Square Error Prediction for each Y variable, only applies to object inherited from "pls", and "spls". Only available when in regression (s)PLS.

R2

a matrix of R^2 values of the Y-variables for models with 1, \ldots ,ncomp components, only applies to object inherited from "pls", and "spls". Only available when in regression (s)PLS.

Q2

if Y contains one variable, a vector of Q^2 values else a list with a matrix of Q^2 values for each Y-variable. Note that in the specific case of an sPLS model, it is better to have a look at the Q2.total criterion, only applies to object inherited from "pls", and "spls". Only available when in regression (s)PLS.

Q2.total

a vector of Q^2-total values for models with 1, \ldots ,ncomp components, only applies to object inherited from "pls", and "spls". Available in both (s)PLS modes.

RSS

Residual Sum of Squares across all selected features and the components.

PRESS

Predicted Residual Error Sum of Squares across all selected features and the components.

features

a list of features selected across the folds ($stable.X and $stable.Y) for the keepX and keepY parameters from the input object. Note, this will be NULL if using standard (non-sparse) PLS.

cor.tpred, cor.upred

Correlation between the predicted and actual components for X (t) and Y (u)

RSS.tpred, RSS.upred

Residual Sum of Squares between the predicted and actual components for X (t) and Y (u)

folds

During a cross-validation (CV), data are randomly split into M subgroups (folds). M-1 subgroups are then used to train submodels which would be used to predict prediction accuracy statistics for the held-out (test) data. All subgroups are used as the test data exactly once. If validation = "loo", leave-one-out CV is used where each group consists of exactly one sample and hence M == N where N is the number of samples.

nrepeat

The cross-validation process is repeated nrepeat times and the accuracy measures are averaged across repeats. If validation = "loo", the process does not need to be repeated as there is only one way to split N samples into N groups and hence nrepeat is forced to be 1.

measure

  • For PLS2 Two measures of accuracy are available: Correlation (cor, used as default), as well as the Residual Sum of Squares (RSS). For cor, the parameters which would maximise the correlation between the predicted and the actual components are chosen. The RSS measure tries to predict the held-out data by matrix reconstruction and seeks to minimise the error between actual and predicted values. For mode='canonical', The X matrix is used to calculate the RSS, while for others modes the Y matrix is used. This measure gives more weight to any large errors and is thus sensitive to outliers. It also intrinsically selects less number of features on the Y block compared to measure='cor'.

  • For PLS1 Four measures of accuracy are available: Mean Absolute Error (MAE), Mean Square Error (MSE, used as default), Bias and R2. Both MAE and MSE average the model prediction error. MAE measures the average magnitude of the errors without considering their direction. It is the average over the fold test samples of the absolute differences between the Y predictions and the actual Y observations. The MSE also measures the average magnitude of the error. Since the errors are squared before they are averaged, the MSE tends to give a relatively high weight to large errors. The Bias is the average of the differences between the Y predictions and the actual Y observations and the R2 is the correlation between the predictions and the observations.

Optimisation Process

The optimisation process is data-driven and similar to the process detailed in (Rohart et al., 2016), where one-sided t-tests assess whether there is a gain in performance when incrementing the number of features or components in the model. However, it will assess all the provided grid through pair-wise comparisons as the performance criteria do not always change linearly with respect to the added number of features or components.

more

See also ?perf for more details.

Author(s)

Kim-Anh Lê Cao, Al J Abadi, Benoit Gautier, Francois Bartolo, Florian Rohart,

References

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

PLS and PLS citeria for PLS regression: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.

Chavent, Marie and Patouille, Brigitte (2003). Calcul des coefficients de regression et du PRESS en regression PLS1. Modulad n, 30 1-11. (this is the formula we use to calculate the Q2 in perf.pls and perf.spls)

Mevik, B.-H., Cederkvist, H. R. (2004). Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Journal of Chemometrics 18(9), 422-429.

sparse PLS regression mode:

Lê Cao, K. A., Rossouw D., Robert-Granie, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.

One-sided t-tests (suppl material):

Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ 4:e1845.

See Also

splsda, predict.splsda and http://www.mixOmics.org for more details.

Examples

## sPLS2 model example (more than one Y outcome variable)

# set up data
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

# tune spls model for components only
tune.res.ncomp <- tune.spls( X, Y, ncomp = 5,
                             test.keepX = NULL,
                             test.keepY = NULL, measure = "cor",
                             folds = 5, nrepeat = 3, progressBar = TRUE)
plot(tune.res.ncomp) # plot outputs

# tune spls model for number of X and Y variables to keep
tune.res <- tune.spls( X, Y, ncomp = 3,
                       test.keepX = c(5, 10, 15),
                       test.keepY = c(3, 6, 8), measure = "cor",
                       folds = 5, nrepeat = 3, progressBar = TRUE)
plot(tune.res) # plot outputs

## sPLS1 model example (only one Y outcome variable)

# set up data 
Y1 <- liver.toxicity$clinic[,1]

# tune spls model for components only
plot(tune.spls(X, Y1, ncomp = 3, 
               folds = 3, 
               test.keepX = NULL, test.keepY = NULL))

# tune spls model for number of X variables to keep, note for sPLS1 models 'measure' needs to be set
plot(tune.spls(X, Y1, ncomp = 3, 
               folds = 3, measure = "MSE",
               test.keepX = c(5, 10, 15), test.keepY = c(3, 6, 8)))


## sPLS2 multilevel model example

# set up multilevel design
repeat.indiv <- c(1, 2, 1, 2, 1, 2, 1, 2, 3, 3, 4, 3, 4, 3, 4, 4, 5, 6, 5, 5,
                  6, 5, 6, 7, 7, 8, 6, 7, 8, 7, 8, 8, 9, 10, 9, 10, 11, 9, 9,
                  10, 11, 12, 12, 10, 11, 12, 11, 12, 13, 14, 13, 14, 13, 14,
                  13, 14, 15, 16, 15, 16, 15, 16, 15, 16)
design <- data.frame(sample = repeat.indiv)

# tune spls model for components only
tune.res.ncomp <- tune.spls( X, Y, ncomp = 5,
                             test.keepX = NULL,
                             test.keepY = NULL, measure = "cor", multilevel = design,
                             folds = 5, nrepeat = 3, progressBar = TRUE)
plot(tune.res.ncomp) # plot outputs

# tune spls model for number of X and Y variables to keep
tune.res <- tune.spls( X, Y, ncomp = 3,
                       test.keepX = c(5, 10, 15),
                       test.keepY = c(3, 6, 8), measure = "cor", multilevel = design,
                       folds = 5, nrepeat = 3, progressBar = TRUE)
plot(tune.res) # plot outputs

mixOmicsTeam/mixOmics documentation built on Dec. 3, 2024, 11:15 p.m.