tune.spls: Tuning functions for sPLS method
In mixOmics: Omics Data Integration Project

Computes M-fold or Leave-One-Out Cross-Validation scores on a user-input grid to determine optimal values for the sparsity parameters in spls.

tune.spls(X, Y, ncomp = 1,
test.keepX = c(5, 10, 15), already.tested.X,
validation = "Mfold", folds = 10, measure = "MSE", scale = TRUE,
progressBar = TRUE, tol = 1e-06, max.iter = 100, near.zero.var = FALSE,
nrepeat = 1, multilevel = NULL, light.output = TRUE, cpus)

`X`	numeric matrix of predictors. `NA`s are allowed.
`Y`	numeric vector or matrix of continuous responses (a matrix for multi-response models). `NA`s are allowed.
`ncomp`	the number of components to include in the model.
`test.keepX`	numeric vector of the different numbers of variables to test from the X data set.
`already.tested.X`	Optional, used if `ncomp > 1`. A numeric vector indicating the number of variables to select from the X data set on the first components.
`validation`	character. What kind of (internal) validation to use, matching one of `"Mfold"` or `"loo"` (see below). Default is `"Mfold"`.
`folds`	the number of folds in the M-fold cross-validation. See Details.
`measure`	One of `MSE`, `MAE`, `Bias` or `R2`. Defaults to `MSE`. See Details.
`scale`	boolean. If scale = TRUE, each block is standardized to zero mean and unit variance (default: TRUE).
`progressBar`	by default set to `TRUE` to output the progress bar of the computation.
`tol`	Convergence stopping value.
`max.iter`	integer, the maximum number of iterations.
`near.zero.var`	boolean, see the internal `nearZeroVar` function (should be set to TRUE in particular for data with many zero values). Default value is FALSE
`nrepeat`	Number of times the Cross-Validation process is repeated.
`multilevel`	Design matrix for multilevel analysis (for repeated measurements) that indicates the repeated measures on each individual, i.e. the individuals ID. See Details.
`light.output`	if set to FALSE, the prediction/classification of each sample for each of `test.keepX` and each comp is returned.
`cpus`	Number of cpus to use when running the code in parallel.

This tuning function should be used to tune the parameters in the spls function (number of components and the number of variables in keepX to select).

If validation = "loo", leave-one-out cross-validation is performed, and folds is by default set to the number of unique individuals. If validation = "Mfold", M-fold cross-validation is performed, with the number of folds given by folds.
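As a rough illustration of the two validation schemes, the base-R sketch below assigns samples to test sets. It is not the mixOmics implementation; the sample size `n` and the variable names are assumptions made for the example.

```r
# Illustrative sketch only: base-R fold assignment, not mixOmics internals.
set.seed(42)                      # reproducible shuffling
n <- 20                           # hypothetical number of samples
folds <- 5                        # M in M-fold cross-validation

# M-fold: build balanced fold labels, shuffle them, then group sample indices.
fold.id <- sample(rep(seq_len(folds), length.out = n))
mfold.test <- split(seq_len(n), fold.id)

# Leave-one-out: every sample forms its own test set, i.e. folds equals n.
loo.test <- as.list(seq_len(n))

lengths(mfold.test)   # size of each M-fold test set
length(loo.test)      # number of leave-one-out test sets
```

Each sample appears in exactly one test set under both schemes; leave-one-out is simply the limiting case in which the number of folds equals the number of samples.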

Four measures of accuracy are available: Mean Absolute Error (MAE), Mean Square Error (MSE), Bias and R2. Both MAE and MSE average the model prediction error. MAE measures the average magnitude of the errors without considering their direction: it is the average, over the fold test samples, of the absolute differences between the Y predictions and the actual Y observations. MSE also measures the average magnitude of the error, but because the errors are squared before they are averaged, it gives a relatively high weight to large errors. Bias is the average of the differences between the Y predictions and the actual Y observations, and R2 is the correlation between the predictions and the observations. All of these measures are averaged across all Y variables in the PLS2 case. We are still improving the function to tune an sPLS2 model; contact us for more details and examples.
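The four measures can be written out in a few lines of base R. This is a sketch following the definitions above, on made-up toy values; it is not taken from the package source, and the variable names are assumptions.

```r
# Toy values (hypothetical) standing in for one fold's test samples.
y.obs  <- c(1.0, 2.0, 3.0, 4.0)
y.pred <- c(1.5, 1.5, 3.5, 6.0)   # the last prediction is deliberately far off

err <- y.pred - y.obs             # signed prediction errors

mae  <- mean(abs(err))            # average error magnitude, direction ignored
mse  <- mean(err^2)               # squaring gives the large error more weight
bias <- mean(err)                 # signed average; opposite errors can cancel
r2   <- cor(y.pred, y.obs)        # "R2" as described above: the correlation

c(MAE = mae, MSE = mse, Bias = bias, R2 = r2)
```

With one large error among small ones, MSE exceeds MAE, which reflects the extra weight that squaring places on large errors.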

The function outputs the optimal number of components, that is, the number that achieves the best performance based on the chosen measure of accuracy. The assessment is data-driven and similar to the process detailed in Rohart et al. (2016), where one-sided t-tests assess whether there is a gain in performance when adding a component to the model.
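The idea of a one-sided t-test for adding a component can be sketched with `t.test` from base R. The error values, the significance level `alpha` and the use of an unpaired Welch test are all assumptions for illustration; the exact procedure used by the package is the one described in the cited reference. Note that such a test needs several repeats to compare.

```r
# Hypothetical per-repeat MSE values for models with 1 and 2 components.
mse.comp1 <- c(0.52, 0.49, 0.55, 0.51, 0.53)
mse.comp2 <- c(0.41, 0.43, 0.40, 0.44, 0.42)

# One-sided alternative: the error with 1 component is greater than with 2.
test <- t.test(mse.comp1, mse.comp2, alternative = "greater")

alpha <- 0.05   # assumed significance level
# Keep the second component only if it significantly reduces the error.
choice.ncomp <- if (test$p.value < alpha) 2 else 1
choice.ncomp
```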

See also ?perf for more details.

A list that contains:

`error.rate`	returns the prediction error for each `test.keepX` on each component, averaged across all repeats and subsampling folds. Standard deviation is also output. All error rates are also available as a list.
`choice.keepX`	returns the number of variables selected (optimal keepX) on each component.
`choice.ncomp`	returns the optimal number of components for the model fitted with `$choice.keepX` and `$choice.keepY`
`measure`	records which criterion was used.
`predict`	Prediction values for each sample, each `test.keepX,test.keepY`, each comp and each repeat. Returned only if light.output = FALSE.
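To make the relationship between `error.rate` and `choice.keepX` concrete, the sketch below picks, for each component, the `test.keepX` value with the smallest mean error. The error table is invented and the minimum-error rule is an assumption about how an optimum could be chosen for an error measure such as MSE; it is not a statement of the package's exact selection rule.

```r
# Hypothetical mean MSE values: rows are test.keepX values, columns are components.
test.keepX <- c(5, 10, 15)
error.rate <- matrix(c(0.60, 0.48, 0.51,    # comp1 (matrix is filled column-wise)
                       0.45, 0.47, 0.40),   # comp2
                     nrow = 3,
                     dimnames = list(as.character(test.keepX),
                                     c("comp1", "comp2")))

# For an error measure, smaller is better: find the best row in each column.
best.row <- apply(error.rate, 2, which.min)
choice.keepX <- test.keepX[best.row]
names(choice.keepX) <- colnames(error.rate)
choice.keepX
```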

Kim-Anh Lê Cao, Benoit Gautier, Francois Bartolo, Florian Rohart.

mixOmics article:

Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

PLS and PLS criteria for PLS regression: Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.

Chavent, Marie and Patouille, Brigitte (2003). Calcul des coefficients de regression et du PRESS en regression PLS1. Modulad n, 30 1-11. (this is the formula we use to calculate the Q2 in perf.pls and perf.spls)

Mevik, B.-H., Cederkvist, H. R. (2004). Mean Squared Error of Prediction (MSEP) Estimates for Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Journal of Chemometrics 18(9), 422-429.

sparse PLS regression mode:

Lê Cao, K. A., Rossouw D., Robert-Granie, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.

One-sided t-tests (suppl material):

Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ 4:e1845.

splsda, predict.splsda and http://www.mixOmics.org for more details.

## Not run: 
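library(mixOmics)

# liver.toxicity ships with mixOmics: gene expression (X) and clinical measures (Y)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

# tune keepX on 4 components, repeating the default 10-fold cross-validation 3 times
tune.res <- tune.spls(X, Y, ncomp = 4, test.keepX = c(5, 10, 15),
                      measure = "MSE", nrepeat = 3, progressBar = TRUE)

# optimal number of components and optimal keepX on each component
tune.res$choice.ncomp
tune.res$choice.keepX

# plot the results
plot(tune.res)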

## End(Not run)