spls.cv: Cross-validation procedure to calibrate the parameters (ncomp, lambda.l1) of the Adaptive Sparse PLS regression


Description

The function spls.cv chooses the optimal values for the hyper-parameters (ncomp, lambda.l1) of the spls procedure by minimizing the mean squared error of prediction over the hyper-parameter grid, using the adaptive sparse PLS algorithm of Durif et al. (2017).

Usage

spls.cv(X, Y, lambda.l1.range, ncomp.range, weight.mat = NULL, adapt = TRUE,
  center.X = TRUE, center.Y = TRUE, scale.X = TRUE, scale.Y = TRUE,
  weighted.center = FALSE, return.grid = FALSE, ncores = 1, nfolds = 10,
  nrun = 1, verbose = FALSE)

Arguments

X

a (n x p) data matrix of predictors. X must be a matrix. Each row corresponds to an observation and each column to a predictor variable.

Y

a (n) vector of (continuous) responses. Y must be a vector or a one column matrix. It contains the response variable for each observation.

lambda.l1.range

a vector of positive real values in [0,1]. lambda.l1 is the sparsity penalty parameter for the dimension-reduction step by sparse PLS (see details); the optimal value will be chosen among lambda.l1.range.

ncomp.range

a vector of positive integers. ncomp is the number of PLS components. The optimal value will be chosen among ncomp.range.

weight.mat

a (ntrain x ntrain) matrix used to weight the l2 metric in the observation space; it can be the inverse of the covariance matrix of the Ytrain observations in a heteroskedastic context. If NULL, the l2 metric is the standard one, corresponding to a homoskedastic model (weight.mat is the identity matrix). A purely illustrative construction is sketched below.
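
As a purely illustrative sketch (the per-observation variances below are made up, not taken from the package), such a weight matrix could be built as the inverse of a diagonal covariance matrix when the observation variances are assumed known:

## illustrative construction of a weight matrix for a heteroskedastic model
sigma2 <- runif(100, min = 0.5, max = 2)   # assumed per-observation variances (made up)
W <- diag(1 / sigma2)                      # inverse covariance matrix, usable as weight.mat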

adapt

a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details).

center.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not.

center.Y

a boolean value indicating whether the response values Ytrain should be centered or not.

scale.Y

a boolean value indicating whether the response values Ytrain should be scaled or not (scale.Y=TRUE implies center.Y=TRUE).

weighted.center

a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weight.mat is non NULL).

return.grid

a boolean value indicating whether the grid of hyper-parameter values, with the corresponding mean prediction error over the folds, should be returned or not.

ncores

a positive integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details).

nfolds

a positive integer indicating the number of folds in the K-fold cross-validation procedure; nfolds=n corresponds to leave-one-out cross-validation. Default is 10.

nrun

a positive integer indicating how many times the K-fold cross-validation procedure should be repeated. Default is 1.

verbose

a boolean value indicating verbosity.

scale.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be scaled or not (scale.X=TRUE implies center.X=TRUE).

Details

The columns of the data matrices Xtrain and Xtest do not need to be standardized beforehand, since standardization can be performed by the function spls.cv as a preliminary step.

The procedure is described in Durif et al. (2017). The K-fold cross-validation can be summarized as follows: the training set is partitioned into K folds; for each value of the hyper-parameters, the model is fitted K times, each time on the observations outside one fold, and the held-out fold is used to compute the prediction error. The cross-validation procedure returns the optimal hyper-parameter values, i.e. the ones that minimize the mean squared error of prediction averaged over all the folds.
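
For intuition only (this is not the package's internal code), the toy sketch below mimics the mechanics of this grid search, with an ordinary least-squares fit standing in for the sparse PLS fit and a single made-up tuning parameter (the number of predictors kept); all object names are illustrative.

## toy sketch of the K-fold grid-search mechanics (OLS stands in for sparse PLS)
set.seed(1)
ntoy <- 50
K <- 5
Xtoy <- matrix(rnorm(ntoy * 5), ntoy, 5)
Ytoy <- drop(Xtoy %*% c(1, -1, 0, 0, 0)) + rnorm(ntoy)
folds <- split(sample(seq_len(ntoy)), rep(seq_len(K), length.out = ntoy))
grid <- 1:5   # made-up tuning parameter: number of predictors kept
cv.error <- sapply(grid, function(k) {
  mean(sapply(folds, function(test.idx) {
    fit <- lm(Ytoy[-test.idx] ~ Xtoy[-test.idx, 1:k, drop = FALSE] - 1)
    pred <- Xtoy[test.idx, 1:k, drop = FALSE] %*% coef(fit)
    mean((Ytoy[test.idx] - pred)^2)   # fold-wise mean squared error of prediction
  }))
})
grid[which.min(cv.error)]   # value minimizing the error averaged over folds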

This procedure uses the mclapply function from the parallel package, which is available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the package source for a way to use an mclapply-type function.
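
Schematically, ncores > 1 corresponds to distributing the evaluations of the hyper-parameter grid with mclapply, along the following lines (toy.error below is a made-up stand-in for the fold-averaged prediction error, and the snippet only runs in parallel on GNU/Linux and MacOS):

## schematic parallel evaluation of a hyper-parameter grid with mclapply
library(parallel)
grid <- expand.grid(lambda.l1 = seq(0.05, 0.95, by = 0.1), ncomp = 1:10)
toy.error <- function(par) sum(unlist(par)^2)   # made-up stand-in for the CV error
errors <- unlist(mclapply(seq_len(nrow(grid)),
                          function(i) toy.error(grid[i, ]),
                          mc.cores = 2))        # plays the role of ncores
grid[which.min(errors), ]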

Value

An object with the following attributes

lambda.l1.opt

the optimal value in lambda.l1.range.

ncomp.opt

the optimal value in ncomp.range.

cv.grid

the grid of hyper-parameter values and the corresponding mean prediction error over the folds. cv.grid is NULL if return.grid is set to FALSE.

Author(s)

Ghislain Durif (http://thoth.inrialpes.fr/people/gdurif/).

References

Durif G., Modolo L., Michaelsson J., Mold J. E., Lambert-Lacroix S., Picard F. (2017). High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression, (in prep), available at http://arxiv.org/abs/1502.05933.

See Also

spls

Examples

## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, 
                       beta.min=0.25, beta.max=0.75, mean.H=0.2, 
                       sigma.H=10, sigma.F=5, sigma.E=5)
                       
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
cv1 <- spls.cv(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
               ncomp.range=ncomp.range, weight.mat=NULL, adapt=TRUE, 
               center.X=TRUE, center.Y=TRUE, 
               scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE, 
               return.grid=TRUE, ncores=1, nfolds=10, nrun=1)
str(cv1)

### optimal values
cv1$lambda.l1.opt
cv1$ncomp.opt

## End(Not run)
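
A natural follow-up, not part of the original example, is to inspect the returned grid and refit on the full data with the selected values; the spls() call below assumes the interface documented in ?spls (Xtrain, Ytrain, lambda.l1, ncomp, ...), so check that help page before running it.

## Not run: 
### inspect the cross-validation grid (available since return.grid=TRUE)
str(cv1$cv.grid)

### refit on the full data with the selected hyper-parameters
### (argument names assumed from ?spls; adjust if they differ)
fit1 <- spls(Xtrain=X, Ytrain=Y, 
             lambda.l1=cv1$lambda.l1.opt, ncomp=cv1$ncomp.opt, 
             weight.mat=NULL, adapt=TRUE, 
             center.X=TRUE, center.Y=TRUE, 
             scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE)

## End(Not run)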

