d.spls.cv: Determination of the number of latent components to be used...

d.spls.cv    R Documentation

Determination of the number of latent components to be used in a Dual-SPLS regression

Description

The function d.spls.cv uses the cross-validation approach described in Boulesteix and Strimmer (2005) (see References) to choose the most adequate number of latent components for a Dual-SPLS regression.

Usage

d.spls.cv(X,Y,ncomp,dspls="lasso",ppnu,nu2,nrepcv=30,pctcv=70,indG,gamma)

Arguments

X

a numeric matrix of predictor values of dimension (n,p). Each row represents one observation and each column one predictor variable.

Y

a numeric vector or a one-column matrix of responses, with one value per observation.

ncomp

a positive integer or a numeric vector giving the candidate numbers of Dual-SPLS components to choose from.

dspls

the norm type of the Dual-SPLS regression applied. Default value is lasso. Options are pls, LS, ridge, GLA, GLB and GLC.

ppnu

a positive real value, in [0,1]. ppnu is the desired proportion of variables to shrink to zero for each component (see Dual-SPLS methodology).

nu2

a positive real value. nu2 is a constraint parameter used in the ridge norm.

nrepcv

a positive integer indicating the number of cross-validation iterations to be performed. Default value is 30.

pctcv

a positive real value in [0,100] representing the percentage of observations assigned to the calibration set at each CV iteration. Default value is 70.

indG

a numeric vector giving the group index of each observation. It is used with the group lasso norms.

gamma

a numeric vector of the norm Ω of each w_g, used when dspls is GLB.
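
To make the role of ppnu more concrete, here is a minimal base R sketch of shrinking a chosen proportion of a vector's entries to exactly zero by soft thresholding. It is an illustration only: soft_threshold_prop is a hypothetical helper, and the way the threshold is picked (the k-th smallest absolute value) is an assumption, not the package's internal code.

```r
# Hypothetical helper, not part of dual.spls: soft-threshold z so that
# a proportion ppnu of its entries become exactly zero.
soft_threshold_prop <- function(z, ppnu) {
  k <- floor(ppnu * length(z))                # number of entries to zero out
  nu <- if (k > 0) sort(abs(z))[k] else 0     # assumed threshold choice
  sign(z) * pmax(abs(z) - nu, 0)              # shrink towards zero
}

z <- c(-3, 0.5, 2, -1, 4, 0.1, -0.2, 1.5, 2.5, -0.7)
w <- soft_threshold_prop(z, ppnu = 0.5)
sum(w == 0)   # number of entries set to zero
```

With distinct absolute values, as in this toy vector, exactly floor(ppnu * length(z)) entries end up at zero; ties could zero out more.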

Details

The procedure is described in Boulesteix and Strimmer (2005). At each cross-validation iteration, pctcv% of the observations are randomly selected as the calibration set and a Dual-SPLS regression is fitted on them. The remaining observations are used as the validation set, and the prediction errors are computed for each number of components. This is repeated nrepcv times, and the mean of the squared errors over the nrepcv iterations is computed for each number of components. The number of components with the smallest mean value is selected as the best.
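
The repeated random-split scheme described above can be sketched in base R as follows. This is not the package's implementation: principal component regression stands in for the Dual-SPLS fit so that the example runs without dual.spls, cv_ncomp_sketch is a hypothetical name, and rounding the calibration size down with floor is an assumption.

```r
# Hypothetical sketch of repeated random-split cross-validation.
# Assumption: principal component regression replaces the Dual-SPLS fit.
cv_ncomp_sketch <- function(X, y, ncomp, nrepcv = 30, pctcv = 70) {
  n <- nrow(X)
  ncal <- floor(n * pctcv / 100)              # calibration set size (assumed rounding)
  err <- matrix(NA_real_, nrow = nrepcv, ncol = length(ncomp))
  for (r in seq_len(nrepcv)) {
    cal <- sample(n, ncal)                    # random calibration indices
    pca <- prcomp(X[cal, , drop = FALSE], center = TRUE, scale. = FALSE)
    # project the validation observations onto the calibration components
    Tv <- sweep(X[-cal, , drop = FALSE], 2, pca$center) %*% pca$rotation
    for (j in seq_along(ncomp)) {
      k <- ncomp[j]
      fit <- lm.fit(cbind(1, pca$x[, 1:k, drop = FALSE]), y[cal])
      pred <- drop(cbind(1, Tv[, 1:k, drop = FALSE]) %*% fit$coefficients)
      err[r, j] <- mean((y[-cal] - pred)^2)   # validation mean squared error
    }
  }
  ncomp[which.min(colMeans(err))]             # candidate with the smallest mean error
}

set.seed(1)
X <- matrix(rnorm(60 * 8), nrow = 60, ncol = 8)
y <- X[, 1] + 0.1 * rnorm(60)
best <- cv_ncomp_sketch(X, y, ncomp = 1:5, nrepcv = 5, pctcv = 70)
```

The function returns a single candidate from ncomp, mirroring the integer described under Value.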

Value

An integer representing the best number of latent components to be used in the Dual-SPLS regression based on the cross-validation procedure.

Author(s)

Louna Alsouki, François Wahl

References

A. L. Boulesteix and K. Strimmer (2005). Predicting Transcription Factor Activities from Combined Analysis of Microarray and ChIP Data: A Partial Least Squares Approach.

H. Wold. Path Models with Latent Variables: The NIPALS Approach. In H.M. Blalock et al., editor, Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, pages 307–357. Academic Press, 1975.

Examples

### load dual.spls library
library(dual.spls)
### constructing the simulated example
oldpar <- par(no.readonly = TRUE)
n <- 100
p <- 50
nondes <- 20
sigmaondes <- 0.5
data=d.spls.simulate(n=n,p=p,nondes=nondes,sigmaondes=sigmaondes)

X <- data$X
y <- data$y

#fitting the PLS model
ncomp_PLS <- d.spls.cv(X=X,Y=y,ncomp=10,dspls="pls",nrepcv=20,pctcv=75)
mod.dspls.pls <- d.spls.pls(X=X,y=y,ncp=ncomp_PLS,verbose=TRUE)

str(mod.dspls.pls)

### plotting the observed values VS predicted values for ncomp components
plot(y,mod.dspls.pls$fitted.values[,ncomp_PLS], xlab="Observed values", ylab="Predicted values",
 main=paste("Observed VS Predicted for ", ncomp_PLS," components"))
points(-1000:1000,-1000:1000,type='l')

### plotting the regression coefficients
par(mfrow=c(3,1))

i=ncomp_PLS
# ylab and xlab are arguments of plot(), not of paste()
plot(1:dim(X)[2],mod.dspls.pls$Bhat[,i],type='l',
    main=paste(" Dual-SPLS (PLS), ncp =", i),
    ylab='',xlab='' )


#fitting the Dual-SPLS lasso model

ncomplasso <- d.spls.cv(X=X,Y=y,ncomp=10,dspls="lasso",ppnu=0.9,nrepcv=20,pctcv=75)
mod.dspls.lasso <- d.spls.lasso(X=X,y=y,ncp=ncomplasso,ppnu=0.9,verbose=TRUE)

str(mod.dspls.lasso)

### plotting the observed values VS predicted values for ncomp components
plot(y,mod.dspls.lasso$fitted.values[,ncomplasso], xlab="Observed values", ylab="Predicted values",
main=paste("Observed VS Predicted for ", ncomplasso," components"))
points(-1000:1000,-1000:1000,type='l')

### plotting the regression coefficients
par(mfrow=c(3,1))

i=ncomplasso
nz=mod.dspls.lasso$zerovar[i]
plot(1:dim(X)[2],mod.dspls.lasso$Bhat[,i],type='l',
    main=paste(" Dual-SPLS (lasso), ncp =", i, " #0coef =", nz, "/", dim(X)[2]),
    ylab='',xlab='' )
inonz=which(mod.dspls.lasso$Bhat[,i]!=0)
points(inonz,mod.dspls.lasso$Bhat[inonz,i],col='red',pch=19,cex=0.5)
legend("topright", legend ="non null values", bty = "n", cex = 0.8, col = "red",pch=19)
par(oldpar)

dual.spls documentation built on April 19, 2023, 1:07 a.m.