Cross model validation

Description

Performs cross model validation (2CV) with different PLS analyses.

Usage

MVA.cmv(X, Y, repet = 10, kout = 7, kinn = 6, ncomp = 8, scale = TRUE,
  model = c("PLSR", "CPPLS", "PLS-DA", "PPLS-DA", "PLS-DA/LDA", "PLS-DA/QDA",
  "PPLS-DA/LDA", "PPLS-DA/QDA"), crit.inn = c("RMSEP", "Q2", "NMC"),
  Q2diff = 0.05, lower = 0.5, upper = 0.5, Y.add = NULL, weights = rep(1, nrow(X)),
  set.prior = FALSE, crit.DA = c("plug-in", "predictive", "debiased"), ...)

Arguments

X

a data frame of independent variables.

Y

the dependent variable(s): numeric vector, data frame of quantitative variables or factor.

repet

an integer giving the number of times the whole 2CV procedure has to be repeated.

kout

an integer giving the number of folds in the outer loop (can be re-set internally if needed).

kinn

an integer giving the number of folds in the inner loop (can be re-set internally if needed). Cannot be > kout.

ncomp

an integer giving the maximal number of components to be tested in the inner loop (can be re-set depending on the size of the train sets).

scale

logical indicating if data should be scaled (see Details).

model

the model to be fitted (see Details).

crit.inn

the criterion to be used to choose the number of components in the inner loop. Root Mean Square Error of Prediction ("RMSEP", default) and Q2 ("Q2") are only used for PLSR and CPPLS, whereas the Number of MisClassifications ("NMC") is only used for discriminant analyses.

Q2diff

the threshold to be used if the number of components is chosen according to Q2. The next component is added only if it increases Q2 by more than Q2diff (5% by default).

lower

a vector of lower limits for power optimisation in CPPLS or PPLS-DA (see cppls.fit).

upper

a vector of upper limits for power optimisation in CPPLS or PPLS-DA (see cppls.fit).

Y.add

a vector or matrix of additional responses containing relevant information about the observations, in CPPLS or PPLS-DA (see cppls.fit).

weights

a vector of individual weights for the observations, in CPPLS or PPLS-DA (see cppls.fit).

set.prior

only used when a second analysis (LDA or QDA) is performed. If TRUE, the prior probabilities of class membership are defined according to the mean weight of individuals belonging to each class. If FALSE, prior probabilities are obtained from the data sets on which LDA/QDA models are built.

crit.DA

criterion used to predict class membership when a second analysis (LDA or QDA) is used. See predict.lda.

...

other arguments to pass to plsr (PLSR, PLS-DA) or cppls (CPPLS, PPLS-DA).
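
As an illustration of the Q2diff rule described for the crit.inn and Q2diff arguments, here is a minimal sketch in base R. The helper choose_ncomp_q2 is hypothetical and not part of the package, and it assumes that Q2diff is compared with the absolute increase in Q2 between consecutive numbers of components; the actual internal logic of MVA.cmv may differ.

```r
# Hypothetical helper (not part of RVAideMemoire): choose the number of
# components from a vector of Q2 values, where q2[i] is the Q2 obtained
# with i components.
# Assumption: a component is kept only if it raises Q2 by more than Q2diff
# (absolute difference) compared with the previous number of components.
choose_ncomp_q2 <- function(q2, Q2diff = 0.05) {
  n <- 1
  while (n < length(q2) && (q2[n + 1] - q2[n]) > Q2diff) {
    n <- n + 1
  }
  n
}

# Illustrative Q2 values (made up for the example)
q2 <- c(0.40, 0.55, 0.63, 0.65, 0.80)
choose_ncomp_q2(q2)  # 3: the 4th component adds only 0.02
```

Under this rule the search stops at the first component that fails the threshold, so a later large increase (here from 4 to 5 components) is not considered.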

Details

Cross model validation is detailed in Szymanska et al (2012). Some more details about how this function works:

- when a discriminant analysis is used ("PLS-DA", "PPLS-DA", "PLS-DA/LDA", "PLS-DA/QDA", "PPLS-DA/LDA" or "PPLS-DA/QDA"), the training sets (test set itself in the inner loop, test+validation sets in the outer loop) are generated with respect to the relative proportions of the levels of Y in the original data set (see splitf).

- "PLS-DA" is considered as PLS2 on a dummy-coded response. For a PLS-DA based on the CPPLS algorithm, use "PPLS-DA" with lower and upper limits of the power parameters set to 0.5.

- if a second analysis is used ("PLS-DA/LDA", "PLS-DA/QDA", "PPLS-DA/LDA" or "PPLS-DA/QDA"), an LDA or QDA is built on the scores of the first analysis (PLS-DA or PPLS-DA), also in the inner loop. The number of misclassifications, based on this second analysis, is used to choose the number of components.
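
The stratified splitting mentioned in the first point above can be illustrated with a small base R sketch. The helper stratified_folds is hypothetical and only shows the general idea of keeping the proportions of the levels of Y in each fold; it is not the implementation used by splitf or MVA.cmv.

```r
# Hypothetical helper (not part of RVAideMemoire): assign each observation
# to one of k folds so that every fold contains roughly the same proportion
# of each level of the factor y.
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    # shuffle the observations of this level, then deal them out to the
    # folds in turn (1, 2, ..., k, 1, 2, ...)
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

set.seed(1)
y <- factor(rep(c("a", "b"), times = c(14, 7)))  # 2:1 class ratio
folds <- stratified_folds(y, k = 7)
table(y, folds)  # each fold holds 2 "a" and 1 "b"
```

Because observations are shuffled within each level before being dealt out, the fold composition is fixed by the class sizes while the membership of each fold is random.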

If scale = TRUE, scaling is done as follows:

- for each step of the outer loop (i.e. kout steps), the rest set is pre-processed by centering and unit-variance scaling. Means and standard deviations of variables in the rest set are then used to scale the test set.

- for each step of the inner loop (i.e. kinn steps), the training set is pre-processed by centering and unit-variance scaling. Means and standard deviations of variables in the training set are then used to scale the validation set.
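
The scaling scheme described above, where the held-out set is scaled with the means and standard deviations of the set used for fitting, can be sketched in base R as follows. The data and object names are made up for the illustration, and this is not the package's internal code.

```r
# Simulated data for the illustration: 10 observations, 4 variables
set.seed(42)
X <- matrix(rnorm(40, mean = 5, sd = 2), nrow = 10)

# One step of a loop: observations 1 to 3 are held out
test_idx <- 1:3
train <- X[-test_idx, , drop = FALSE]
test  <- X[test_idx, , drop = FALSE]

# Center and scale the fitting set to unit variance
train_scaled <- scale(train, center = TRUE, scale = TRUE)

# Reuse the fitting-set means and standard deviations on the held-out set
mu   <- attr(train_scaled, "scaled:center")
sdev <- attr(train_scaled, "scaled:scale")
test_scaled <- scale(test, center = mu, scale = sdev)
```

The held-out set is never used to compute the scaling parameters, so its scaled columns will generally not have mean 0 and variance 1.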

Value

model

model used.

type

type of model used.

repet

number of times the whole 2CV procedure was repeated.

kout

number of folds in the outer loop.

kinn

number of folds in the inner loop.

crit.inn

criterion used to choose the number of components in the inner loop.

crit.DA

criterion used to classify individuals of the test and validation sets.

Q2diff

threshold used if the number of components is chosen according to Q2.

groups

levels of Y if it is a factor.

models.list

list of models generated (repet*kout models), for PLSR, CPPLS, PLS-DA and PPLS-DA.

models1.list

list of (P)PLS-DA models generated (repet*kout models), for PLS-DA/LDA, PLS-DA/QDA, PPLS-DA/LDA and PPLS-DA/QDA.

models2.list

list of LDA/QDA models generated (repet*kout models), for PLS-DA/LDA, PLS-DA/QDA, PPLS-DA/LDA and PPLS-DA/QDA.

RMSEP

RMSEP computed from the models used in the outer loops (repet values).

Q2

Q2 computed from the models used in the outer loops (repet values).

NMC

NMC computed from the models used in the outer loops (repet values).
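
For readers unfamiliar with the three criteria, the following sketch gives their usual textbook definitions in base R. The function names are hypothetical, and the exact way MVA.cmv aggregates predictions over folds (for instance, which mean is used as the reference in Q2) is not specified here and may differ.

```r
# Usual definitions, written as hypothetical helpers (not package functions)

# Root Mean Square Error of Prediction
rmsep <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Q2 = 1 - PRESS / TSS; assumption: TSS is taken around the mean of obs
q2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Number of MisClassifications
nmc <- function(obs, pred) sum(as.character(obs) != as.character(pred))

obs  <- c(1, 2, 3, 4)
pred <- c(1, 2, 3, 6)
rmsep(obs, pred)  # 1
q2(obs, pred)     # 0.2

nmc(c("a", "b", "a"), c("a", "a", "a"))  # 1
```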

Author(s)

Maxime Hervé <mx.herve@gmail.com>

References

Szymanska E, Saccenti E, Smilde AK and Westerhuis J (2012) Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics 8:S3-S16.

See Also

predict.MVA.cmv, mvr, lda, qda

Examples

require(pls)
require(MASS)

# PLSR
data(yarn)
## Not run: MVA.cmv(yarn$NIR,yarn$density,model="PLSR")

# PPLS-DA coupled to LDA
data(mayonnaise)
## Not run: MVA.cmv(mayonnaise$NIR,factor(mayonnaise$oil.type),model="PPLS-DA/LDA",crit.inn="NMC")
