selwik: Heuristic selection of the dimension of PLSR models with a...


selwik    R Documentation

Heuristic selection of the dimension of PLSR models with a permutation test on scores

Description

The function helps select the dimension (i.e. the number of components) of PLSR models.

The method was proposed by Wiklund et al. 2007 and Faber and Rajko 2007. For a given PLS score t, the principle is to compare the observed covariance Cov(Y, t) (where Y is the response) to the distribution H0 of simulated values of Cov(Y, t) computed on randomly permuted data (in which the relation between Y and X is assumed to have been removed). An observed covariance that is significant with respect to distribution H0 is expected to indicate a meaningful dimension.
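The principle can be sketched in base R for a single component and a univariate response. This is only an illustration under stated assumptions, not the code of selwik: the simulated data, the helper name cov_score, the use of a weight vector proportional to X'y for the first PLS1 score, and the divisor n in the covariance are all choices made for the example.

```r
# Illustrative sketch only (not the selwik implementation).
set.seed(1)
n <- 50
p <- 10
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 1] + 0.5 * rnorm(n)           # y depends on the first column of X

# Hypothetical helper: covariance between y and the first PLS1 score of X.
cov_score <- function(X, y) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  w <- drop(crossprod(Xc, yc))         # weight vector proportional to X'y
  w <- w / sqrt(sum(w^2))              # normalized to unit length
  score <- drop(Xc %*% w)              # first PLS score
  sum(yc * score) / length(yc)         # Cov(y, score), divisor n (assumption)
}

obs <- cov_score(X, y)                 # observed covariance

# Distribution H0: same statistic after randomly permuting y
nperm <- 200
h0 <- replicate(nperm, cov_score(X, sample(y)))

hist(h0, xlim = range(c(h0, obs)), main = "H0 vs observed covariance")
abline(v = obs, col = "red")
```

In this simulated case the observed covariance is expected to lie in the far right tail of the permuted values, which is what a meaningful first dimension looks like under this test.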

The method can be time-consuming, especially for large datasets, since the permutations are conditional on each component taken successively (successive one-dimension PLSR). A one-dimension PLSR is first fitted, the data Y are randomly permuted (referred to as "Y-scrambling"), and distribution H0 is computed. Then the information contained in the first dimension is removed from the data by deflation, the next dimension is studied with a new one-dimension PLSR, and so on.
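The successive steps described above can be sketched as a loop in base R. This is a hedged illustration for a univariate response, not the function's actual code: the helper cov1, the simulated data, and the choice to deflate both X and y on the score are assumptions of the example.

```r
# Illustrative sketch only (not the selwik implementation).
set.seed(2)
n <- 40
p <- 6
X <- matrix(rnorm(n * p), nrow = n)
X <- X - matrix(colMeans(X), n, p, byrow = TRUE)   # column-centered X
y <- X[, 1] - X[, 2] + 0.3 * rnorm(n)
y <- y - mean(y)                                   # centered y

ncomp <- 3
nperm <- 100
pval <- numeric(ncomp)

# Hypothetical helper: Cov(y, t) for the first PLS1 score t of centered data.
# With w = X'y / ||X'y|| and t = Xw, y't equals ||X'y||.
cov1 <- function(X, y) sqrt(sum(crossprod(X, y)^2)) / nrow(X)

for (a in seq_len(ncomp)) {
  # One-dimension PLSR on the current (deflated) data
  obs <- cov1(X, y)
  # Y-scrambling: distribution H0 for this component
  h0 <- replicate(nperm, cov1(X, sample(y)))
  pval[a] <- mean(h0 > obs)

  # Deflation: remove the information carried by the current score
  w <- drop(crossprod(X, y))
  w <- w / sqrt(sum(w^2))
  score <- drop(X %*% w)
  tt <- sum(score^2)
  loadings <- drop(crossprod(X, score)) / tt
  X <- X - outer(score, loadings)
  y <- y - score * sum(score * y) / tt
}

pval
```

Each pass repeats nperm one-dimension fits, so the cost grows roughly with ncomp times nperm, which is why the procedure becomes slow on large datasets.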

Wiklund et al. 2007 and Faber and Rajko 2007 presented the method for PLSR1 models only (univariate Y). The function extends the method to PLSR2 (multivariate Y).

The function returns the p-value of the one-sided test, i.e. the proportion of distribution H0 that is higher than the observed covariance.
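As a small worked example of this proportion (the numbers are made up for illustration and do not come from the function):

```r
# Hypothetical values, for illustration only
h0 <- c(0.20, 0.50, 0.10, 0.90, 0.40)   # simulated covariances under H0
obs <- 0.45                             # observed covariance
pval <- mean(h0 > obs)                  # proportion of H0 values above obs
pval                                    # 0.4 (two of the five values exceed obs)
```

If the p-value is computed as a plain proportion like this, it can only take multiples of 1 / nperm, so nperm = 50 gives a resolution of 0.02.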

Usage


selwik(
    X, Y, ncomp, 
    algo = NULL, weights = NULL,
    nperm = 50, seed = NULL, 
    print = TRUE, 
    ...
    )

Arguments

X

An n x p matrix or data frame of variables.

Y

An n x q matrix or data frame, or a vector of length n for PLS1, of responses.

ncomp

The maximum number of scores (i.e. components, or latent variables) to compute.

algo

A function implementing a PLS. Defaults to NULL (pls_kernel is used).

weights

A vector of length n defining a priori weights to apply to the observations. Internally, weights are "normalized" to sum to 1. Defaults to NULL (weights are set to 1 / n).

nperm

Number of random permutations.

seed

An integer defining the seed for the random simulation, or NULL (default). See set.seed.

print

Logical. If TRUE, fitting information is printed.

...

Optional arguments to pass to the function defined in algo.
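The behavior described for the weights and seed arguments can be illustrated in base R. This is a minimal sketch of the stated conventions (normalization to sum 1, default 1 / n, seeding through set.seed), using made-up values; it does not reproduce the internal code of the function.

```r
# Illustration of the 'weights' convention (hypothetical values)
n <- 4
weights <- c(1, 1, 2, 4)               # a priori weights given by the user
weights <- weights / sum(weights)      # normalized to sum to 1
default_weights <- rep(1 / n, n)       # what weights = NULL corresponds to

# Illustration of 'seed': fixing the seed makes the permutations repeatable
y <- 1:10
set.seed(123)
perm1 <- sample(y)
set.seed(123)
perm2 <- sample(y)
identical(perm1, perm2)                # TRUE
```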

Value

A list of outputs (the examples use its elements ncomp and pval); see the examples.

References

Faber, N.M., Rajko, R., 2007. How to avoid over-fitting in multivariate calibration—The conventional validation approach and an alternative. Analytica Chimica Acta, Papers presented at the 10th International Conference on Chemometrics in Analytical Chemistry 595, 98-106. https://doi.org/10.1016/j.aca.2007.05.030

Wiklund, S., Nilsson, D., Eriksson, L., Sjöström, M., Wold, S., Faber, K., 2007. A randomization test for PLS component selection. Journal of Chemometrics 21, 427–439. https://doi.org/10.1002/cem.1086

Examples


data(datcass)
Xr <- datcass$Xr
yr <- datcass$yr

z <- selwik(Xr, yr, ncomp = 20, nperm = 30)
names(z)
plot(z$ncomp, z$pval,
     type = "b", pch = 16, col = "#045a8d",
     xlab = "Nb components", ylab = "p-value",
     main = "Wiklund et al. test")
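# Significance level of the test
alpha <- .10
abline(h = alpha, col = "grey")
# Retain the components that precede the first non-significant one
u <- which(z$pval >= alpha)
# If every tested component is significant, keep them all
opt <- if (length(u) > 0) min(u) - 1 else max(z$ncomp)
opt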


mlesnoff/rnirs documentation built on April 24, 2023, 4:17 a.m.