selcoll: Heuristic selection of the dimension of a PCA or PLS model...


selcoll R Documentation

Heuristic selection of the dimension of a PCA or PLS model using collinearity between bootstrapped loading (or b-coefficient) vectors

Description

The function helps select the dimension (i.e. the number of components) of a PCA or PLS model by bootstrapping the observations and examining the collinearity of the loading vectors of the same rank or, for univariate PLS, optionally of the b-coefficient vectors.

The principle is detailed below for loading vectors (the same applies to b-coefficient vectors).

A non-parametric bootstrap is applied to the rows of matrix X (and of Y for PLS), and the loading matrices P(b), b = 1,...,B, are calculated for each bootstrap replication b, each with A columns. For a given model dimension a <= A, the B loading vectors of rank a (column a of each matrix P(b)) are collected in a matrix V(a), which therefore has B columns.
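The construction of the matrices V(a) can be sketched with base R alone. This is a minimal illustration, not the rnirs implementation: the helper name boot_loadings is hypothetical, and prcomp stands in for the PCA algorithms used by the package.

```r
# Hypothetical helper (not part of rnirs): bootstrap PCA loadings with base R.
boot_loadings <- function(X, ncomp, B = 50, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  X <- as.matrix(X)
  n <- nrow(X)
  # One p x B matrix V(a) per component rank a = 1, ..., ncomp
  V <- replicate(ncomp, matrix(NA_real_, ncol(X), B), simplify = FALSE)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)  # resample rows with replacement
    # prcomp is a stand-in for the PCA algorithm (assumption)
    P <- prcomp(X[idx, , drop = FALSE])$rotation
    for (a in seq_len(ncomp)) V[[a]][, b] <- P[, a]
  }
  V
}

# Simulated data, for illustration only
set.seed(1)
X <- matrix(rnorm(40 * 6), nrow = 40)
V <- boot_loadings(X, ncomp = 3, B = 20, seed = 1)
dim(V[[1]])  # p rows, B columns
```

Each element V[[a]] plays the role of V(a): its columns are the B bootstrapped loading vectors of rank a.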

Then, two alternative measures of collinearity are proposed, depending on argument coll:

1) Default method (coll = "corr"). Correlation coefficients are calculated between all pairs of columns of V(a) and stored in a vector v. The collinearity indicator r is the quantile of order prob of the elements of v (by default, prob = 1, corresponding to max(v)).

2) Alternative method (coll = "eig"). An SVD of V(a) is computed, and the collinearity measure r between the B vectors is the proportion of variance accounted for by the first SVD dimension (i.e. r = eig[1] / sum(eigs)).
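The two measures can be sketched as follows for a single matrix V(a). The helper names coll_corr and coll_eig are hypothetical. Two details are assumptions rather than facts about the package: absolute correlations are used because the sign of a loading vector is arbitrary, and V(a) is left uncentered before the SVD.

```r
# Hypothetical helpers (not the rnirs implementation).
coll_corr <- function(V, prob = 1) {
  C <- cor(V)                         # B x B correlations between columns
  v <- abs(C[upper.tri(C)])           # assumption: sign of a loading is arbitrary
  unname(quantile(v, probs = prob))   # prob = 1 gives max(v)
}

coll_eig <- function(V) {
  eig <- svd(V)$d^2                   # squared singular values; V uncentered (assumption)
  eig[1] / sum(eig)                   # share of the first SVD dimension
}

# Toy comparison: nearly collinear columns versus pure noise
set.seed(1)
u <- rnorm(50)
V_stable <- sapply(1:10, function(b) u + rnorm(50, sd = 0.05))
V_noisy  <- matrix(rnorm(50 * 10), nrow = 50)

c(corr = coll_corr(V_stable, prob = 0.5), eig = coll_eig(V_stable))
c(corr = coll_corr(V_noisy,  prob = 0.5), eig = coll_eig(V_noisy))
```

Both indicators should come out clearly higher for V_stable than for V_noisy, which is the contrast the method relies on.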

Low collinearity between the vectors of rank a (columns of matrix V(a)) may indicate that they were estimated with large uncertainty (generating instability in V(a)). Jumps in the curve of r against a, followed by regular patterns, are also informative.
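To illustrate how such a curve may look, the sketch below computes r for a = 1,...,A on simulated data with two strong latent directions plus noise. Everything here is an assumption made for illustration (simulated data, prcomp as the PCA algorithm, the SVD-based measure on uncentered V(a)); it does not reproduce the output of selcoll.

```r
# Hypothetical sketch (base R only, not the rnirs code): curve of r against a.
set.seed(1)
n <- 100; p <- 20; A <- 6; B <- 30

# Simulated data: two strong latent directions plus unit-variance noise
scores <- matrix(rnorm(n * 2), n) %*% diag(c(10, 5))
X <- scores %*% matrix(rnorm(2 * p), 2) + matrix(rnorm(n * p), n)

# p x A x B array of bootstrapped loadings
P <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  prcomp(X[idx, ])$rotation[, 1:A]
})

# SVD-based collinearity for each rank a; V(a) is P[, a, ], a p x B matrix
r <- sapply(1:A, function(a) {
  eig <- svd(P[, a, ])$d^2
  eig[1] / sum(eig)
})

round(r, 2)
plot(1:A, r, type = "b", xlab = "Nb. components", ylab = "r")
```

With data of this kind, r is expected to stay high for the ranks that carry structure and to drop once the loadings mostly describe noise; the position of that drop is the heuristic cue for the model dimension.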

Usage


selcoll(
    X, Y = NULL, ncomp = NULL, algo = NULL,
    B = 50, seed = NULL,
    type = c("P", "b"),
    coll = c("corr", "eig"),
    prob = 1,
    plot = TRUE, 
    xlab = "Nb. components", ylab = NULL,
    print = TRUE, 
    ...
    )

Arguments

X

A n x p matrix or data frame of variables.

Y

For PLS, an n x q matrix or data frame, or a vector of length n, of responses. If NULL (default), a PCA is implemented.

ncomp

The maximal number of PCA or PLS scores (= components = latent variables) to be calculated.

algo

For PCA, a function (algorithm) implementing a PCA. Defaults to NULL: if n < p, pca_eigenk is used; otherwise, pca_eigen is used. For PLS, a function implementing a PLS. Defaults to NULL (pls_kernel is used).

B

Number of bootstrap replications.

seed

An integer defining the seed for the random simulation, or NULL (default). See set.seed.

type

Type of output whose stability is evaluated. Possible values are "P" (loadings; default) or "b" (b-coefficients).

coll

Type of collinearity measure. Possible values are "corr" (quantile of correlation coefficients; default) or "eig" (SVD decomposition).

prob

Probability level for the quantile (defaults to 1, i.e. the maximal value is considered).

plot

Logical. If TRUE (default), results are plotted.

xlab

Label for the x-axis of the plot.

ylab

Label for the y-axis of the plot.

print

Logical. If TRUE (default), fitting information is printed.

...

Optional arguments to pass to the function defined in algo.

Value

A list containing the output r; see the examples.

Examples


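library(rnirs)

data(datcass)
Xr <- datcass$Xr
yr <- datcass$yr   # not used in the PCA call below

# PCA (Y = NULL): collinearity of bootstrapped loadings, up to 30 components
ncomp <- 30
selcoll(Xr, ncomp = ncomp, B = 10)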


mlesnoff/rnirs documentation built on April 24, 2023, 4:17 a.m.