Heuristic selection of the dimension of a PCA or PLS model using angles between bootstrapped loading matrices


The function helps select the dimension (i.e. the number of components) of a PCA or PLS model by bootstrapping the observations and examining the stability of the loading matrix P. Stability is quantified by angles between P and its bootstrapped counterparts.

The general idea was proposed by Ye & Weiss (2003) for sliced inverse regression and applied to PCA by Luo & Li (2016). The loading matrix P (with A columns in total, i.e. A loading vectors) is computed on the raw matrix X. Then, a nonparametric bootstrap is run on the rows of X, and a loading matrix P(b), also with A columns, is computed for each bootstrap replication b = 1, ..., B.

For a given model dimension a <= A, an "angle" is then calculated between the raw matrix P and each matrix P(b), considering only the first a columns of each. The stability indicator for a matrix P with a loading vectors is the mean of the B angles.

The higher the mean angle (meaning that the compared matrices do not span the same space), the lower the stability of matrix P, whose last columns probably carry large uncertainty.
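The procedure can be sketched in base R. This is a minimal illustration under stated assumptions, not the code of selangle: the helpers pca_loadings and max_angle are hypothetical names, the PCA is done with svd on the column-centered matrix, the data are simulated, and the angle is taken as the largest principal angle between the two column spaces, divided by pi / 2 so that it lies between 0 and 1.

```r
# Hypothetical helper: first A PCA loading vectors (p x A) of the centered matrix
pca_loadings <- function(X, A) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  svd(Xc, nu = 0, nv = A)$v
}

# Hypothetical helper: largest principal angle (radians) between the column
# spaces of two matrices with orthonormal columns
max_angle <- function(P1, P2) {
  d <- svd(crossprod(P1, P2))$d   # cosines of the principal angles
  acos(min(1, min(d)))            # guard against rounding slightly above 1
}

set.seed(1)
n <- 100
p <- 10
A <- 5
B <- 20

# Simulated data with 2 dominant directions plus noise (illustration only)
S <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * p), 2, p) * 5
X <- S + matrix(rnorm(n * p), n, p)

P <- pca_loadings(X, A)           # loadings computed on the raw data
ang <- matrix(NA_real_, B, A)     # one row per replication, one column per dimension
for (b in seq_len(B)) {
  s <- sample(n, replace = TRUE)  # nonparametric bootstrap of the rows
  Pb <- pca_loadings(X[s, , drop = FALSE], A)
  for (a in seq_len(A)) {
    ang[b, a] <- max_angle(P[, 1:a, drop = FALSE], Pb[, 1:a, drop = FALSE])
  }
}

r <- colMeans(ang) / (pi / 2)     # mean angle per dimension, scaled to [0, 1]
round(r, 2)
```

In this sketch, small values of r indicate that the first a loading vectors span nearly the same space across bootstrap replications.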

Two measures of angle are available, depending on argument angle:

1) Default ("maxsub"): the "maxsub" angle (see Krzanowski 1979, Hubert et al. 2005 and Engelen et al. 2005).

2) "hot": the vector correlation coefficient "q" (Hotelling 1936), used by Ye & Weiss (2003) and Luo & Li (2016).

Print the function rnirs::.corvec to see the formulas.
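As a rough illustration of the two quantities, the sketch below computes them in base R for two small matrices with orthonormal columns. It assumes that the "maxsub" measure corresponds to the largest principal angle between the two column spaces and that q is the product of the cosines of the principal angles (Hotelling's vector correlation for orthonormal bases). It is not the code of .corvec, and how q is converted into an angle is left out because it is not described here.

```r
# Two 4 x 2 matrices with orthonormal columns (illustration only)
P1 <- diag(4)[, 1:2]                      # spans e1 and e2
th <- pi / 6
P2 <- cbind(c(1, 0, 0, 0),
            c(0, cos(th), sin(th), 0))    # second vector rotated by 30 degrees

d <- svd(crossprod(P1, P2))$d             # cosines of the principal angles

theta_max <- acos(min(1, min(d)))         # largest principal angle, in radians
q <- prod(d)                              # vector correlation coefficient

c(theta_max = theta_max, q = q)
```

When both matrices span the same space, all cosines equal 1, so the largest angle is 0 and q is 1.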

Angles are first computed in radians (right angle = pi / 2) and then divided by pi / 2, so that they vary between 0 (maximal stability) and 1 (minimal stability).
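A short worked example of this rescaling, using arbitrary angles chosen for illustration:

```r
theta <- c(0, pi / 6, pi / 4, pi / 2)  # example angles, in radians
r <- theta / (pi / 2)                  # 0 = same space, 1 = right angle
r
```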

Jumps in the curve of the mean angle, followed by regular patterns, are also informative.


    selangle(X, Y = NULL, ncomp = NULL, algo = NULL,
        B = 50, seed = NULL,
        angle = c("maxsub", "hot"),
        plot = TRUE,
        xlab = "Nb. components", ylab = NULL,
        print = TRUE,
        ...)



X: An n x p matrix or data frame of variables.

Y: For PLS, an n x q matrix or data frame, or a vector of length n, of responses. If NULL (default), a PCA is implemented.

ncomp: The maximal number of PCA or PLS scores (= components = latent variables) to be calculated.

algo: For PCA, a function (algorithm) implementing a PCA. Defaults to NULL: if n < p, pca_eigenk is used; otherwise, pca_eigen is used. For PLS, a function implementing a PLS. Defaults to NULL (pls_kernel is used).

B: Number of bootstrap replications.

seed: An integer defining the seed for the random simulation, or NULL (default). See set.seed.

angle: Type of angle. Possible values are "maxsub" (default) or "hot" (q of Hotelling).

plot: Logical. If TRUE (default), results are plotted.

xlab: Label for the x-axis of the plot.

ylab: Label for the y-axis of the plot.

print: Logical. If TRUE, fitting information is printed.

...: Optional arguments to pass to the function defined in algo.


A list containing r, the vector of standardized mean angles (one value per number of components).


Engelen, S., Hubert, M., Branden, K.V., 2005. A Comparison of Three Procedures for Robust PCA in High Dimensions. Austrian Journal of Statistics 34, 117-126.

Hubert, M., Rousseeuw, P.J., Vanden Branden, K., 2005. ROBPCA: A New Approach to Robust Principal Component Analysis. Technometrics 47, 64-79.

Krzanowski, W.J., 1979. Between-Groups Comparison of Principal Components. Journal of the American Statistical Association 74, 703-707.

Luo, W., Li, B., 2016. Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875-887.

Ye, Z., Weiss, R.E., 2003. Using the Bootstrap to Select One of a New Class of Dimension Reduction Methods. Journal of the American Statistical Association 98, 968-979.


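# Predictors (Xr) and response (yr) from the datcass dataset
Xr <- datcass$Xr
yr <- datcass$yr

# Y is supplied, so a PLS is fitted; up to 30 components, 10 bootstrap replications
ncomp <- 30
selangle(Xr, yr, ncomp = ncomp, B = 10)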

