cppca: Mallows's Cp for PCA Models

View source: R/cppca.R

cppca R Documentation

Mallows's Cp for PCA Models

Description

Calculation of a Mallows's Cp (Mallows 1973) criterion for PCA models. This function is a work in progress and will probably be modified in the future.

For a model with a components fitted to a matrix X of dimension n x p, function cppca calculates Cp as:

Cp(a) = SSR(a) / N + alpha * dfc(a) * sigma^2 / N

where SSR(a) is the sum of squared residuals of the current PCA model, dfc(a) an estimate of the PCA "corrected model complexity" (see below), sigma^2 the irreducible error variance estimated from a low-bias model, alpha a penalty coefficient, and N = n * p the number of (training) matrix elements.

## Penalty coefficient alpha

Depending on argument type, alpha is equal to either 2 (AIC penalty), 2 * N / (N - df - 1) (small sample size correction AICc penalty) or log(N) (BIC penalty).
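As a small numeric sketch (all values below are hypothetical, not from the package; the AICc penalty in the Description uses df, here stood in by dfc), Cp and the three penalties can be computed directly:

```r
## Hypothetical inputs, for illustration only
n <- 50; p <- 20
N <- n * p                 # number of training matrix elements
ssr <- 120                 # SSR(a): sum of squared residuals of the model
dfc <- 85                  # corrected model complexity dfc(a)
sigma2 <- 0.04             # irreducible error variance sigma^2

## Penalty coefficient alpha for each type
alpha_aic  <- 2
alpha_aicc <- 2 * N / (N - dfc - 1)
alpha_bic  <- log(N)

## Cp with the AIC penalty
cp <- ssr / N + alpha_aic * dfc * sigma2 / N
cp  # 0.1268
```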

## Corrected number of model degrees of freedom dfc

The model complexity of a fitted PCA model of dimension a is known to be df = p + a * (n - 1) + p * a - a^2 (Faber et al. 1994, Faber 2008, Josse & Husson 2012). Monte Carlo estimates of df (Ye 1998, Efron 2004) are consistent with this formula; see for instance the example presented for function dfpca_div.
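The complexity formula above can be written as a one-line helper (a plain restatement of the formula, not the package's internal function; the name dfpca is chosen here for illustration):

```r
## df = p + a * (n - 1) + p * a - a^2 for a PCA model with a components
## fitted to an n x p matrix (Faber et al. 1994)
dfpca <- function(n, p, a) p + a * (n - 1) + p * a - a^2

dfpca(n = 50, p = 20, a = 3)  # 218
```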

Nevertheless and unfortunately, df cannot be used directly in criteria such as Cp, GCV, etc. for model selection. Such a procedure underestimates the prediction error that these criteria normally target, with the consequence of selecting models with too many dimensions. This is the same problem encountered by the row-wise PCA CV.

The underestimation comes from the fact that, in PCA, the new observations to predict, say Xnew, are directly used for computing the predictions Xnew_fit (Xnew_fit = Xnew * P * P'). Objects Xnew and Xnew_fit are therefore not independent, whereas such independence is assumed in the theory that gives Cp, GCV, MSPECV, etc. their good properties. By comparison, in usual supervised prediction models such as LMR or PLSR, objects ynew and ynew_fit are independent.
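This dependence can be seen directly in base R (a minimal sketch: centering is omitted for brevity, and the loadings P come from svd rather than from the package's pca function):

```r
set.seed(1)
X    <- matrix(rnorm(50 * 6), 50, 6)   # training matrix
P    <- svd(X)$v[, 1:2]                # loadings of a 2-component PCA
Xnew <- matrix(rnorm(5 * 6), 5, 6)     # new observations to predict

## Xnew appears on both sides: the "predictions" are a projection of Xnew
Xnew_fit <- Xnew %*% P %*% t(P)
```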

The non-independence between Xnew and Xnew_fit consumes specific degrees of freedom that need to be added to df before using the Cp formula. Let us define the corrected number of degrees of freedom (used finally in the Cp formula) by dfc = df + theta, where theta corresponds to the additional part. We need to estimate dfc (or theta since df is known).

The approach proposed in rnirs (publication to come) uses the nice idea of Hassani et al. 2012 of comparing SSR to SSRCV (where SSRCV is the PRESS returned by a given cross-validation process), but with a different interpretation and justification.

Function cppca estimates dfc from the following rule of thumb:

the ratio N / (N - dfc) is assumed to be approximately equal to the ratio SSRCV / SSR.

It follows that

dfc = N * (1 - SSR / SSRCV)
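Solving the rule of thumb for dfc gives the formula above; a quick numeric check (the SSR and SSRCV values are hypothetical):

```r
N     <- 1000
ssr   <- 120   # SSR from the fitted model
ssrcv <- 150   # SSRCV: the PRESS returned by cross-validation

dfc <- N * (1 - ssr / ssrcv)
dfc              # 200
N / (N - dfc)    # 1.25, equal to ssrcv / ssr as assumed
```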

In our case, the additional part theta in dfc has a different interpretation and justification than in Hassani et al. 2012: theta comes from the non-independence between Xnew and Xnew_fit, not from some correlation cost of computing the PCA loadings and scores (which is, for us, already accounted for in df).

Usage


cppca(X, ncomp, algo = NULL,
  segm,
  type = c("aicc", "aic", "bic"), 
  k = 5,
  print = TRUE, ...)

Arguments

X

A n x p matrix or data frame of training observations.

ncomp

The maximal number of scores (= components = latent variables) to consider.

algo

A PCA algorithm. Default to NULL (see pca).

segm

A list of the test segments. Typically, output of function segmkf or segmts.

type

Type of penalty coefficient. Possible values are "aicc" (default), "aic" or "bic". See Description section.

k

Dimension of the PCA model used for estimating sigma^2.

print

Logical. If TRUE, fitting information is printed.

...

Optional arguments to pass to the function defined in algo.

Value

A list of items. In particular, a data.frame with the estimated Cp and corresponding model weights (so-called "Akaike weights").
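The model weights follow the standard Akaike-weight construction: differences to the best (smallest) criterion value are exponentiated and normalized. A sketch on hypothetical criterion values (the exact scaling cppca applies to Cp before weighting is not shown here):

```r
crit  <- c(0.20, 0.15, 0.16, 0.18)   # hypothetical criterion values, one per model
delta <- crit - min(crit)            # differences to the best model
w     <- exp(-delta / 2) / sum(exp(-delta / 2))  # weights, summing to 1
```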

References

Efron, B., 2004. The Estimation of Prediction Error. Journal of the American Statistical Association 99, 619–632. https://doi.org/10.1198/016214504000000692

Hassani, S., Martens, H., Qannari, E.M., Kohler, A., 2012. Degrees of freedom estimation in Principal Component Analysis and Consensus Principal Component Analysis. Chemometrics and Intelligent Laboratory Systems 118, 246-259. https://doi.org/10.1016/j.chemolab.2012.05.015

Faber, N.M., Buydens, L.M.C., Kateman, G., 1994. Aspects of pseudorank estimation methods based on the eigenvalues of principal component analysis of random matrices. Chemometrics and Intelligent Laboratory Systems 25, 203–226. https://doi.org/10.1016/0169-7439(94)85043-7

Faber, N. (Klaas) M., 2008. Degrees of freedom for the residuals of a principal component analysis — A clarification. Chemometrics and Intelligent Laboratory Systems 93, 80–86. https://doi.org/10.1016/j.chemolab.2008.04.006

Josse, J., Husson, F., 2012. Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics & Data Analysis 56, 1869–1879. https://doi.org/10.1016/j.csda.2011.11.012

Mallows, C.L., 1973. Some Comments on Cp. Technometrics 15, 661–675. https://doi.org/10.1080/00401706.1973.10489103

Ye, J., 1998. On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, 120–131. https://doi.org/10.1080/01621459.1998.10474094

Examples


data(datoctane)
X <- datoctane$X
## removing outliers
zX <- X[-c(25:26, 36:39), ]
n <- nrow(zX)
p <- ncol(zX)
N <- n * p
plotsp(zX)

K <- 5
segm <- segmkf(n = n, K = K, nrep = 1, seed = 1)
ncomp <- 15
fm <- cppca(zX, ncomp = ncomp, 
  segm = segm,
  type = "aicc"
  )
names(fm)
z <- fm$res
u <- selwold(z$crit[-1], start = 1,
             xlab = "Nb. components", main = "Cp", alpha = 0)


mlesnoff/rnirs documentation built on April 24, 2023, 4:17 a.m.