cppca | R Documentation |

Calculation of a Mallows's `Cp`

(Mallows 1973) criterion for PCA models. This function will probably be modified in the future (this is a work in progress).

For a model with `a`

components and a matrix `X`

of dimension `n x p`

, function `cppca`

calculates `Cp`

by:

`Cp(a) = SSR(a) / N + alpha * dfc(a) * sigma^2 / N`

where `SSR(a)`

is the sum of squared residuals for the current PCA model, `dfc(a)`

an estimation of the PCA "corrected model complexity" (see thereafter), `sigma^2`

the irreductible error variance estimated from a low biased model, `alpha`

a penalty coefficient, and `N = n * p`

the number of (training) matrix elements.

**## Penalty coefficient alpha**

Depending on argument `type`

, `alpha`

is equal to either `2`

(AIC penalty), `2 * N / (N - df - 1)`

(small sample size correction AICc penalty) or `log(N)`

(BIC penalty).

**## Corrected nb. model's degrees of freedom dfc**

The model complexity of a fitted PCA model of dimension `a`

is known to be `df = p + a * (n - 1) + p * a - a^2`

(Faber et al. 1994, Faber 2012, Josse & Husson 2012). Monte Carlo estimates (Ye 1998, Efron 2004) of `df`

are consistent with this previous formula; see for instance the example presented for function `dfpca_div`

.

Nevertheless and unfortunately, `df`

can not be used directly in criteria such as `Cp`

, `GCV`

etc. for model selection. Such a procedure underestimates the prediction error that is normally targeted by these criteria, with the consequence of selecting models with too large dimensions. This is the same problem that encounters the row-wise PCA CV.

The underestimation comes from the fact that, in PCA, the new observations to predict, say `Xnew`

, are directly used for computing the predictions `Xnew_fit`

(`Xnew_fit = Xnew * P * P'`

). Objects `Xnew`

and `Xnew_fit`

are therefore not independent, while such an independence is assumed in the theory leading to the good properties of `Cp`

, `GCV`

, `MSPEPCV`

, etc. As a comparison, in usual supervised prediction models such as LMR or PLSR models, objects `ynew`

and `ynew_fit`

are independent.

The non-independence between `Xnew`

and `Xnew_fit`

consumes specific degrees of freedom that need to be added to `df`

before using the Cp formula. Let us define the **corrected** number of degrees of freedom (used finally in the `Cp`

formula) by `dfc = df + theta`

, where `theta`

corresponds to the additional part. We need to estimate `dfc`

(or `theta`

since `df`

is known).

The approach propoed in rnirs (publication should come) uses the nice idea of Hassani et al. 2012 that compared SSR to SSRCV (where `SSRCV`

is the `PRESS`

returned by a given cross-validation process), but with a different interpretation and justification.

Function `cppca`

estimates `dfc`

from the following rule of thumb:

the ratio `N / (N - dfc)`

is assumed being approximately equal to the ratio `SSRCV / SSR`

.

Following this, it comes that

`dfc = N * (1 - SSR / SSRCV)`

In our case, the justification and interpretation of the additional part `theta`

in `dfc`

has different interpretation and justification than in Hassani et al. 2012: `theta`

comes from the non-independence between `Xnew`

and `Xnew_fit`

, not from some correlation cost of computing PCA loadings ans scores (which is, for us, already accounted for in `df`

).

```
cppca(X, ncomp, algo = NULL,
segm,
type = c("aicc", "aic", "bic"),
k = 5,
print = TRUE, ...)
```

`X` |
A |

`ncomp` |
The maximal number of scores (= components = latent variables) to consider. |

`algo` |
a PCA algorithm. Default to |

`segm` |
A list of the test segments. Typically, output of function |

`type` |
Type of penalty coefficient. Possible values are |

`k` |
Dimension of the PCA model used for estimating |

`print` |
Logical. If |

`...` |
Optionnal arguments to pass in the function defined in |

A list of items. In particular, a data.frame with the estimated `Cp`

and corresponding model weights (so-called "Akaike weights").

Efron, B., 2004. The Estimation of Prediction Error. Journal of the American Statistical Association 99, 619â632. https://doi.org/10.1198/016214504000000692

Hassani, S., Martens, H., Qannari, E.M., Kohler, A., 2012. Degrees of freedom estimation in Principal Component Analysis and Consensus Principal Component Analysis. Chemometrics and Intelligent Laboratory Systems 118, 246-259. https://doi.org/10.1016/j.chemolab.2012.05.015

Faber, N.M., Buydens, L.M.C., Kateman, G., 1994. Aspects of pseudorank estimation methods based on the eigenvalues of principal component analysis of random matrices. Chemometrics and Intelligent Laboratory Systems 25, 203â226. https://doi.org/10.1016/0169-7439(94)85043-7

Faber, N. (Klaas) M., 2008. Degrees of freedom for the residuals of a principal component analysis â A clarification. Chemometrics and Intelligent Laboratory Systems 93, 80â86. https://doi.org/10.1016/j.chemolab.2008.04.006

Josse, J., Husson, F., 2012. Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics & Data Analysis 56, 1869â1879. https://doi.org/10.1016/j.csda.2011.11.012

Mallows, C.L., 1973. Some Comments on Cp. Technometrics 15, 661â675. https://doi.org/10.1080/00401706.1973.10489103

Ye, J., 1998. On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, 120â131. https://doi.org/10.1080/01621459.1998.10474094

```
data(datoctane)
X <- datoctane$X
## removing outliers
zX <- X[-c(25:26, 36:39), ]
n <- nrow(zX)
p <- ncol(zX)
N <- n * p
plotsp(zX)
K <- 5
segm <- segmkf(n = n, K = K, nrep = 1, seed = 1)
ncomp <- 15
fm <- cppca(zX, ncomp = ncomp,
segm = segm,
type = "aicc"
)
names(fm)
z <- fm$res
u <- selwold(z$crit[-1], start = 1,
xlab = "Nb. components", main = "Cp", alpha = 0)
```

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.