cppca | R Documentation
Calculation of Mallows's Cp (Mallows 1973) criterion for PCA models. This function will probably be modified in the future (it is a work in progress).
For a model with a components and a matrix X of dimension n x p, function cppca calculates Cp by:
Cp(a) = SSR(a) / N + alpha * dfc(a) * sigma^2 / N
where SSR(a) is the sum of squared residuals for the current PCA model, dfc(a) is an estimate of the PCA "corrected model complexity" (see below), sigma^2 is the irreducible error variance estimated from a low-bias model, alpha is a penalty coefficient, and N = n * p is the number of (training) matrix elements.
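As an illustration, the formula maps directly to R. A minimal sketch with hypothetical names (ssr, dfc and sigma2 are assumed to be estimated beforehand; this is not cppca's internal code):

cp_value <- function(ssr, dfc, sigma2, alpha, N) {
  # fit term (scaled SSR) + complexity penalty
  ssr / N + alpha * dfc * sigma2 / N
}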
## Penalty coefficient alpha
Depending on argument type, alpha is equal to either 2 (AIC penalty), 2 * N / (N - df - 1) (AICc penalty, a small-sample-size correction) or log(N) (BIC penalty).
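As an illustration, these three penalties can be sketched as below (hypothetical helper, not cppca's internal code):

penalty_alpha <- function(type = c("aicc", "aic", "bic"), N, df) {
  type <- match.arg(type)
  switch(type,
         aic  = 2,                     # AIC penalty
         aicc = 2 * N / (N - df - 1),  # AICc: small-sample-size correction
         bic  = log(N))                # BIC penalty
}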
## Corrected number of degrees of freedom dfc
The model complexity of a fitted PCA model of dimension a is known to be df = p + a * (n - 1) + p * a - a^2 (Faber et al. 1994, Faber 2008, Josse & Husson 2012). Monte Carlo estimates (Ye 1998, Efron 2004) of df are consistent with this formula; see for instance the example presented for function dfpca_div.
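This theoretical complexity is immediate to compute; a minimal sketch (hypothetical function name):

df_pca <- function(n, p, a) {
  # theoretical complexity of a PCA model of dimension a
  p + a * (n - 1) + p * a - a^2
}
df_pca(n = 50, p = 100, a = 3)   # example values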
Unfortunately, df cannot be used directly in criteria such as Cp, GCV, etc. for model selection. Such a procedure underestimates the prediction error that these criteria normally target, with the consequence of selecting models with too many dimensions. This is the same problem encountered by row-wise PCA cross-validation.
The underestimation comes from the fact that, in PCA, the new observations to predict, say Xnew, are directly used for computing the predictions Xnew_fit (Xnew_fit = Xnew * P * P'). Objects Xnew and Xnew_fit are therefore not independent, whereas such independence is assumed in the theory that gives Cp, GCV, MSPECV, etc. their good properties. In comparison, in usual supervised prediction models such as LMR or PLSR, objects ynew and ynew_fit are independent.
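The mechanism can be made concrete with a toy sketch on simulated data (loadings computed with base R's svd, which may differ from the algorithm used by cppca):

set.seed(1)
X    <- matrix(rnorm(20 * 5), 20, 5)   # training matrix (n = 20, p = 5)
Xnew <- matrix(rnorm(4 * 5), 4, 5)     # new observations to "predict"
a    <- 2                              # model dimension
P <- svd(scale(X, scale = FALSE))$v[, 1:a]   # p x a loadings
Xnew_c   <- scale(Xnew, center = colMeans(X), scale = FALSE)
Xnew_fit <- Xnew_c %*% P %*% t(P)      # Xnew appears on both sides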
The non-independence between Xnew and Xnew_fit consumes specific degrees of freedom that need to be added to df before using the Cp formula. Let us define the corrected number of degrees of freedom (used finally in the Cp formula) by dfc = df + theta, where theta corresponds to this additional part. We need to estimate dfc (or theta, since df is known).
The approach proposed in rnirs (a publication is forthcoming) uses the nice idea of Hassani et al. 2012 of comparing SSR to SSRCV (where SSRCV is the PRESS returned by a given cross-validation process), but with a different interpretation and justification.
Function cppca estimates dfc from the following rule of thumb: the ratio N / (N - dfc) is assumed to be approximately equal to the ratio SSRCV / SSR. It follows that

dfc = N * (1 - SSR / SSRCV)
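As a sketch, with SSR and SSRCV already computed (SSRCV being the PRESS returned by the cross-validation; hypothetical helper name):

dfc_est <- function(ssr, ssr_cv, N) {
  # rule-of-thumb corrected complexity: N * (1 - SSR / SSRCV)
  N * (1 - ssr / ssr_cv)
}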
In our case, however, the additional part theta in dfc has a different interpretation and justification than in Hassani et al. 2012: theta comes from the non-independence between Xnew and Xnew_fit, not from some correlation cost of computing the PCA loadings and scores (which is, for us, already accounted for in df).
## Usage

cppca(X, ncomp, algo = NULL,
  segm,
  type = c("aicc", "aic", "bic"),
  k = 5,
  print = TRUE, ...)
## Arguments

X | The n x p data matrix on which the PCA models are fitted.
ncomp | The maximal number of scores (= components = latent variables) to consider.
algo | A PCA algorithm. Default to NULL.
segm | A list of the test segments. Typically, the output of function segmkf (see the examples).
type | Type of penalty coefficient. Possible values are "aicc" (default), "aic" or "bic".
k | Dimension of the PCA model used for estimating sigma^2 (the low-bias model). Default to 5.
print | Logical. If TRUE (default), some information is printed during the calculations.
... | Optional arguments to pass to the function defined in algo.
## Value

A list of items. In particular, a data.frame with the estimated Cp values and the corresponding model weights (so-called "Akaike weights").
## References

Efron, B., 2004. The Estimation of Prediction Error. Journal of the American Statistical Association 99, 619–632. https://doi.org/10.1198/016214504000000692

Faber, N.M., Buydens, L.M.C., Kateman, G., 1994. Aspects of pseudorank estimation methods based on the eigenvalues of principal component analysis of random matrices. Chemometrics and Intelligent Laboratory Systems 25, 203–226. https://doi.org/10.1016/0169-7439(94)85043-7

Faber, N.M., 2008. Degrees of freedom for the residuals of a principal component analysis – A clarification. Chemometrics and Intelligent Laboratory Systems 93, 80–86. https://doi.org/10.1016/j.chemolab.2008.04.006

Hassani, S., Martens, H., Qannari, E.M., Kohler, A., 2012. Degrees of freedom estimation in Principal Component Analysis and Consensus Principal Component Analysis. Chemometrics and Intelligent Laboratory Systems 118, 246–259. https://doi.org/10.1016/j.chemolab.2012.05.015

Josse, J., Husson, F., 2012. Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics & Data Analysis 56, 1869–1879. https://doi.org/10.1016/j.csda.2011.11.012

Mallows, C.L., 1973. Some Comments on Cp. Technometrics 15, 661–675. https://doi.org/10.1080/00401706.1973.10489103

Ye, J., 1998. On Measuring and Correcting the Effects of Data Mining and Model Selection. Journal of the American Statistical Association 93, 120–131. https://doi.org/10.1080/01621459.1998.10474094
## Examples

library(rnirs)

data(datoctane)
X <- datoctane$X
## removing outliers
zX <- X[-c(25:26, 36:39), ]
n <- nrow(zX)
p <- ncol(zX)
N <- n * p
plotsp(zX)
K <- 5
segm <- segmkf(n = n, K = K, nrep = 1, seed = 1)
ncomp <- 15
fm <- cppca(zX, ncomp = ncomp,
segm = segm,
type = "aicc"
)
names(fm)
z <- fm$res
u <- selwold(z$crit[-1], start = 1,
xlab = "Nb. components", main = "Cp", alpha = 0)