estim_ncpPCA | R Documentation |
Estimate the number of dimensions for the Principal Component Analysis by cross-validation
estim_ncpPCA(X, ncp.min = 0, ncp.max = 5, method = c("Regularized","EM"),
scale = TRUE, method.cv = c("gcv","loo","Kfold"), nbsim = 100,
pNA = 0.05, ind.sup=NULL, quanti.sup=NULL, quali.sup=NULL,
threshold=1e-4, verbose = TRUE)
X |
a data.frame with continuous variables; with missing entries or not |
ncp.min |
integer corresponding to the minimum number of components to test |
ncp.max |
integer corresponding to the maximum number of components to test |
method |
"Regularized" by default or "EM" |
scale |
boolean. TRUE gives the same weight to each variable |
method.cv |
string with the values "gcv" for generalised cross-validation, "loo" for leave-one-out or "Kfold" cross-validation |
nbsim |
number of simulations, useful only if method.cv="Kfold" |
pNA |
proportion of missing values inserted in the data set (the default 0.05 corresponds to 5%), useful only if method.cv="Kfold" |
ind.sup |
a vector indicating the indexes of the supplementary individuals |
quanti.sup |
a vector indicating the indexes of the quantitative supplementary variables |
quali.sup |
a vector indicating the indexes of the categorical supplementary variables |
threshold |
the threshold for assessing convergence |
verbose |
boolean. TRUE means that a progress bar is displayed |
For leave-one-out (loo) cross-validation, each cell of the data matrix is removed in turn and predicted with a PCA model using ncp.min to ncp.max dimensions. The number of components that leads to the smallest mean square error of prediction (MSEP) is retained.
For Kfold cross-validation, a proportion pNA of missing values is inserted and predicted with a PCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times. The number of components that leads to the smallest MSEP is retained.
For both cross-validation methods, missing entries are predicted with the imputePCA function, that is, with the regularized iterative PCA algorithm (method="Regularized") or the iterative PCA algorithm (method="EM"). The regularized version is more appropriate when the dataset already contains many missing values, because it limits overfitting.
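The Kfold procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by the package: impute_pca_em and kfold_msep are hypothetical helper names, the imputation is a plain (non-regularized) iterative PCA on centered, unscaled data, and the input is assumed to be a complete numeric matrix.

```r
# Illustrative sketch only; NOT the missMDA implementation.
# impute_pca_em() is a hypothetical helper: plain iterative PCA imputation
# (no regularization, no scaling) on column-centered data.
impute_pca_em <- function(X, ncp, threshold = 1e-4, maxiter = 1000) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Initialise missing cells with the column means of the observed cells
  Xhat <- X
  col_means <- colMeans(X, na.rm = TRUE)
  Xhat[miss] <- col_means[col(X)[miss]]
  if (ncp == 0 || !any(miss)) return(Xhat)
  for (iter in seq_len(maxiter)) {
    m <- colMeans(Xhat)
    s <- svd(sweep(Xhat, 2, m), nu = ncp, nv = ncp)
    # Rank-ncp reconstruction, moved back to the original column means
    fit <- s$u %*% (s$d[seq_len(ncp)] * t(s$v))
    fit <- sweep(fit, 2, m, "+")
    delta <- sum((fit[miss] - Xhat[miss])^2)
    Xhat[miss] <- fit[miss]
    if (delta < threshold) break
  }
  Xhat
}

# Hypothetical helper: hide a proportion pNA of cells, predict them for each
# number of components, repeat nbsim times, and average the squared errors.
kfold_msep <- function(X, ncp.min = 0, ncp.max = 3, nbsim = 20, pNA = 0.05) {
  X <- as.matrix(X)
  ncps <- ncp.min:ncp.max
  err <- matrix(NA_real_, nrow = nbsim, ncol = length(ncps),
                dimnames = list(NULL, as.character(ncps)))
  for (sim in seq_len(nbsim)) {
    hidden <- sample(length(X), size = max(1, round(pNA * length(X))))
    Xna <- X
    Xna[hidden] <- NA
    for (k in seq_along(ncps)) {
      Xhat <- impute_pca_em(Xna, ncps[k])
      err[sim, k] <- mean((Xhat[hidden] - X[hidden])^2)
    }
  }
  crit <- colMeans(err)  # MSEP for each number of components
  list(ncp = ncps[which.min(crit)], criterion = crit)
}

# Simulated data with a two-dimensional structure plus noise (made up for the demo)
set.seed(1)
n <- 50; p <- 8
X <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * p), 2, p) +
  matrix(rnorm(n * p, sd = 0.1), n, p)
res <- kfold_msep(X, ncp.min = 0, ncp.max = 4, nbsim = 10, pNA = 0.05)
res$ncp
res$criterion
```

The sketch returns a list with the same two element names as the Value section (ncp and criterion), so the two can be read side by side. Leave-one-out follows the same pattern with exactly one cell hidden at a time.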
Cross-validation (especially method.cv="loo") is time-consuming. The generalised cross-validation criterion (method.cv="gcv") can be seen as an approximation of the loo cross-validation criterion which provides a straightforward way to estimate the number of dimensions without resorting to a computationally intensive method.
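To make the cost difference concrete, the following sketch computes a GCV-type criterion from a single SVD on complete data. It assumes the criterion has the form n p sum((x - xhat)^2) / ((n - 1) p - q (n + p - q - 1))^2 for q components, as given in the Josse and Husson reference listed on this page; gcv_pca is a hypothetical helper name, and the handling of missing values and regularization done by the package is left out.

```r
# Illustrative sketch only; complete data, no missing values handled.
# gcv_pca() is a hypothetical helper, not a missMDA function.
gcv_pca <- function(X, ncp.min = 0, ncp.max = 3, scale = TRUE) {
  X <- scale(as.matrix(X), center = TRUE, scale = scale)
  n <- nrow(X); p <- ncol(X)
  s <- svd(X)  # one decomposition serves every candidate number of components
  ncps <- ncp.min:ncp.max
  crit <- sapply(ncps, function(q) {
    fit <- if (q == 0) {
      matrix(0, n, p)  # centered data: the rank-0 fit is the column means
    } else {
      s$u[, seq_len(q), drop = FALSE] %*%
        (s$d[seq_len(q)] * t(s$v[, seq_len(q), drop = FALSE]))
    }
    # Residual sum of squares penalised by the remaining degrees of freedom
    denom <- (n - 1) * p - q * (n + p - q - 1)
    n * p * sum((X - fit)^2) / denom^2
  })
  names(crit) <- ncps
  list(ncp = ncps[which.min(crit)], criterion = crit)
}

# Simulated data with a two-dimensional structure plus noise (made up for the demo)
set.seed(1)
n <- 50; p <- 8
X <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * p), 2, p) +
  matrix(rnorm(n * p, sd = 0.1), n, p)
g <- gcv_pca(X, ncp.min = 0, ncp.max = 4)
g$ncp
g$criterion
```

No cell is removed and no model is refitted here, which is why this route is much cheaper than loo or Kfold.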
The argument scale has to be chosen in agreement with the PCA that will be performed. If one wants to perform a normed PCA (where the variables are centered and scaled, i.e. divided by their standard deviation), then scale has to be set to TRUE.
ncp |
the number of components retained for the PCA |
criterion |
the criterion (the MSEP) calculated for each number of components |
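A typical way to use the returned list is to pass ncp on to imputePCA. The sketch below assumes the missMDA package is installed and uses its orange data set; the guard with requireNamespace is only there so that the snippet does nothing when the package is absent.

```r
# Sketch assuming missMDA is installed; skipped otherwise.
if (requireNamespace("missMDA", quietly = TRUE)) {
  library(missMDA)
  data(orange)
  nb <- estim_ncpPCA(orange, ncp.min = 0, ncp.max = 4, method.cv = "gcv")
  nb$ncp        # number of components with the smallest criterion
  nb$criterion  # one criterion value per number of components tested
  # Impute the data set with the estimated number of components
  res.imp <- imputePCA(orange, ncp = nb$ncp)
  head(res.imp$completeObs)
}
```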
Francois Husson francois.husson@institut-agro.fr and Julie Josse julie.josse@polytechnique.edu
Bro, R., Kjeldahl, K., Smilde, A. K. and Kiers, H. A. L. (2008). Cross-validation of component models: A critical look at current methods. Analytical and Bioanalytical Chemistry. 390 (5), pp. 1241-1251.
Josse, J. and Husson, F. (2011). Selecting the number of components in PCA using cross-validation approximations. Computational Statistics and Data Analysis. 56 (6), pp. 1869-1879.
imputePCA
## Not run:
library(missMDA)
data(orange)
# Test 0 to 4 components with the default criterion (method.cv = "gcv")
nb <- estim_ncpPCA(orange, ncp.min = 0, ncp.max = 4)
## End(Not run)