| estim_ncpPCA | R Documentation | 
Estimate the number of dimensions for the Principal Component Analysis by cross-validation
estim_ncpPCA(X, ncp.min = 0, ncp.max = 5, method = c("Regularized", "EM"),
             scale = TRUE, method.cv = c("gcv", "loo", "Kfold"), nbsim = 100,
             pNA = 0.05, ind.sup = NULL, quanti.sup = NULL, quali.sup = NULL,
             threshold = 1e-4, verbose = TRUE)
| X | a data.frame with continuous variables, with or without missing entries | 
| ncp.min | integer corresponding to the minimum number of components to test | 
| ncp.max | integer corresponding to the maximum number of components to test | 
| method | "Regularized" by default or "EM" | 
| scale | boolean. TRUE gives the same weight to each variable | 
| method.cv | string, one of "gcv" (generalised cross-validation, the default), "loo" (leave-one-out cross-validation) or "Kfold" (Kfold cross-validation) | 
| nbsim | number of simulations, used only if method.cv="Kfold" | 
| pNA | proportion of missing values inserted into the data set (the default 0.05 corresponds to 5%), used only if method.cv="Kfold" | 
| ind.sup | a vector indicating the indexes of the supplementary individuals | 
| quanti.sup | a vector indicating the indexes of the quantitative supplementary variables | 
| quali.sup | a vector indicating the indexes of the categorical supplementary variables | 
| threshold | the threshold for assessing convergence | 
| verbose | boolean. TRUE means that a progress bar is displayed | 
For leave-one-out (loo) cross-validation, each cell of the data matrix is removed in turn and predicted with a PCA model using ncp.min to ncp.max dimensions. The number of components that leads to the smallest mean square error of prediction (MSEP) is retained.
For the Kfold cross-validation, a proportion pNA of the cells is set to missing and then predicted with a PCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times. The number of components that leads to the smallest MSEP is retained.
For both cross-validation methods, missing entries are predicted using the imputePCA function, that is, using the regularized iterative PCA algorithm (method="Regularized") or the iterative PCA algorithm (method="EM"). The regularized version is more appropriate when the dataset already contains many missing values, because it limits overfitting.
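The Kfold procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the missMDA implementation: the helpers impute_iter_pca and kfold_ncp are hypothetical names, the imputation step is a plain (unregularized) iterative PCA standing in for imputePCA, the data are standardized once up front, and no safeguard is included for the case where a whole column ends up missing.

```r
# Hypothetical helper: plain iterative PCA imputation (EM-style, no regularization).
impute_iter_pca <- function(X, ncp, threshold = 1e-4, maxiter = 1000) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Initialise missing cells with the mean of the observed values in their column
  Xhat <- X
  col_means <- colMeans(X, na.rm = TRUE)
  Xhat[miss] <- col_means[col(X)[miss]]
  if (ncp == 0 || !any(miss)) return(Xhat)
  k <- seq_len(ncp)
  for (iter in seq_len(maxiter)) {
    # Fit a rank-ncp PCA on the current completed matrix
    m <- colMeans(Xhat)
    s <- svd(sweep(Xhat, 2, m))
    fit <- s$u[, k, drop = FALSE] %*% diag(s$d[k], ncp) %*% t(s$v[, k, drop = FALSE])
    fit <- sweep(fit, 2, m, "+")
    # Replace the missing cells by their fitted values and check convergence
    change <- sum((Xhat[miss] - fit[miss])^2)
    Xhat[miss] <- fit[miss]
    if (change < threshold) break
  }
  Xhat
}

# Hypothetical helper: repeat "hide a proportion pNA of cells, predict them, score them".
kfold_ncp <- function(X, ncp.min = 0, ncp.max = 3, nbsim = 20, pNA = 0.05) {
  X <- scale(as.matrix(X))                 # standardise (NA cells are ignored)
  observed <- which(!is.na(X))
  n_remove <- max(1, round(pNA * length(observed)))
  msep <- matrix(NA_real_, nbsim, ncp.max - ncp.min + 1)
  for (sim in seq_len(nbsim)) {
    held_out <- sample(observed, n_remove) # cells hidden in this simulation
    Xna <- X
    Xna[held_out] <- NA
    for (j in seq_len(ncol(msep))) {
      Xhat <- impute_iter_pca(Xna, ncp = ncp.min + j - 1)
      msep[sim, j] <- mean((X[held_out] - Xhat[held_out])^2)
    }
  }
  criterion <- colMeans(msep)              # MSEP averaged over the simulations
  names(criterion) <- ncp.min:ncp.max
  list(ncp = as.integer(names(which.min(criterion))), criterion = criterion)
}

# Simulated data with two underlying dimensions plus a little noise
set.seed(1)
scores <- matrix(rnorm(40 * 2), 40, 2)
loadings <- matrix(rnorm(2 * 6), 2, 6)
X <- scores %*% loadings + matrix(rnorm(40 * 6, sd = 0.1), 40, 6)
res <- kfold_ncp(X, ncp.min = 0, ncp.max = 3, nbsim = 10)
res$criterion
```

Leave-one-out cross-validation follows the same pattern with a single cell hidden at a time and every observed cell visited once, which is why it is so much slower.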
Cross-validation (especially method.cv="loo") is time-consuming. The generalised cross-validation criterion (method.cv="gcv") can be seen as an approximation of the loo cross-validation criterion which provides a straightforward way to estimate the number of dimensions without resorting to a computationally intensive method. 
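To give an idea of why the gcv criterion is cheap, the base R sketch below computes a GCV-type criterion from a single SVD. It assumes a complete data matrix, a normed PCA, and that the criterion for S components takes the form n*p*RSS / (n*p - p - n*S - p*S + S^2 + S)^2, where RSS is the residual sum of squares of the rank-S fit. The helper gcv_sketch is a hypothetical name, and the handling of missing values and regularization done by estim_ncpPCA is deliberately left out.

```r
# Hypothetical helper (not part of missMDA): GCV-type criterion on complete data.
gcv_sketch <- function(X, ncp.min = 0, ncp.max = 5) {
  X <- scale(as.matrix(X))   # normed PCA: centre and scale each column
  n <- nrow(X)
  p <- ncol(X)
  s <- svd(X)                # one decomposition serves every value of S
  crit <- sapply(ncp.min:ncp.max, function(S) {
    if (S == 0) {
      fit <- matrix(0, n, p) # no component: predict the (zero) column means
    } else {
      k <- seq_len(S)
      fit <- s$u[, k, drop = FALSE] %*% diag(s$d[k], S) %*% t(s$v[, k, drop = FALSE])
    }
    rss <- sum((X - fit)^2)
    # Assumed denominator: n*p - p - n*S - p*S + S^2 + S (must stay positive)
    n * p * rss / ((n - 1) * p - S * (n + p - S - 1))^2
  })
  names(crit) <- ncp.min:ncp.max
  crit
}

# Simulated data with two underlying dimensions plus a little noise
set.seed(1)
scores <- matrix(rnorm(50 * 2), 50, 2)
loadings <- matrix(rnorm(2 * 8), 2, 8)
X <- scores %*% loadings + matrix(rnorm(50 * 8, sd = 0.1), 50, 8)
g <- gcv_sketch(X, ncp.min = 0, ncp.max = 4)
g
```

No cell is removed and no model is refitted, so the cost is that of one SVD rather than one imputation per hidden cell.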
The argument scale has to be chosen in agreement with the PCA that will be performed. If one wants to perform a normed PCA (where the variables are centered and scaled, i.e. divided by their standard deviation), then scale has to be set to TRUE.
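As an illustration of keeping scale consistent across the steps, the sketch below estimates the number of components, imputes with imputePCA and then runs a normed PCA. It assumes that the missMDA and FactoMineR packages are installed and that the final analysis uses FactoMineR::PCA, whose scale.unit argument plays the role of scale; the requireNamespace guard is only there so that the snippet does nothing when the packages are absent.

```r
# Assumes the missMDA and FactoMineR packages are installed
if (requireNamespace("missMDA", quietly = TRUE) &&
    requireNamespace("FactoMineR", quietly = TRUE)) {
  library(missMDA)
  data(orange)
  # Same choice (scaled variables) at every step
  nb <- estim_ncpPCA(orange, ncp.min = 0, ncp.max = 4, scale = TRUE)
  imp <- imputePCA(orange, ncp = nb$ncp, scale = TRUE)
  res.pca <- FactoMineR::PCA(imp$completeObs, scale.unit = TRUE, graph = FALSE)
}
```

For a non-normed PCA, the same three calls would use scale = FALSE and scale.unit = FALSE.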
| ncp | the number of components retained for the PCA | 
| criterion | the criterion (the MSEP) calculated for each number of components | 
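The two returned elements can be inspected together, for instance by plotting the criterion against the number of components tested. The sketch below assumes that missMDA is installed and that criterion holds one value per number of components from ncp.min to ncp.max, in that order.

```r
# Assumes the missMDA package is installed
if (requireNamespace("missMDA", quietly = TRUE)) {
  library(missMDA)
  data(orange)
  nb <- estim_ncpPCA(orange, ncp.min = 0, ncp.max = 4)
  nb$ncp                                        # retained number of components
  ncp_tested <- seq_along(nb$criterion) - 1     # ncp.min = 0 here
  plot(ncp_tested, nb$criterion, type = "b",
       xlab = "number of components", ylab = "criterion (MSEP)")
}
```

A flat curve around the minimum suggests that neighbouring numbers of components predict almost equally well.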
Francois Husson francois.husson@institut-agro.fr and Julie Josse julie.josse@polytechnique.edu
Bro, R., Kjeldahl, K., Smilde, A. K. and Kiers, H. A. L. (2008) Cross-validation of component models: A critical look at current methods. Analytical and Bioanalytical Chemistry, 5, 1241-1251.
Josse, J. and Husson, F. (2011). Selecting the number of components in PCA using cross-validation approximations. Computational Statistics and Data Analysis. 56 (6), pp. 1869-1879.
imputePCA
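## Not run: 
library(missMDA)
data(orange)
# Test 0 to 4 components with the default criterion (method.cv = "gcv")
nb <- estim_ncpPCA(orange, ncp.min = 0, ncp.max = 4)
nb$ncp  # number of components retained
## End(Not run)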