imputeMFA | R Documentation |
Impute the missing values of a dataset with Multiple Factor Analysis (MFA). The variables are structured a priori into groups of variables. The variables can be continuous or categorical but within a group the nature of the variables is the same. Can be used as a preliminary step before performing MFA on an incomplete dataset.
imputeMFA(X, group, ncp = 2, type=rep("s",length(group)), method = c("Regularized","EM"),
row.w = NULL, coeff.ridge = 1,threshold = 1e-06, ind.sup = NULL,
num.group.sup = NULL, seed = NULL, maxiter = 1000, ...)
X |
a data.frame with groups of continuous or categorical variables containing missing values |
group |
a vector indicating the number of variables in each group |
ncp |
integer corresponding to the number of components used to predict the missing entries |
type |
the type of variables in each group; three possibilities: "c" or "s" for continuous variables (for "c" the variables are centered and for "s" variables are scaled to unit variance), "n" for categorical variables |
method |
"Regularized" by default or "EM" |
row.w |
row weights (by default, a vector of 1 for uniform row weights) |
coeff.ridge |
1 by default to perform the regularized imputeMFA algorithm; useful only if method="Regularized". Other regularization terms can be implemented by setting the value to less than 1 in order to regularized less (to get closer to the results of the EM method) or more than 1 to regularized more |
threshold |
the threshold for assessing convergence |
ind.sup |
a vector indicating the indexes of the supplementary individuals |
num.group.sup |
a vector indicating the group of variables that are supplementary |
seed |
integer, by default seed = NULL implies that missing values are initially imputed by the mean of each variable for the continuous variables and by the proportion of the category for the categorical variables coded with indicator matrices of dummy variables. Other values leads to a random initialization |
maxiter |
integer, maximum number of iteration for the algorithm |
... |
further arguments passed to or from other methods |
Impute the missing entries of a data with groups of variables using the iterative MFA algorithm (method="EM") or the regularised iterative MFA algorithm (method="Regularized"). The (regularized) iterative MFA algorithm first consists in coding the categorical variables using the indicator matrix
of dummy variables. Then, in the initialization step, missing values are imputed with initial values such as the mean of the variable for the continuous variables and the proportion of the category for each category using the non-missing entries. If the argument seed is set to a specific value, a random initialization is performed: the initial values are drawn from a gaussian distribution
with mean and standard deviation calculated from the observed values for each continuous variable. The second step of the (regularized) iterative MFA algorithm is to perform MFA on the completed dataset. Then, it imputes the missing values with the (regularized) reconstruction formulae of order ncp (the fitted matrix computed with ncp components for the (regularized) scores and loadings). These steps of estimation of the parameters via MFA and imputation of the missing values using the (regularized) fitted matrix are iterate until convergence.
We advice to use the regularized version of the algorithm to avoid the overfitting problems which are very frequent when there are many missing values. In the regularized algorithm, the singular values of the MFA are shrinked.
The output of the algorithm can be used as an input of the MFA function of the FactoMineR package in order to perform the MFA on an incomplete dataset.
tab.disj |
the imputed matrix; the observed values are kept for the non-missing entries and the missing values are replaced by the predicted ones. The categorical variables are coded with the indicator matrix of dummy variables. In this indicator matrix, the imputed values are real numbers but they met the constraint that the sum of the entries corresponding to one individual and one variable is equal to one. Consequently they can be seen as degree of membership to the corresponding category |
completeObs |
the imputed dataset; the observed values are kept for the non-missing entries and the missing values are replaced by the predicted ones. For the continuous variables, the values are the same as in the tab.disj output; for the categorical variables missing values are imputed with the most plausible categories according to the values in the tab.disj output |
call |
the matched call |
Francois Husson francois.husson@institut-agro.fr and Julie Josse julie.josse@polytechnique.edu
F. Husson, J. Josse (2013) Handling missing values in multiple factor analysis. Food Quality and Preferences, 30 (2), 77-85.
Josse, J. and Husson, F. missMDA (2016). A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70 (1), pp 1-31 \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v070.i01")}
imputePCA
## Not run:
data(orange)
## Impute the data and perform a MFA
## with groups of continuous variables only
res.impute <- imputeMFA(orange, group=c(5,3), type=rep("s",2),ncp=2)
res.mfa <- MFA(res.impute$completeObs,group=c(5,3),type=rep("s",2))
## End(Not run)
## Not run:
data(vnf)
## Impute the indicator matrix and perform a MFA
## with groups of categorical variables only
res.comp <- imputeMFA(vnf,group=c(6,5,3),type=c("n","n","n"),ncp=2)
require(FactoMineR)
res.mfa <- MFA(vnf,group=c(6,5,3),type=c("n","n","n"),tab.comp=res.comp)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.