imputeMCA: Impute categorical dataset


imputeMCA    R Documentation

Impute categorical dataset

Description

Impute the missing values of a categorical dataset using Multiple Correspondence Analysis (MCA). Can be used as a preliminary step before performing MCA on an incomplete dataset.

Usage

imputeMCA(don, ncp=2, method = c("Regularized","EM"), row.w=NULL, coeff.ridge=1, 
    threshold=1e-06, ind.sup = NULL, quanti.sup=NULL, quali.sup=NULL,
    seed=NULL, maxiter=1000)

Arguments

don

a data.frame with categorical variables containing missing values

ncp

integer corresponding to the number of dimensions used to predict the missing entries

method

"Regularized" by default or "EM"

row.w

row weights (by default, a vector of 1 for uniform row weights)

coeff.ridge

1 by default to perform the regularized imputeMCA algorithm; used only if method="Regularized". Other amounts of regularization can be obtained by setting the value to less than 1 in order to regularize less (to get closer to the results of the EM method) or to more than 1 in order to regularize more (to get closer to the results of the proportion imputation)

threshold

the threshold for assessing convergence

ind.sup

a vector indicating the indexes of the supplementary individuals

quanti.sup

a vector indicating the indexes of the quantitative supplementary variables

quali.sup

a vector indicating the indexes of the categorical supplementary variables

seed

integer; by default, seed = NULL implies that missing values are initially imputed by the proportion of the category for the categorical variables coded with indicator matrices of dummy variables. Any other value leads to a random initialization

maxiter

integer, maximum number of iterations for the regularized iterative MCA algorithm

Details

Impute the missing entries of a categorical dataset using the iterative MCA algorithm (method="EM") or the regularized iterative MCA algorithm (method="Regularized"). The (regularized) iterative MCA algorithm first codes the categorical variables using the indicator matrix of dummy variables. Then, in the initialization step, missing values are imputed with initial values such as the proportion of each category, computed from the non-missing entries. This imputation is equivalent to running the algorithm with ncp=0 and is sometimes called the "missing fuzzy average method" in the literature. If the argument seed is set to a specific value, a random initialization is performed instead: random values are drawn in such a way that, in the indicator matrix of dummy variables, the entries corresponding to one individual and one variable sum to one. The second step of the (regularized) iterative MCA algorithm performs MCA on the completed dataset. The third step imputes the missing values with the (regularized) reconstruction formulae of order ncp (the fitted matrix computed with ncp components for the (regularized) scores and loadings). These two steps, estimation of the parameters via MCA and imputation of the missing values using the (regularized) fitted matrix, are iterated until convergence.
We advise using the regularized version of the algorithm to avoid the overfitting problems that are very frequent when there are many missing values. In the regularized algorithm, the singular values of the MCA are shrunk.
The number of components ncp used in the algorithm can be selected using the function estim_ncpMCA. Using a small number of components can also be seen as a way to regularize more, and may therefore be advisable to get more stable predictions.
The output of the algorithm can be used as an input of the MCA function of the FactoMineR package in order to perform MCA on an incomplete dataset.
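
The coding and initialization steps described above can be illustrated in base R. This is a minimal sketch on a made-up toy dataset, not the code used inside imputeMCA: the data and the helper names indicator, impute_prop and impute_random are hypothetical, and the uniform draw in impute_random is only one assumed way of satisfying the sum-to-one constraint.

```r
# Toy data (hypothetical): two categorical variables with missing values
don <- data.frame(
  colour = factor(c("red", "blue", NA, "red", "blue", "red")),
  size   = factor(c("S", NA, "L", "L", "S", "S"))
)

# Code one factor as an indicator (dummy) matrix; rows with NA stay NA
indicator <- function(f) {
  sapply(levels(f), function(l) as.numeric(f == l))
}

# Initialization by proportion: each missing row receives the category
# proportions computed on the observed rows
impute_prop <- function(m) {
  p <- colMeans(m, na.rm = TRUE)
  miss <- which(is.na(m[, 1]))
  m[miss, ] <- rep(p, each = length(miss))
  m
}

# Random initialization (assumed scheme): uniform draws rescaled so that
# each missing row sums to one
impute_random <- function(m, seed) {
  set.seed(seed)
  miss <- which(is.na(m[, 1]))
  r <- matrix(runif(length(miss) * ncol(m)), nrow = length(miss))
  m[miss, ] <- r / rowSums(r)
  m
}

# Initial completed indicator matrix for the whole dataset
tab_init <- do.call(cbind, unname(lapply(don, function(f) impute_prop(indicator(f)))))
tab_rand <- do.call(cbind, unname(lapply(don, function(f) impute_random(indicator(f), seed = 1))))
print(round(tab_init, 2))
```

In both cases the entries for one individual and one variable sum to one, so each row of the completed matrix sums to the number of variables (here 2). The iterative algorithm would then alternate MCA on this matrix and re-imputation of the initially missing cells.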

Value

tab.disj

The imputed indicator matrix; the observed values are kept for the non-missing entries and the missing values are replaced by the predicted ones. The imputed values are real numbers, but they meet the constraint that the sum of the entries corresponding to one individual and one variable is equal to one. Consequently, they can be seen as degrees of membership to the corresponding categories.

completeObs

The imputed categorical dataset; the observed values are kept for the non-missing entries and the missing values are replaced by the predicted ones. Missing values are imputed with the most plausible categories according to the values in the tab.disj output.
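
The relation between the two outputs can be sketched in base R. The matrix below is a hypothetical stand-in for tab.disj, and the cols and levs lists (which columns and which category labels belong to each variable) are assumptions made for the illustration; "most plausible" is read here as the category with the largest value, which may differ in detail from what imputeMCA does internally.

```r
# Hypothetical imputed indicator matrix for two variables, three individuals
tab.disj <- rbind(
  c(1.0, 0.0, 0.0, 1.0),
  c(0.0, 1.0, 0.3, 0.7),   # 'size' was missing: imputed memberships 0.3 / 0.7
  c(0.8, 0.2, 1.0, 0.0)    # 'colour' was missing: imputed memberships 0.8 / 0.2
)
colnames(tab.disj) <- c("colour_blue", "colour_red", "size_L", "size_S")

# Assumed bookkeeping: columns and category labels of each variable
cols <- list(colour = 1:2, size = 3:4)
levs <- list(colour = c("blue", "red"), size = c("L", "S"))

# Check the constraint: entries of one individual and one variable sum to one
sums_ok <- sapply(cols, function(j) {
  all(abs(rowSums(tab.disj[, j, drop = FALSE]) - 1) < 1e-8)
})

# Keep, for each variable, the category with the largest degree of membership
completeObs <- as.data.frame(Map(function(j, l) {
  best <- max.col(tab.disj[, j, drop = FALSE], ties.method = "first")
  factor(l[best], levels = l)
}, cols, levs))
print(completeObs)
```

Observed cells contain exact 0/1 values, so this rule returns the observed category for them and only makes a real choice for the imputed cells.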

Author(s)

Francois Husson francois.husson@institut-agro.fr and Julie Josse julie.josse@polytechnique.edu

References

Josse, J., Chavent, M., Liquet, B. and Husson, F. (2010). Handling missing values with Regularized Iterative Multiple Correspondence Analysis. Journal of Classification, 29 (1), pp. 91-116.
Josse, J. and Husson, F. (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70 (1), pp. 1-31. doi:10.18637/jss.v070.i01

See Also

estim_ncpMCA,
Video showing how to perform MCA on an incomplete dataset

Examples

## Not run: 
library(missMDA)  # provides imputeMCA, estim_ncpMCA and the vnf dataset
data(vnf)
## First the number of components has to be chosen 
##   (for the reconstruction step)
## nb <- estim_ncpMCA(vnf,ncp.max=5) ## Time-consuming, nb = 4

## Impute the indicator matrix and perform an MCA
res.impute <- imputeMCA(vnf, ncp=4)

## The imputed indicator matrix can be used as an input of the MCA function of the
## FactoMineR package to perform the MCA on the incomplete data vnf 
library(FactoMineR)
res.mca <- MCA(vnf,tab.disj=res.impute$tab.disj) 

## With supplementary variables (var 11 to 14), impute the active ones
res.impute <- imputeMCA(vnf[,1:10], ncp=4)
res.mca <- MCA(vnf,tab.disj=res.impute$tab.disj,quali.sup=11:14) 

## End(Not run)

missMDA documentation built on Nov. 17, 2023, 5:07 p.m.