ForImp.Mahala: Imputation of missing data by using Nearest Neighbour...
In GenForImp: The Forward Imputation: A Sequential Distance-Based Approach for Imputing Missing Data


This function imputes quantitative missing data by using Nearest Neighbour Imputation (NNI) with the Mahalanobis distance in a forward and sequential step-by-step process that starts from the complete part of data.

ForImp.Mahala(mat, probs=seq(0, 1, 0.1), q="10%", add.unit=TRUE, squared=FALSE, tol=1e-6)

`mat`	a quantitative data matrix with missing entries.
`probs`	vector of probabilities with values in [0, 1] for computing quantiles of Mahalanobis distances in selection of donors. Default option: `probs=seq(0,1,0.1)` calculates the deciles of distances. Quantiles are computed with the generic function `quantile`.
`q`	string of the form `"X%"`, with `X`=integer. It gives the quantile of Mahalanobis distances corresponding to the first `"X%"` distances as computed (and named) by the function `quantile` with probabilities specified in the argument `probs`.
`add.unit`	a logical value. If `add.unit=TRUE` (default), the covariance matrix in the Mahalanobis distance is computed at every step of the procedure by including also the incomplete unit whose donors are to be selected. Otherwise, `add.unit=FALSE` indicates that computation involves the complete units only.
`squared`	a logical value indicating whether the Mahalanobis distance (`squared=FALSE`, default) or the squared Mahalanobis distance (`squared=TRUE`) is to be used.
`tol`	tolerance factor introduced to prevent numerical problems occurring when distances of complete units are equal to the chosen quantile `q`. Default is `tol=1e-6`.
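
The link between `probs` and `q` can be illustrated with base R alone. The sketch below uses a hypothetical vector `d` of distances (not produced by the package) to show how `quantile` names its output, which is why `q` has to match one of those names. The donor rule and the way `tol` enters it are assumptions made for illustration, not the package's internal code.

```r
# Hypothetical distances from one incomplete unit to ten complete units
d <- c(0.4, 1.3, 2.2, 0.9, 3.1, 1.7, 0.6, 2.8, 1.1, 2.0)

# Quantiles at the default probabilities; the names are "0%", "10%", ..., "100%"
qs <- quantile(d, probs = seq(0, 1, 0.1))
names(qs)

# Threshold picked by its name, in the same form as the argument q = "10%"
threshold <- qs[["10%"]]

# Illustrative donor rule (an assumption): units whose distance does not
# exceed the threshold, with a small tolerance for floating-point ties
tol <- 1e-6
donors <- which(d <= threshold + tol)
donors
```

With a coarser grid such as `probs = seq(0, 1, 0.25)`, the only valid names would be `"0%"`, `"25%"`, `"50%"`, `"75%"` and `"100%"`, so `q = "10%"` would no longer correspond to a computed quantile.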

ForImp.Mahala is a forward imputation method that serves as an alternative to the ForImp.PCA procedure for imputing quantitative missing data (see ForImp.PCA). It does not include Stage 1, since it works directly on the original variables. In Stage 2, the basic metric for the NNI method is the Mahalanobis distance. Steps 2 to 3 are therefore repeated iteratively until the starting data matrix is completely imputed.
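
A single donor-selection step of this kind can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy matrix `X` is hypothetical, the covariance matrix is computed on the complete units only (as with `add.unit=FALSE`), and imputing with the mean of the donors' values is an assumption.

```r
set.seed(1)

# Toy data (hypothetical): 8 units, 3 variables, one missing entry in unit 8
X <- matrix(rnorm(24), nrow = 8, ncol = 3)
X[8, 3] <- NA

complete <- X[complete.cases(X), , drop = FALSE]  # units 1 to 7
target   <- X[8, ]
obs      <- !is.na(target)                        # variables observed in unit 8

# Covariance of the observed variables, computed on the complete units only
S <- cov(complete[, obs, drop = FALSE])

# mahalanobis() returns squared distances, so take the square root
d <- sqrt(mahalanobis(complete[, obs, drop = FALSE], center = target[obs], cov = S))

# Donors: complete units whose distance falls within the first decile
donors <- which(d <= quantile(d, 0.1) + 1e-6)

# Assumed aggregation rule: mean of the donors' values on the missing variable
X[8, 3] <- mean(complete[donors, 3])
X[8, ]
```

In a forward scheme, the newly completed unit would then join the complete part of the data before the next incomplete unit is processed.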

Unlike ForImp.PCA, the ForImp.Mahala procedure requires that the number n of units be greater than or equal to the number p of variables at every step of the procedure; otherwise the covariance matrix involved in the Mahalanobis distance is not invertible. For further details, see the references below.
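
The invertibility issue can be checked directly with base R. The sketch below uses simulated matrices `few` and `many` (hypothetical names) and compares the rank of the sample covariance matrix when there are too few units with the rank obtained when units are plentiful.

```r
set.seed(1)
p <- 5

# Fewer units than variables: the sample covariance matrix is rank-deficient
# (its rank is at most n - 1), so it cannot be inverted
few <- matrix(rnorm(3 * p), nrow = 3, ncol = p)
qr(cov(few))$rank

# Plenty of units: the covariance matrix has full rank and solve() succeeds
many <- matrix(rnorm(20 * p), nrow = 20, ncol = p)
qr(cov(many))$rank
S_inv <- solve(cov(many))
```

A rank check of this kind on the complete part of the data is one way to anticipate whether the Mahalanobis distance can be computed before imputation starts.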

The imputed data matrix.

Nadia Solaro, Alessandro Barbiero, Giancarlo Manzi, Pier Alda Ferrari

Solaro, N., Barbiero, A., Manzi, G., Ferrari, P.A. (2014). Algorithmic-type imputation techniques with different data structures: Alternative approaches in comparison. In: Vicari, D., Okada, A., Ragozini, G., Weihs, C. (eds), Analysis and modeling of complex data in behavioural and social sciences, Studies in Classification, Data Analysis, and Knowledge Organization. Springer International Publishing, Cham (CH): 253-261.

Solaro, N., Barbiero, A., Manzi, G., Ferrari, P.A. (2015). A sequential distance-based approach for imputing missing data: The Forward Imputation. Under review.

ForImp.PCA

# EXAMPLE with multivariate normal data (MVN)
# load mvtnorm for rmvnorm() and GenForImp for missing.gen() and ForImp.Mahala()
library(mvtnorm)
library(GenForImp)
# number of variables
p <- 5
# correlation matrix
rho <- 0.8
Rho <- matrix(rho, p, p)
diag(Rho) <- 1
Rho
# mean vector
vmean <- rep(0,p)
vmean
# number of units
n <- 1000
# percentage of missing values
percmiss <- 0.2
nummiss <- n*p*percmiss
nummiss
# generation of a complete matrix
set.seed(1)
x0 <- rmvnorm(n, mean=vmean, sigma=Rho)
x0
# generating a matrix with missing data
x <- missing.gen(x0, nummiss) 
# imputing missing values
xForImpMahala <- ForImp.Mahala(x)
xForImpMahala
# computing the Relative Mean Square Error
error <- sum(apply((x0-xForImpMahala)^2/diag(var(x0)),2,sum)) / n
error


# EXAMPLE with real data
data(airquality)
m0 <- airquality
m0
# selecting the first 4 columns, with quantitative data
m <- m0[, 1:4]
m
# imputation
mi <- ForImp.Mahala(m)
mi
# plot of imputed values for variable "Ozone"
ozone.miss.ind <- which(is.na(m)[,1])
plot(mi[ozone.miss.ind,1], axes=FALSE, pch=19, ylab="imputed values of Ozone", 
  xlab="observation index")
axis(2)
axis(1, at=1:length(ozone.miss.ind), labels=ozone.miss.ind, las=2)
box()
abline(v=1:length(ozone.miss.ind), lty=3, col="grey")