Impute missing values

Description

Imputes missing values in a data matrix using the K-nearest neighbor algorithm.

Usage

imputeKNN(data, k = 10, distance = "euclidean", rm.na = TRUE,
          rm.nan = TRUE, rm.inf = TRUE)

Arguments

data

a data matrix

k

number of neighbors to use

distance

distance metric to use, one of "euclidean" or "correlation"

rm.na

should NA values be imputed?

rm.nan

should NaN values be imputed?

rm.inf

should Inf values be imputed?
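
The three rm.* flags refer to three different kinds of non-finite entry. How the function tells them apart internally is not documented here; the snippet below only shows how base R's own predicates classify them, which is a reasonable guide when deciding which flags to set. Note that is.na() is also TRUE for NaN.

```r
# A vector containing one ordinary value and each kind of non-finite entry
x <- c(1.5, NA, NaN, Inf, -Inf)

is.na(x)               # TRUE for NA and for NaN
is.nan(x)              # TRUE only for NaN
is.infinite(x)         # TRUE for Inf and -Inf
is.na(x) & !is.nan(x)  # TRUE only for a "pure" NA
```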

Details

Uses the K-nearest neighbor algorithm, as described in Troyanskaya et al. (2001), to impute missing values in a data matrix. Imputation is row-wise: for each row with missing values, the neighbors are the rows closest to it in distance. Two distance metrics are available: Euclidean (the default) and a correlation 'metric'. If the latter is selected, each row is first normalized to mean zero and standard deviation one before neighbors are selected. The inverse transformation is then applied before the imputed data matrix is returned, so the result is on the original scale.
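
The row-wise procedure described above can be sketched in base R. This is an illustration only, not the package's implementation: the function names knn_impute_sketch, row_standardize and row_unstandardize are hypothetical, the neighbor values are combined with a plain unweighted mean, and distances are computed over the columns observed in both rows. All three choices are assumptions. The standardization helpers show the kind of invertible row transformation used for the correlation option; for two complete rows standardized this way, squared Euclidean distance is proportional to one minus their Pearson correlation.

```r
# Hypothetical sketch: row-wise KNN imputation with Euclidean distance.
knn_impute_sketch <- function(x, k = 10) {
  stopifnot(is.matrix(x), is.numeric(x), k >= 1)
  out <- x
  for (i in which(rowSums(is.na(x)) > 0)) {
    obs_i <- !is.na(x[i, ])
    # Distance from row i to every row, using columns observed in both
    d <- vapply(seq_len(nrow(x)), function(r) {
      shared <- obs_i & !is.na(x[r, ])
      if (r == i || !any(shared)) return(Inf)
      sqrt(mean((x[i, shared] - x[r, shared])^2))
    }, numeric(1))
    for (j in which(!obs_i)) {
      # Candidate neighbors must have an observed value in column j
      cand <- which(!is.na(x[, j]) & is.finite(d))
      if (length(cand) == 0) next  # no usable neighbor: leave as NA
      nn <- cand[order(d[cand])][seq_len(min(k, length(cand)))]
      out[i, j] <- mean(x[nn, j])  # unweighted mean (assumption)
    }
  }
  out
}

# Hypothetical helpers: row-normalize to mean 0 and sd 1, ignoring NAs,
# and keep the parameters so the transformation can be inverted.
# Rows with zero variance or a single observed value give NaN/NA here.
row_standardize <- function(x) {
  mu <- rowMeans(x, na.rm = TRUE)
  s <- apply(x, 1, sd, na.rm = TRUE)
  list(z = (x - mu) / s, mu = mu, s = s)
}
row_unstandardize <- function(z, mu, s) z * s + mu

# Tiny worked example: row 1 is closest to rows 2 and 3
m <- rbind(c(1, 2, NA), c(1, 2, 3), c(1, 2, 5), c(10, 10, 100))
filled <- knn_impute_sketch(m, k = 2)
filled[1, 3]  # 4, the mean of 3 and 5
```

With k = 2 the distant fourth row is ignored, which is the point of selecting neighbors by distance rather than averaging the whole column.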

Value

A data matrix with missing values imputed.

Author(s)

Guy Brock

References

O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520-525, 2001.

G.N. Brock, J.R. Shaffer, R.E. Blakesley, M.J. Lotz, and G.C. Tseng. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics, 9:12, 2008.

See Also

See the package vignette for an illustration of usage.

Examples

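## generate some fake data and impute missing values (MVs)
set.seed(101)
mat <- matrix(rnorm(500), nrow = 100, ncol = 5)
## pick 50 of the 500 cells at random and set them to NA
idx.mv <- sample(seq_along(mat), 50, replace = FALSE)
mat[idx.mv] <- NA
## impute with the defaults (k = 10, Euclidean distance)
imputed <- imputeKNN(mat)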