| knn_imp | R Documentation |
Impute missing values in a numeric matrix using k-nearest neighbors (K-NN).
knn_imp(
obj,
k,
colmax = 0.9,
method = c("euclidean", "manhattan"),
cores = 1,
post_imp = TRUE,
subset = NULL,
dist_pow = 0,
tree = FALSE,
max_cache = 4,
na_check = TRUE
)
obj |
A numeric matrix with samples in rows and features in columns. |
k |
Integer. Number of nearest neighbors for imputation. 10 is a good starting point. |
colmax |
Numeric. A number from 0 to 1. Columns whose proportion of missing values exceeds this threshold are skipped during imputation. |
method |
Character. Either "euclidean" (default) or "manhattan". Distance metric for nearest neighbor calculation. |
cores |
Integer. Number of cores for K-NN parallelization (OpenMP). On macOS, OpenMP may need additional compiler configuration. |
post_imp |
Boolean. Whether to impute remaining missing values (those that failed imputation) using column means. |
subset |
Character or integer. Vector of column names or column indices specifying which columns to impute. |
dist_pow |
Numeric. Exponent that controls how strongly more distant neighbors are
down-weighted in the weighted average. With the default of 0, all neighbors are weighted equally. |
tree |
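Logical. If TRUE, use a BallTree (via {mlpack}) for the neighbor search instead of brute-force K-NN. Default is FALSE. |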
max_cache |
Numeric. Maximum allowed cache size in GB (default 4). |
na_check |
Boolean. Check for leftover NA values. |
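The colmax and post_imp arguments can be pictured with a small base R sketch. This is only an illustration under stated assumptions, not the package's internal code: the toy matrix, the threshold of 0.5, and the helper mean_fill() are all hypothetical, and the sketch does not show how knn_imp itself treats skipped columns.

```r
# Toy matrix (hypothetical data): rows are samples, columns are features.
obj <- matrix(c(1, NA, NA, NA,   # f1: 75% missing
                2,  4, NA,  6,   # f2: 25% missing
                1,  2,  3,  4),  # f3: complete
              nrow = 4,
              dimnames = list(NULL, c("f1", "f2", "f3")))

# Illustrative threshold; the documented default is 0.9.
colmax <- 0.5

# Column-wise missing rate, compared against the threshold.
miss_rate <- colMeans(is.na(obj))
skipped   <- names(miss_rate)[miss_rate > colmax]
eligible  <- names(miss_rate)[miss_rate > 0 & miss_rate <= colmax]

# Hypothetical fallback in the spirit of post_imp = TRUE:
# replace any remaining NA in a column with that column's mean.
mean_fill <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

filled <- obj
filled[, "f2"] <- mean_fill(filled[, "f2"])
```

Here f1 exceeds the threshold and lands in skipped, f2 is eligible, and the single gap in f2 is filled with the mean of its observed values.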
This function performs imputation column-wise (using rows as observations).
When dist_pow > 0, imputed values are computed as distance-weighted
averages where weights are inverse distances raised to the power of
dist_pow.
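The weighting described above can be sketched in a few lines of base R. The helper weighted_impute() and the example values are hypothetical and only illustrate the formula (weights equal to inverse distances raised to dist_pow); the sketch does not handle neighbors at distance zero, and it is not the package's implementation.

```r
# Hypothetical helper: combine neighbor values into one imputed value.
# values: observed values of the k nearest neighbors
# dists:  their distances to the sample being imputed (all > 0 here)
weighted_impute <- function(values, dists, dist_pow = 0) {
  # Inverse distance raised to dist_pow; dist_pow = 0 gives equal weights.
  w <- 1 / dists^dist_pow
  sum(w * values) / sum(w)
}

values <- c(1, 2, 4)
dists  <- c(1, 2, 4)

plain    <- weighted_impute(values, dists, dist_pow = 0)  # simple mean
weighted <- weighted_impute(values, dists, dist_pow = 1)  # closer neighbors count more
```

With dist_pow = 0 the result is the ordinary mean of the neighbor values; raising dist_pow pulls the imputed value toward the closest neighbor.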
The tree parameter (when TRUE) uses a BallTree for faster neighbor search
via {mlpack} but requires pre-filling missing values with column means.
This can introduce a small bias when missingness is high.
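To see why mean pre-filling can shift results, the following base R sketch fills gaps with column means and compares one pairwise distance before and after. The helper prefill_col_means() and the toy matrix are hypothetical; the sketch assumes every column has at least one observed value and does not reproduce the {mlpack} search itself.

```r
# Hypothetical helper: replace each NA with the mean of its column.
prefill_col_means <- function(x) {
  for (j in seq_len(ncol(x))) {
    gaps <- is.na(x[, j])
    x[gaps, j] <- mean(x[, j], na.rm = TRUE)
  }
  x
}

# Toy matrix: 3 samples (rows), 2 features (columns); sample 3 lacks feature 1.
x <- matrix(c(0, 10, NA,
              0, 10, 20), nrow = 3)

filled <- prefill_col_means(x)

# Distance between samples 3 and 2 using the pre-filled value ...
d_prefilled <- sqrt(sum((filled[3, ] - filled[2, ])^2))

# ... versus the distance over the one feature both samples actually observed.
d_observed <- abs(x[3, 2] - x[2, 2])
```

The pre-filled value contributes to the distance as if it had been measured, so the two distances differ. The more values are filled this way, the more the neighbor ranking can drift from one based on observed data only.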
A numeric matrix of the same dimensions as obj with missing
values imputed.
tree = FALSE (default, brute-force K-NN): Always safe and usually
faster for small to moderate data or high-dimensional cases.
tree = TRUE (BallTree K-NN): Only use when imputation run time
becomes prohibitive and missingness is low (<5% missing).
Subset imputation: Use the subset parameter for efficiency when only
specific columns need imputation (e.g., epigenetic clock CpGs).
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6): 520-525.
# Basic K-NN imputation
obj <- sim_mat(20, 20, perc_col_na = 1)$input
sum(is.na(obj))
result <- knn_imp(obj, k = 10)
result