tune_imp: Tune Imputation Method Parameters
In slideimp: Numeric Matrices K-NN and PCA Imputation

Description

Tune method-specific hyperparameters by repeatedly masking observed values, imputing them, and comparing the imputed values with the original values.

Usage

tune_imp(
  obj,
  parameters = NULL,
  .f,
  na_loc = NULL,
  num_na = NULL,
  n_reps = 1,
  n_cols = NULL,
  n_rows = 2,
  rowmax = 0.9,
  colmax = 0.9,
  na_col_subset = NULL,
  max_attempts = 100,
  .progress = TRUE,
  cores = 1,
  location = NULL,
  pin_blas = FALSE
)

Arguments

`obj`	A numeric matrix.
`parameters`	A `data.frame` specifying parameter combinations to tune. Each column should be a parameter accepted by `.f`, excluding `obj`. List-columns are supported for complex parameters. Duplicate rows are removed. `NULL` is treated as a single parameter set with no additional arguments, which is useful for functions whose required arguments all have defaults.
`.f`	One of `"knn_imp"`, `"pca_imp"`, or `"slide_imp"`, or a custom imputation function.
`na_loc`	Optional predefined missing-value locations. Accepted formats are a two-column integer matrix of row and column indices, a numeric vector of linear positions, or a list whose elements are either of those formats.
`num_na`	Integer or `NULL`. Total number of missing values to inject per repetition. If supplied, `n_cols` is derived from `num_na` and `n_rows`, and missing values are distributed as evenly as possible across columns.
`n_reps`	Integer. Number of independent repetitions.
`n_cols`	Integer or `NULL`. Number of columns selected for missing-value injection. Must be supplied when both `num_na` and `na_loc` are `NULL`, unless the automatic default applies.
`n_rows`	Integer. Target number of missing values to inject per selected column.
`rowmax`	Numeric scalar between `0` and `1`. Maximum allowed missing-data proportion per row after injection.
`colmax`	Numeric scalar between `0` and `1`. Maximum allowed missing-data proportion per column after injection.
`na_col_subset`	Optional integer or character vector restricting which columns are eligible for missing-value injection.
`max_attempts`	Integer. Maximum number of resampling attempts per repetition before giving up.
`.progress`	Logical. If `TRUE`, show progress during tuning.
`cores`	Integer. Number of cores to use for K-NN and sliding-window K-NN imputation. For other methods, use `mirai::daemons()`.
`location`	Numeric vector of column locations. Required when `.f = "slide_imp"`.
`pin_blas`	Logical. If `TRUE`, pin BLAS threads to 1 during parallel tuning to reduce thread contention.

Details

Built-in methods can be selected by passing .f = "knn_imp", .f = "pca_imp", or .f = "slide_imp". A custom function can also be supplied. Custom functions must accept obj as their first argument and return a numeric matrix with the same dimensions as obj.
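
As a minimal sketch of the custom-function contract, the hypothetical `col_mean_imp()` below fills each missing cell with its column mean and then checks the shape of the result. The function name and its `fallback` argument are illustrative assumptions, not part of slideimp.

```r
# Hypothetical custom imputer: `obj` comes first, tunable arguments follow.
col_mean_imp <- function(obj, fallback = 0) {
  stopifnot(is.matrix(obj), is.numeric(obj))
  col_means <- colMeans(obj, na.rm = TRUE)
  # A column that is entirely NA has a NaN mean; use `fallback` there.
  col_means[is.nan(col_means)] <- fallback
  # Row/column indices of every missing cell.
  na_idx <- which(is.na(obj), arr.ind = TRUE)
  obj[na_idx] <- col_means[na_idx[, "col"]]
  obj
}

m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)
out <- col_mean_imp(m)

# The contract: numeric matrix, same dimensions as the input.
stopifnot(is.numeric(out), identical(dim(out), dim(m)), !anyNA(out))
```

A function like this could then be passed as `.f = col_mean_imp`, with `parameters = data.frame(fallback = 0)` supplying the extra argument.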

When na_loc is supplied, num_na, n_cols, n_rows, and na_col_subset are ignored.
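
The two basic `na_loc` formats can describe the same cells. The base R sketch below converts between a two-column index matrix and linear positions, assuming linear positions follow R's usual column-major order. It illustrates the formats only and does not show slideimp internals.

```r
m <- matrix(seq_len(20), nrow = 4, ncol = 5)

# Format 1: two-column integer matrix of (row, column) indices.
idx <- cbind(row = c(1L, 3L, 2L), col = c(1L, 1L, 4L))

# Format 2: linear positions, counted down the columns.
lin <- (idx[, "col"] - 1L) * nrow(m) + idx[, "row"]

# Both forms select the same cells of the matrix.
stopifnot(identical(m[idx], m[lin]))

# arrayInd() converts linear positions back to row/column indices.
back <- arrayInd(lin, .dim = dim(m))
stopifnot(all(back == idx))

# A list may mix the two formats, one element per set of positions.
na_loc_list <- list(idx, lin)
```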

When .f is a character string, columns in parameters are validated against the selected method:

"knn_imp" requires k.
"pca_imp" requires ncp.
"slide_imp" requires window_size, overlap_size, and min_window_n, plus exactly one of k or ncp.

To tune parameters for grouped imputation, tune knn_imp() or pca_imp() on representative groups, then pass the selected parameters to group_imp().
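
The column requirements for the built-in methods can be met with grids such as the ones sketched below. All values are placeholders chosen for illustration, not recommended settings.

```r
# K-NN: a `k` column is required.
params_knn <- data.frame(k = c(5, 10, 20))

# PCA: an `ncp` column is required.
params_pca <- data.frame(ncp = c(2, 5))

# Sliding window: the three window columns plus exactly one of `k` or `ncp`.
# expand.grid() crosses every listed value; here K-NN is used per window.
params_slide <- expand.grid(
  window_size = c(100, 200),
  overlap_size = 10,
  min_window_n = 20,
  k = c(5, 10)
)

# Exactly one of `k` / `ncp` is present in the sliding-window grid.
stopifnot(xor("k" %in% names(params_slide), "ncp" %in% names(params_slide)))
```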

The top-level rowmax and colmax arguments control random missing-value injection performed by sample_na_loc(). To tune or pass an imputation method's own colmax argument, include a colmax column in parameters.
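
To make the distinction concrete, the sketch below checks a proposed set of masked positions against row and column limits, then builds a parameter grid that tunes a method-level `colmax`. The helper `within_limits()` is hypothetical and is not how sample_na_loc() is implemented. Whether the package compares with `<=` or `<` is also an assumption, as is the chosen method accepting a `colmax` argument.

```r
# Toy matrix with a few values already missing in row 1.
obj <- matrix(as.numeric(1:60), nrow = 6, ncol = 10)
obj[1, 1:3] <- NA

# Hypothetical helper (not a slideimp function): would masking the linear
# positions `pos` keep every row and column within the allowed proportions?
within_limits <- function(obj, pos, rowmax = 0.9, colmax = 0.9) {
  masked <- is.na(obj)
  masked[pos] <- TRUE
  all(rowMeans(masked) <= rowmax) && all(colMeans(masked) <= colmax)
}

col4 <- which(col(obj) == 4)               # linear positions of column 4
ok_half <- within_limits(obj, col4[1:3])   # half of column 4 masked
ok_full <- within_limits(obj, col4)        # all of column 4 masked
stopifnot(ok_half, !ok_full)

# A method-level `colmax` is tuned like any other parameter column.
params <- data.frame(k = c(5, 10), colmax = c(0.8, 0.8))
```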

Tuning results can be summarized with compute_metrics() or evaluated with external packages such as yardstick.

Value

A data frame of class slideimp_tune containing:

columns originally provided in parameters;
param_set, an integer ID for each unique parameter combination;
rep_id, an integer repetition index;
result, a list-column where each element is a data frame containing truth and estimate columns;
error, a character column containing the error message if the iteration failed, otherwise NA.
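
The structure above can be summarized by hand as well as with compute_metrics(). The sketch below uses a mock data frame with the documented columns and made-up numbers, and computes RMSE per run and per parameter set in base R. It is a stand-in for illustration, not the implementation of compute_metrics().

```r
# Mock tuning output with the documented columns; all numbers are made up.
mock <- data.frame(
  k = c(2, 2, 4, 4),
  param_set = c(1L, 1L, 2L, 2L),
  rep_id = c(1L, 2L, 1L, 2L),
  error = NA_character_
)
mock$result <- list(
  data.frame(truth = c(1, 2, 3), estimate = c(1.5, 2, 2.5)),
  data.frame(truth = c(1, 2, 3), estimate = c(1, 2, 4)),
  data.frame(truth = c(1, 2, 3), estimate = c(1, 2, 3)),
  data.frame(truth = c(1, 2, 3), estimate = c(2, 2, 3))
)

# Keep only iterations that did not fail.
ok <- is.na(mock$error)

# RMSE for each successful run.
rmse <- vapply(
  mock$result[ok],
  function(d) sqrt(mean((d$estimate - d$truth)^2)),
  numeric(1)
)
per_run <- data.frame(
  param_set = mock$param_set[ok],
  rep_id = mock$rep_id[ok],
  rmse = rmse
)

# Average over repetitions within each parameter set.
by_set <- aggregate(rmse ~ param_set, data = per_run, FUN = mean)
by_set
```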

Parallelization

K-NN: use the cores argument. If mirai daemons are active, cores is automatically set to 1 to avoid nested parallelism.
PCA: use mirai::daemons() instead of cores.

When running PCA imputation in parallel with mirai, set pin_blas = TRUE in tune_imp() or group_imp() to prevent BLAS threads from oversubscribing CPU cores. This relies on the RhpcBLASctl package and works with OpenBLAS and MKL (typical on Linux, and on Windows after an OpenBLAS swap). On macOS, pin_blas = TRUE may have no effect.
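
The two routes might be combined as sketched below. The calls mirror the Usage and Examples sections; the `ncp` values and core counts are arbitrary, and the block runs only when slideimp, mirai, and RhpcBLASctl are all installed.

```r
if (requireNamespace("slideimp", quietly = TRUE) &&
    requireNamespace("mirai", quietly = TRUE) &&
    requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  set.seed(123)
  obj <- slideimp::sim_mat(10, 50)$input

  # K-NN: parallelism is requested through `cores`; no daemons are needed.
  res_knn <- slideimp::tune_imp(
    obj,
    data.frame(k = c(2, 4)),
    .f = "knn_imp",
    num_na = 10,
    cores = 2,
    .progress = FALSE
  )

  # PCA: start mirai daemons and pin BLAS to one thread per worker.
  mirai::daemons(2)
  res_pca <- tryCatch(
    slideimp::tune_imp(
      obj,
      data.frame(ncp = c(1, 2)),
      .f = "pca_imp",
      num_na = 10,
      pin_blas = TRUE,
      .progress = FALSE
    ),
    # Always shut the daemons down, even if tuning fails.
    finally = mirai::daemons(0)
  )
}
```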

PCA Performance tips

Speed comes from three levers: solver (through LOBPCG with warm-start), threshold, and scale. Tune these first, then accuracy parameters (ncp, coeff.ridge) on a representative subset.

Exact vs. LOBPCG with warm-start. Whether "lobpcg" beats "exact" depends on size and low-rankness: "lobpcg" is preferred for large, approximately low-rank matrices with small ncp, and "exact" for small matrices (including slide_imp() windows), where it is faster and more robust. Separately, the warm-start makes each successive solve cheap: pca_imp() warm-starts LOBPCG with the previous eigenblock and search direction, so once imputed values stabilize, later solves converge in a few iterations. The payoff therefore grows with the number of EM iterations, independent of low-rankness. solver = "auto" (default) probes both and is a safe start.
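
The warm-start effect can be illustrated outside the package. The sketch below uses plain orthogonal (subspace) iteration, which is a simpler method than LOBPCG and is not what pca_imp() runs. It counts the iterations needed to find the three leading eigenvectors of a slightly perturbed matrix, starting once from a random block and once from the previous solution. The matrix, spectrum, and tolerance are arbitrary choices for illustration.

```r
set.seed(1)
p <- 30
r <- 3

# Symmetric matrix with a known spectrum: three leading eigenvalues, then a tail.
Q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))
d <- c(10, 8, 6, seq(3, 0.1, length.out = p - r))
A <- Q %*% diag(d) %*% t(Q)

# Orthogonal iteration for the leading r-dimensional eigenspace.
orth_iter <- function(A, V0, tol = 1e-8, max_iter = 1000L) {
  V <- qr.Q(qr(V0))
  iter <- 0L
  repeat {
    iter <- iter + 1L
    W <- qr.Q(qr(A %*% V))
    # Part of the old basis that the new subspace does not capture.
    delta <- norm(V - W %*% crossprod(W, V), "F")
    V <- W
    if (delta < tol || iter >= max_iter) break
  }
  list(V = V, iter = iter)
}

# First solve, from a random start.
first <- orth_iter(A, matrix(rnorm(p * r), p, r))

# The matrix changes slightly, as it would between late EM iterations.
S <- matrix(rnorm(p * p), p, p)
A2 <- A + 1e-3 * (S + t(S)) / 2

cold <- orth_iter(A2, matrix(rnorm(p * r), p, r))  # random start
warm <- orth_iter(A2, first$V)                     # start from previous basis
c(cold = cold$iter, warm = warm$iter)
```

The warm start begins close to the new eigenspace, so it needs fewer iterations than the cold start. The saving repeats at every later solve, which is why the payoff grows with the number of EM iterations.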

Threshold. The default 1e-6 is conservative; 1e-5 is often faster with very similar values.

Scale. For columns on a common scale (e.g., DNAm beta values in [0, 1]), scale = FALSE can be faster and more accurate.
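
One way to compare the three levers is to hold `ncp` fixed and cross the speed settings in a single grid, as sketched below. The `solver` and `scale` columns follow the spellings used above. The `threshold` column name is an assumption taken from the lever's name and should be checked against the pca_imp() help page.

```r
# Speed levers crossed while the accuracy parameter `ncp` stays fixed.
params_speed <- expand.grid(
  ncp = 2,
  solver = c("exact", "lobpcg"),
  threshold = c(1e-6, 1e-5),   # assumed argument name
  scale = c(TRUE, FALSE),
  stringsAsFactors = FALSE     # keep `solver` as character, not factor
)

# 1 x 2 x 2 x 2 = 8 candidate settings.
nrow(params_speed)
```

A grid like this would be passed as `parameters` with `.f = "pca_imp"`, ideally on a representative subset of the data.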

Parallel and BLAS. When running in parallel via tune_imp() or group_imp() with a multithreaded BLAS, set pin_blas = TRUE to avoid thread oversubscription. On Windows, the stock BLAS can be slow; advanced users can swap in OpenBLAS.

See Speeding up PCA imputation for the full workflow.

Examples

set.seed(123)

# Simulate some data
obj <- sim_mat(10, 50)$input

# Tune K-NN imputation with random missing-value injection.
# Use larger `num_na` and `n_reps` values for real analyses.
params_knn <- data.frame(k = c(2, 4))
results <- tune_imp(
  obj,
  params_knn,
  .f = "knn_imp",
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)
compute_metrics(results)

# Tune with fixed missing-value positions.
# Each matrix holds (row, column) index pairs: rows 1-3 of column 1,
# then rows 2-4 of column 2.
na_positions <- list(
  matrix(c(1, 2, 3, 1, 1, 1), ncol = 2),
  matrix(c(2, 3, 4, 2, 2, 2), ncol = 2)
)

results_fixed <- tune_imp(
  obj,
  data.frame(k = 2),
  .f = "knn_imp",
  na_loc = na_positions,
  .progress = FALSE
)

# Custom imputation function
custom_fill <- function(obj, val = 0) {
  obj[is.na(obj)] <- val
  obj
}

tune_imp(
  obj,
  data.frame(val = c(0, 1)),
  .f = custom_fill,
  num_na = 10,
  .progress = FALSE
)


# Parallel tuning with mirai: start two background daemons
mirai::daemons(2)

parameters_custom <- data.frame(mean = c(0, 1), sd = c(1, 1))

custom_imp <- function(obj, mean, sd) {
  na_pos <- is.na(obj)
  obj[na_pos] <- stats::rnorm(sum(na_pos), mean = mean, sd = sd)
  obj
}

results_p <- tune_imp(
  obj,
  parameters_custom,
  .f = custom_imp,
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)

# Shut the daemons down when finished
mirai::daemons(0)

slideimp documentation built on June 17, 2026, 1:08 a.m.