| tune_imp | R Documentation |
Tune method-specific hyperparameters by repeatedly masking observed values, imputing them, and comparing the imputed values with the originals.
tune_imp(
  obj,
  parameters = NULL,
  .f,
  na_loc = NULL,
  num_na = NULL,
  n_reps = 1,
  n_cols = NULL,
  n_rows = 2,
  rowmax = 0.9,
  colmax = 0.9,
  na_col_subset = NULL,
  max_attempts = 100,
  .progress = TRUE,
  cores = 1,
  location = NULL,
  pin_blas = FALSE
)
| obj | A numeric matrix. |
| parameters | A |
| .f | One of |
| na_loc | Optional predefined missing-value locations. Accepted formats are a two-column integer matrix of row and column indices, a numeric vector of linear positions, or a list whose elements are either of those formats. |
| num_na | Integer or |
| n_reps | Integer. Number of independent repetitions. |
| n_cols | Integer or |
| n_rows | Integer. Target number of missing values to inject per selected column. |
| rowmax | Numeric scalar between |
| colmax | Numeric scalar between |
| na_col_subset | Optional integer or character vector restricting which columns are eligible for missing-value injection. |
| max_attempts | Integer. Maximum number of resampling attempts per repetition before giving up. |
| .progress | Logical. If |
| cores | Integer. Number of cores to use for K-NN and sliding-window K-NN imputation. For other methods, use |
| location | Numeric vector of column locations. Required when |
| pin_blas | Logical. If |
Built-in methods can be selected by passing .f = "knn_imp",
.f = "pca_imp", or .f = "slide_imp". A custom function can also be
supplied. Custom functions must accept obj as their first argument and
return a numeric matrix with the same dimensions as obj.
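As a minimal sketch of the custom-function contract described above, the base R code below defines a hypothetical imputer, col_mean_imp() (not part of the package), and runs a single mask, impute, compare round by hand. It only illustrates the idea; it is not how tune_imp() is implemented.

```r
# Hypothetical custom imputer: replaces each NA with its column mean.
# It takes the matrix as its first argument and returns a matrix of the
# same dimensions, which is the contract stated above.
col_mean_imp <- function(obj) {
  for (j in seq_len(ncol(obj))) {
    miss <- is.na(obj[, j])
    if (any(miss)) {
      obj[miss, j] <- mean(obj[, j], na.rm = TRUE)
    }
  }
  obj
}

set.seed(1)
x <- matrix(rnorm(40), nrow = 8, ncol = 5)

# One masking round: hide a few observed cells, impute, then compare.
masked_idx <- sample(length(x), 6)
x_masked <- x
x_masked[masked_idx] <- NA

x_imputed <- col_mean_imp(x_masked)

# Check the contract: numeric matrix with unchanged dimensions.
stopifnot(is.numeric(x_imputed), identical(dim(x_imputed), dim(x)))

# Pair the hidden originals with their imputed replacements.
comparison <- data.frame(
  truth = x[masked_idx],
  estimate = x_imputed[masked_idx]
)
rmse <- sqrt(mean((comparison$truth - comparison$estimate)^2))
```

A function written this way could then be passed as .f, with any extra arguments supplied as columns of parameters.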
When na_loc is supplied, num_na, n_cols, n_rows, and na_col_subset
are ignored.
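The accepted na_loc formats can be related to one another in base R. The sketch below assumes that linear positions follow R's usual column-major order; the matrix x and the chosen cells are illustrative only.

```r
x <- matrix(seq_len(20), nrow = 5, ncol = 4)

# Format 1: two-column integer matrix of (row, column) indices.
loc_rc <- cbind(row = c(1L, 3L, 5L), col = c(1L, 2L, 4L))

# Format 2: linear positions for the same cells
# (assumption: column-major order, as in base R indexing).
loc_lin <- (loc_rc[, "col"] - 1L) * nrow(x) + loc_rc[, "row"]

# Both formats select the same cells of x.
stopifnot(identical(x[loc_rc], x[loc_lin]))

# arrayInd() converts linear positions back to (row, column) pairs.
back <- arrayInd(loc_lin, .dim = dim(x))

# Format 3: a list whose elements use either format.
loc_list <- list(loc_rc, loc_lin)
```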
When .f is a character string, columns in parameters are validated
against the selected method:
"knn_imp" requires k.
"pca_imp" requires ncp.
"slide_imp" requires window_size, overlap_size, and min_window_n,
plus exactly one of k or ncp.
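The required columns listed above can be assembled with data.frame() or expand.grid(). The values below are placeholders chosen for illustration, not recommendations, and whether a given combination is valid for a particular dataset is an assumption.

```r
# K-NN: one required column, k.
params_knn <- data.frame(k = c(5, 10, 20))

# PCA: one required column, ncp.
params_pca <- data.frame(ncp = c(2, 4))

# Sliding window: window settings plus exactly one of k or ncp (here k).
# expand.grid() builds every combination of the supplied values.
params_slide <- expand.grid(
  window_size = c(50, 100),
  overlap_size = 10,
  min_window_n = 20,
  k = c(5, 10)
)
```

Each row of such a data frame is one parameter combination to evaluate.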
To tune parameters for grouped imputation, tune knn_imp() or pca_imp()
on representative groups, then pass the selected parameters to group_imp().
The top-level rowmax and colmax arguments control random missing-value
injection performed by sample_na_loc(). To tune or pass an imputation
method's own colmax argument, include a colmax column in parameters.
Tuning results can be summarized with compute_metrics() or evaluated with
external packages such as yardstick.
A data frame of class slideimp_tune containing:
columns originally provided in parameters;
param_set, an integer ID for each unique parameter combination;
rep_id, an integer repetition index;
result, a list-column where each element is a data frame containing
truth and estimate columns;
error, a character column containing the error message if the
iteration failed, otherwise NA.
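To show how the result list-column can be summarized by hand, the sketch below builds a small mock data frame with the columns described above and computes RMSE and MAE per row. The mock values and the helpers rmse() and mae() are made up for illustration; compute_metrics() may report different metrics or a different layout.

```r
# Mock object imitating the documented columns; all values are made up.
mock <- data.frame(k = c(2, 4), param_set = 1:2, rep_id = c(1L, 1L))
mock$result <- list(
  data.frame(truth = c(1, 2, 3), estimate = c(1.5, 2, 2.5)),
  data.frame(truth = c(1, 2, 3), estimate = c(1, 2, 4))
)
mock$error <- NA_character_

# Hypothetical helper metrics on one truth/estimate data frame.
rmse <- function(df) sqrt(mean((df$truth - df$estimate)^2))
mae <- function(df) mean(abs(df$truth - df$estimate))

# Skip iterations that failed (non-NA error), then summarize the rest.
ok <- is.na(mock$error)
summary_tbl <- data.frame(
  param_set = mock$param_set[ok],
  rmse = vapply(mock$result[ok], rmse, numeric(1)),
  mae = vapply(mock$result[ok], mae, numeric(1))
)
```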
K-NN: use the cores argument. If mirai daemons are active, cores
is automatically set to 1 to avoid nested parallelism.
PCA: use mirai::daemons() instead of cores.
When running PCA imputation in parallel with mirai, set pin_blas = TRUE
in tune_imp() or group_imp() to prevent BLAS threads from
oversubscribing CPU cores. This relies on RhpcBLASctl and works with
OpenBLAS and MKL (typical on Linux, and on Windows after an OpenBLAS swap).
pin_blas = TRUE may have no effect on macOS.
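For readers who want to see what limiting BLAS threads looks like, the sketch below calls RhpcBLASctl directly, guarded so it is skipped when the package is not installed. This is an assumption about the general technique and not necessarily what pin_blas = TRUE does internally; the variable blas_pinned is hypothetical.

```r
# Limit BLAS and OpenMP to one thread in the current R process,
# if RhpcBLASctl is available.
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(1)
  RhpcBLASctl::omp_set_num_threads(1)
  blas_pinned <- TRUE
} else {
  # Package not installed: leave the thread settings untouched.
  blas_pinned <- FALSE
}
```

With one BLAS thread per worker, the total number of threads stays close to the number of parallel workers.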
Speed comes from three levers: solver (through LOBPCG with warm-start),
threshold, and scale. Tune these first, then tune the accuracy parameters
(ncp, coeff.ridge) on a representative subset.
Exact vs. LOBPCG with warm-start. Whether "lobpcg" beats "exact"
depends on size and low-rankness: "lobpcg" is preferred for large, approximately
low-rank matrices with small ncp, and "exact" for small matrices
(including slide_imp() windows), where it is faster and more robust.
Separately, the warm-start makes each successive solve cheap: pca_imp()
warm-starts LOBPCG with the previous eigenblock and search direction, so once
imputed values stabilize, later solves converge in a few iterations. The
payoff therefore grows with the number of EM iterations, independent of
low-rankness. solver = "auto" (the default) probes both and is a safe starting point.
Threshold. The default 1e-6 is conservative; 1e-5 is often faster
with very similar values.
Scale. For columns on a common scale (e.g., DNAm beta values in
[0, 1]), scale = FALSE can be faster and more accurate.
Parallel and BLAS. When running in parallel via tune_imp() or group_imp()
with a multithreaded BLAS, set pin_blas = TRUE to avoid thread oversubscription.
On Windows, the stock BLAS can be slow; advanced users can swap in
OpenBLAS.
See Speeding up PCA imputation for the full workflow.
set.seed(123)
# Simulate some data
obj <- sim_mat(10, 50)$input
# Tune K-NN imputation with random missing-value injection.
# Use larger `num_na` and `n_reps` values for real analyses.
params_knn <- data.frame(k = c(2, 4))
results <- tune_imp(
  obj,
  params_knn,
  .f = "knn_imp",
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)
compute_metrics(results)
# Tune with fixed missing-value positions
na_positions <- list(
  matrix(c(1, 2, 3, 1, 1, 1), ncol = 2),
  matrix(c(2, 3, 4, 2, 2, 2), ncol = 2)
)
results_fixed <- tune_imp(
  obj,
  data.frame(k = 2),
  .f = "knn_imp",
  na_loc = na_positions,
  .progress = FALSE
)
# Custom imputation function
custom_fill <- function(obj, val = 0) {
  obj[is.na(obj)] <- val
  obj
}
tune_imp(
  obj,
  data.frame(val = c(0, 1)),
  .f = custom_fill,
  num_na = 10,
  .progress = FALSE
)
# Parallel tuning with mirai
mirai::daemons(2)
parameters_custom <- data.frame(mean = c(0, 1), sd = c(1, 1))
custom_imp <- function(obj, mean, sd) {
  na_pos <- is.na(obj)
  obj[na_pos] <- stats::rnorm(sum(na_pos), mean = mean, sd = sd)
  obj
}
results_p <- tune_imp(
  obj,
  parameters_custom,
  .f = custom_imp,
  n_reps = 1,
  num_na = 10,
  .progress = FALSE
)
mirai::daemons(0)