| group_imp | R Documentation |
Perform K-NN or PCA imputation independently within feature groups.
group_imp(
  obj,
  group,
  subset = NULL,
  allow_unmapped = FALSE,
  k = NULL,
  ncp = NULL,
  method = NULL,
  cores = 1,
  .progress = TRUE,
  min_group_size = NULL,
  colmax = NULL,
  post_imp = NULL,
  dist_pow = NULL,
  scale = NULL,
  coeff.ridge = NULL,
  threshold = NULL,
  row.w = NULL,
  seed = NULL,
  nb.init = NULL,
  maxiter = NULL,
  miniter = NULL,
  solver = NULL,
  lobpcg_control = NULL,
  clamp = NULL,
  pin_blas = FALSE,
  na_check = TRUE,
  on_infeasible = c("error", "skip", "mean")
)
obj |
A numeric matrix with samples in rows and features in columns. |
group |
Specification of how features should be grouped for imputation. Accepted formats are:
|
subset |
Optional character vector of feature names to impute. If
|
allow_unmapped |
Logical. If |
k |
Integer or |
ncp |
Integer or |
method |
Character or |
cores |
Integer. Number of cores for K-NN imputation only. For PCA
imputation, use |
.progress |
Logical. If |
min_group_size |
Integer or |
colmax |
Numeric scalar between |
post_imp |
Logical. If |
dist_pow |
Numeric. Power used to penalize more distant neighbors in
the weighted average. |
scale |
Logical. If |
coeff.ridge |
Numeric. Ridge regularization, used only when
|
threshold |
Numeric. Convergence threshold. |
row.w |
Row weights, normalized to sum to |
seed |
Integer, numeric, or |
nb.init |
Integer. Number of random initializations. The first initialization is always mean imputation. |
maxiter |
Integer. Maximum number of iterations. |
miniter |
Integer. Minimum number of iterations. |
solver |
Character. Eigensolver: |
lobpcg_control |
A list of LOBPCG eigensolver control options, usually
created by |
clamp |
Optional numeric vector |
pin_blas |
Logical. If |
na_check |
Logical. If |
on_infeasible |
Character. One of |
group_imp() performs K-NN or PCA imputation on feature groups
independently, which can substantially reduce runtime for large matrices.
Specify k and related arguments to use K-NN imputation, or ncp and
related arguments to use PCA imputation. If both k and ncp are NULL,
group$parameters must supply either k or ncp for every group.
Group-specific parameters in group$parameters take priority over global
arguments. Global arguments fill in any missing values. All groups must use
the same imputation method.
For method-specific arguments inherited from knn_imp() or pca_imp(),
NULL means the backend default is used unless overridden by
group$parameters.
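The precedence described above can be pictured with a small base-R sketch. It only illustrates the documented ordering (backend defaults, then global arguments, then group$parameters) and is not the package's implementation; the helper resolve_params() and the default values shown are hypothetical.

```r
# Hypothetical helper: illustrates the documented precedence only,
# not the internals of group_imp().
resolve_params <- function(group_params, global_args, backend_defaults) {
  # Drop NULL entries so that NULL means "not supplied".
  drop_null <- function(x) Filter(Negate(is.null), x)
  # Later lists override earlier ones: defaults < global < per-group.
  out <- modifyList(backend_defaults, drop_null(global_args))
  modifyList(out, drop_null(group_params))
}

backend_defaults <- list(k = 10, dist_pow = 0)    # made-up defaults
global_args      <- list(k = 5, dist_pow = NULL)  # dist_pow left at NULL
group_params     <- list(k = 3)                   # per-group override

resolved <- resolve_params(group_params, global_args, backend_defaults)
str(resolved)
```

Here the per-group k wins over the global k, and dist_pow falls back to the (made-up) backend default because it was left as NULL.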
Per-group k is capped at the number of usable columns in the group minus
one. Per-group ncp is capped at the maximum feasible number of PCA
components for that group's submatrix. A warning is issued when capping
occurs.
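As a rough illustration of the capping rule for k, the following base-R sketch applies the stated limit (usable columns minus one) and warns when it reduces k. The helper cap_k() is hypothetical and does not reproduce how group_imp() counts usable columns or how it caps ncp.

```r
# Hypothetical helper mirroring the documented rule for k:
# cap at (usable columns in the group - 1) and warn when capping occurs.
cap_k <- function(k, n_usable_cols) {
  k_max <- n_usable_cols - 1L
  if (k > k_max) {
    warning(sprintf("k = %d capped at %d", k, k_max), call. = FALSE)
    k <- k_max
  }
  k
}

cap_k(2L, 10L)                    # within the limit, returned unchanged
suppressWarnings(cap_k(10L, 4L))  # capped at 3
```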
A numeric matrix of the same dimensions as obj, with missing
values imputed. The returned object has class slideimp_results.
K-NN: use the cores argument. If mirai daemons are active, cores
is automatically set to 1 to avoid nested parallelism.
PCA: use mirai::daemons() instead of cores.
When running PCA imputation in parallel with mirai, set pin_blas = TRUE
in tune_imp() or group_imp() to prevent BLAS threads from
oversubscribing CPU cores. This relies on RhpcBLASctl and works with
OpenBLAS and MKL (typical on Linux, and on Windows after an OpenBLAS swap).
pin_blas = TRUE may have no effect on macOS.
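For readers who want to see what limiting BLAS threads looks like by hand, here is a minimal sketch using RhpcBLASctl directly. It assumes that pinning amounts to restricting BLAS to one thread per worker; the code path actually used by pin_blas = TRUE is not shown in this page and may differ.

```r
# Assumption: pinning means limiting BLAS to one thread per worker.
# This is a manual sketch, not the code path used by pin_blas = TRUE.
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  # Number of processors the BLAS library reports as available.
  n_procs <- RhpcBLASctl::blas_get_num_procs()
  RhpcBLASctl::blas_set_num_threads(1)
  # ... run BLAS-heavy work here ...
  # Allow BLAS to use all reported processors again.
  RhpcBLASctl::blas_set_num_threads(n_procs)
}
```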
Speed comes from three levers: solver (through LOBPCG with warm-start),
threshold, and scale. Tune these first, then accuracy parameters
(ncp, coeff.ridge) on a representative subset.
Exact vs. LOBPCG with warm-start. Whether "lobpcg" beats "exact"
depends on size and low-rankness: "lobpcg" is preferred for large, approximately
low-rank matrices with small ncp, and "exact" for small matrices
(including slide_imp() windows), where it is faster and more robust.
Separately, the warm-start makes each successive solve cheap: pca_imp()
warm-starts LOBPCG with the previous eigenblock and search direction, so once
imputed values stabilize, later solves converge in a few iterations. The
payoff therefore grows with the number of EM iterations, independent of
low-rankness. solver = "auto" (default) probes both and is a safe start.
Threshold. The default 1e-6 is conservative; 1e-5 often converges faster
and yields very similar imputed values.
Scale. For columns on a common scale (e.g., DNAm beta values in
[0, 1]), scale = FALSE can be faster and more accurate.
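The threshold and scale levers can be tried together as in the sketch below. It reuses the simulated data from the examples on this page and assumes the package is installed under the name slideimp; whether these settings are faster or more accurate on real data should be checked on a representative subset.

```r
# Assumption: the package providing group_imp() and sim_mat() is "slideimp".
if (requireNamespace("slideimp", quietly = TRUE)) {
  set.seed(1234)
  to_test <- slideimp::sim_mat(10, 20, perc_total_na = 0.05, perc_col_na = 1)
  obj <- to_test$input
  group <- to_test$col_group

  # Looser convergence threshold and no column scaling.
  pca_fast <- slideimp::group_imp(
    obj,
    group = group,
    ncp = 2,
    threshold = 1e-5,
    scale = FALSE,
    .progress = FALSE
  )
}
```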
Parallel and BLAS. In parallel via tune_imp() or group_imp() with a
multithreaded BLAS, set pin_blas = TRUE to avoid thread oversubscription.
On Windows, the stock BLAS can be slow. Advanced users can swap in
OpenBLAS.
See "Speeding up PCA imputation" for the full workflow.
A character scalar can be passed to group to name a supported Illumina
platform, such as "EPICv2" or "EPICv2_deduped". This requires the
optional slideimp.extra package to be installed. Supported platforms are
listed in the slideimp_arrays object in the slideimp.extra package.
prep_groups(), knn_imp(), pca_imp()
set.seed(1234)
to_test <- sim_mat(10, 20, perc_total_na = 0.05, perc_col_na = 1)
obj <- to_test$input
group <- to_test$col_group
head(group)
# Simple grouped K-NN imputation
results <- group_imp(obj, group = group, k = 2, .progress = FALSE)
results
# Impute only a subset of features
subset_features <- sample(to_test$col_group$feature, size = 10)
knn_subset <- group_imp(
  obj,
  group = group,
  subset = subset_features,
  k = 2,
  .progress = FALSE
)
# Use prep_groups() to inspect and edit per-group parameters
prepped <- prep_groups(colnames(obj), group)
prepped$parameters <- lapply(seq_len(nrow(prepped)), function(i) list(k = 2))
prepped$parameters[[2]]$k <- 4
knn_grouped <- group_imp(obj, group = prepped, .progress = FALSE)
# PCA imputation with mirai parallelism
# Start two mirai daemons before calling group_imp()
mirai::daemons(2)
pca_grouped <- group_imp(obj, group = group, ncp = 2)
# Shut the daemons down when finished
mirai::daemons(0)
pca_grouped