| slide_imp | R Documentation |
Perform sliding-window K-NN or PCA imputation on a numeric matrix whose
columns are meaningfully ordered. Not intended for Illumina DNA
methylation microarrays; use group_imp() instead.
slide_imp(
  obj,
  location,
  window_size,
  overlap_size = 0,
  flank = FALSE,
  min_window_n,
  subset = NULL,
  dry_run = FALSE,
  k = NULL,
  cores = 1,
  dist_pow = 0,
  ncp = NULL,
  scale = TRUE,
  coeff.ridge = 1,
  threshold = 1e-06,
  seed = NULL,
  row.w = NULL,
  nb.init = 1,
  maxiter = 1000,
  miniter = 5,
  solver = c("auto", "exact", "lobpcg"),
  lobpcg_control = NULL,
  clamp = NULL,
  method = NULL,
  .progress = TRUE,
  colmax = 0.9,
  post_imp = TRUE,
  na_check = TRUE,
  on_infeasible = c("skip", "error", "mean")
)
obj |
A numeric matrix with samples in rows and features in columns. |
location |
A sorted numeric vector of length |
window_size |
Numeric. Window width in the same units as |
overlap_size |
Numeric. Overlap between consecutive windows in the
same units as |
flank |
Logical. If |
min_window_n |
Integer. Minimum number of columns a window must contain
to be considered for imputation. For non-dry runs, the selected |
subset |
Optional character or integer vector specifying columns to
impute. If |
dry_run |
Logical. If |
k |
Integer. Number of nearest neighbors to use for K-NN imputation. |
cores |
Integer. Number of cores to use for K-NN imputation. |
dist_pow |
Numeric. Power used to penalize more distant neighbors in
the weighted average. |
ncp |
Integer. Number of principal components used to predict missing entries. |
scale |
Logical. If |
coeff.ridge |
Numeric. Ridge regularization, used only when
|
threshold |
Numeric. Convergence threshold. |
seed |
Integer, numeric, or |
row.w |
Row weights, normalized to sum to |
nb.init |
Integer. Number of random initializations. The first initialization is always mean imputation. |
maxiter |
Integer. Maximum number of iterations. |
miniter |
Integer. Minimum number of iterations. |
solver |
Character. Eigensolver: |
lobpcg_control |
A list of LOBPCG eigensolver control options, usually
created by |
clamp |
Optional numeric vector |
method |
Character or |
.progress |
Logical. If |
colmax |
Numeric scalar between |
post_imp |
Logical. If |
na_check |
Logical. If |
on_infeasible |
Character. One of |
The sliding-window approach divides the input matrix into smaller segments
based on location values and applies imputation to each window
independently. Values in overlapping regions are averaged across windows to
produce the final imputed result.
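The averaging step can be pictured with a small base R sketch. It is only an illustration under stated assumptions, not the package's implementation: the two windows are hypothetical column-index ranges, and `impute_window()` is a stand-in imputer (window-wide mean) chosen so that the two windows disagree on the overlapping cell.

```r
# Toy matrix: 4 samples (rows) x 6 ordered features (columns)
set.seed(1)
x <- matrix(rnorm(24), nrow = 4)
x[2, 3] <- NA  # lies in the overlap of both windows
x[4, 6] <- NA  # covered by the second window only

# Two hypothetical windows (column indices) overlapping on columns 3 and 4
windows <- list(1:4, 3:6)

# Stand-in imputer (assumption): fill NAs with the window-wide observed mean
impute_window <- function(m) {
  m[is.na(m)] <- mean(m, na.rm = TRUE)
  m
}

total <- matrix(0, nrow(x), ncol(x))  # running sum of per-window results
hits  <- matrix(0, nrow(x), ncol(x))  # number of windows covering each cell
for (cols in windows) {
  total[, cols] <- total[, cols] + impute_window(x[, cols])
  hits[, cols]  <- hits[, cols] + 1
}

# Average the per-window results; observed entries are restored as a guard
result <- total / hits
result[!is.na(x)] <- x[!is.na(x)]
result
```

In this sketch, `result[2, 3]` is the mean of the two window-level estimates, while `result[4, 6]` comes from the single window that covers it.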
Two windowing modes are supported:
flank = FALSE: greedily partition location into windows of width
window_size with the requested overlap_size between consecutive
windows.
flank = TRUE: create one window per feature in subset, centered on
that feature using the supplied window_size.
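The two modes can be sketched in base R as follows. Both helpers (`greedy_windows()` and `flank_windows()`) are hypothetical and written only for illustration. The interval conventions are assumptions: greedy windows are taken as closed intervals `[from, from + window_size]`, with the next window starting `overlap_size` before the previous end, and flank windows are taken to extend `window_size` on each side of the target. The package may define these extents differently, so a call with `dry_run = TRUE` remains the way to see the windows it actually builds.

```r
# Hypothetical helper: greedy fixed-width windows over sorted locations.
greedy_windows <- function(location, window_size, overlap_size = 0) {
  stopifnot(!is.unsorted(location), overlap_size < window_size)
  last <- location[length(location)]
  starts <- numeric(0)
  start <- location[1]
  repeat {
    starts <- c(starts, start)
    end <- start + window_size
    if (end >= last) break          # the final window reaches the last feature
    start <- end - overlap_size     # step back to create the overlap
  }
  from <- starts
  to <- pmin(starts + window_size, last)
  n <- vapply(
    seq_along(from),
    function(i) sum(location >= from[i] & location <= to[i]),
    integer(1)
  )
  data.frame(from = from, to = to, n = n)
}

# Hypothetical helper: one window per target column, centered on its location.
# Assumption: the window spans window_size on each side of the target.
flank_windows <- function(location, targets, window_size) {
  center <- location[targets]
  data.frame(
    target = targets,
    from = center - window_size,
    to = center + window_size
  )
}

location <- 1:100
greedy_windows(location, window_size = 50, overlap_size = 10)
flank_windows(location, targets = c(10, 30, 70), window_size = 30)
```

Under these assumptions, the greedy call yields three windows covering locations 1 to 51, 41 to 91, and 81 to 100.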
Specify k and related arguments to use knn_imp(), or ncp and related
arguments to use pca_imp().
If dry_run = FALSE, a numeric matrix of the same dimensions as
obj, with missing values imputed. The returned object has class
slideimp_results.
If dry_run = TRUE, a data frame of class slideimp_tbl with columns
start, end, and window_n, plus subset_local and, when
flank = TRUE, target.
Speed comes from three levers: solver (through LOBPCG with warm-start),
threshold, and scale. Tune these first, then accuracy parameters
(ncp, coeff.ridge) on a representative subset.
Exact vs. LOBPCG with warm-start. Whether "lobpcg" beats "exact"
depends on size and low-rankness: "lobpcg" is preferred for large, approximately
low-rank matrices with small ncp, and "exact" for small matrices
(including slide_imp() windows), where it is faster and more robust.
Separately, the warm-start makes each successive solve cheap: pca_imp()
warm-starts LOBPCG with the previous eigenblock and search direction, so once
imputed values stabilize, later solves converge in a few iterations. The
payoff therefore grows with the number of EM iterations, independent of
low-rankness. solver = "auto" (default) probes both and is a safe start.
Threshold. The default 1e-6 is conservative; 1e-5 is often faster
with very similar values.
Scale. For columns on a common scale (e.g., DNAm beta values in
[0, 1]), scale = FALSE can be faster and more accurate.
Parallel and BLAS. When running in parallel via tune_imp() or group_imp()
with a multithreaded BLAS, set pin_blas = TRUE to avoid thread oversubscription.
On Windows, the stock BLAS can be slow. Advanced users can swap in
OpenBLAS.
See Speeding up PCA imputation for the full workflow.
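To make the threshold trade-off concrete, here is a minimal base R sketch of an EM-style PCA imputation loop. It is not pca_imp() itself: the function name `em_pca_impute()`, the absence of ridge regularization and scaling, and the stopping rule (relative squared change of the imputed entries) are all assumptions made for illustration.

```r
# Minimal EM-style PCA imputation; illustrative only, not pca_imp() itself.
em_pca_impute <- function(x, ncp = 2, threshold = 1e-6, maxiter = 1000) {
  miss <- is.na(x)
  filled <- x
  # Initialize missing cells with their column means
  filled[miss] <- colMeans(x, na.rm = TRUE)[col(x)[miss]]
  for (iter in seq_len(maxiter)) {
    mu <- colMeans(filled)
    centered <- sweep(filled, 2, mu)
    s <- svd(centered, nu = ncp, nv = ncp)
    # Rank-ncp reconstruction: U diag(d) V', then add the column means back
    fitted <- s$u %*% (s$d[seq_len(ncp)] * t(s$v))
    fitted <- sweep(fitted, 2, mu, "+")
    old <- filled[miss]
    filled[miss] <- fitted[miss]
    # Assumed stopping rule: relative squared change of the imputed entries
    change <- sum((filled[miss] - old)^2) / max(sum(old^2), .Machine$double.eps)
    if (change < threshold) break
  }
  list(imputed = filled, iterations = iter)
}

# Approximately rank-2 toy data with 10% of entries removed
set.seed(42)
low_rank <- matrix(rnorm(30 * 2), 30) %*% matrix(rnorm(2 * 12), 2)
x <- low_rank + matrix(rnorm(30 * 12, sd = 0.1), 30)
x[sample(length(x), 36)] <- NA

tight <- em_pca_impute(x, threshold = 1e-6)
loose <- em_pca_impute(x, threshold = 1e-5)

# Iteration counts and the largest difference between the two results
c(tight = tight$iterations, loose = loose$iterations)
max(abs(tight$imputed - loose$imputed))
```

Because both runs follow the same sequence of iterates, the looser threshold can only stop at the same iteration or earlier; comparing the two imputed matrices shows how much the extra iterations change the values on a given data set.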
set.seed(1234)
# Example data with 20 samples and 100 ordered columns
beta_matrix <- sim_mat(20, 100)$input
location <- 1:100
# First perform a dry run to inspect the calculated windows
window_statistics <- slide_imp(
  beta_matrix,
  location = location,
  window_size = 50,
  overlap_size = 10,
  min_window_n = 10,
  dry_run = TRUE,
  .progress = FALSE
)
window_statistics
# Sliding-window K-NN imputation
imputed_knn <- slide_imp(
  beta_matrix,
  location = location,
  k = 5,
  window_size = 50,
  overlap_size = 10,
  min_window_n = 10,
  .progress = FALSE
)
imputed_knn
# Sliding-window PCA imputation
imputed_pca <- slide_imp(
  beta_matrix,
  location = location,
  ncp = 2,
  window_size = 50,
  overlap_size = 10,
  min_window_n = 10,
  .progress = FALSE
)
imputed_pca
# K-NN imputation with flanking windows
imputed_flank <- slide_imp(
  beta_matrix,
  location = location,
  k = 2,
  window_size = 30,
  flank = TRUE,
  subset = c(10, 30, 70),
  min_window_n = 5,
  .progress = FALSE
)
imputed_flank