slideimp"
In slideimp: Numeric Matrices K-NN and PCA Imputation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(slideimp)
set.seed(1234)

Comparing Methods Using a Shared Cross-Validation Split with `tune_imp()`

Simulate some data

# 20 rows, 1000 columns, all columns have at least some NA
sim_obj <- sim_mat(n = 20, p = 1000, perc_col_na = 1)
obj <- sim_obj$input
obj[1:4, 1:4]

Instead of random NA sampling inside tune_imp(), which internally calls sample_na_loc(), we can generate the NA locations up front and pass them to tune_imp().
Here, we randomly select 5 rows from each of 200 random columns (features) for 5 repetitions using sample_na_loc(). To specify just certain columns (i.e., clock CpGs), provide the na_col_subset argument.
na_loc has 5 elements (5 repeats) where each row stores the row and column index of a missing value.

na_loc <- sample_na_loc(obj, n_cols = 200, n_rows = 5, n_reps = 5)
length(na_loc)
na_loc[[1]][1:6, ]

Then we can compare 1) PCA, 2) K-NN imputation, and 3) a custom method, since the cross-validation missing values are the same for all methods.
Note: The custom function requires the first argument to be obj, must return an object of the same dimensions, and all subsequent arguments must match the column names of the parameters data.frame.

# This custom function imputes missing values with random normal values and takes
# `mean` and `sd` as params
rnorm_imp <- function(obj, mean, sd) {
  na <- is.na(obj)
  obj[na] <- rnorm(sum(na), mean = mean, sd = sd) # <- impute values with rnorm
  return(obj) # <- return an imputed object with the same dim as obj
}

pca_tune <- tune_imp(
  obj,
  .f = "pca_imp",
  na_loc = na_loc,
  parameters = data.frame(ncp = 10)
)

knn_tune <- tune_imp(
  obj,
  .f = "knn_imp",
  na_loc = na_loc,
  parameters = data.frame(k = 10)
)

rnorm_tune <- tune_imp(
  obj,
  .f = rnorm_imp,
  na_loc = na_loc,
  parameters = data.frame(mean = 0, sd = 1) # must match with arguments of `rnorm_imp`
)

Pick the method with the lowest cross-validation error:

mean(compute_metrics(pca_tune, metrics = "rmse")$.estimate)
mean(compute_metrics(knn_tune, metrics = "rmse")$.estimate)
mean(compute_metrics(rnorm_tune, metrics = "rmse")$.estimate)

Group-Wise Imputation with Small-Group Padding and Group-Wise Parameters with `group_imp()`

group_imp() allows imputation to be performed separately within defined groups (e.g., by chromosome), which significantly reduces run time and can increase accuracy for both K-NN and PCA imputation.
group_imp() requires the group argument, which maps colnames(obj) to groups. This can be created up front with prep_groups() for advanced features such as group-wise parameters and padding of small groups with random features from other groups. prep_groups() returns a list-column data.frame with:
features: required - a list-column where each element is a character vector of variable names to be imputed together.
aux: optional - auxiliary variables to include in each group. These are only used to augment the imputation quality of features and are not imputed themselves. If one group is too small (e.g., chrM), aux is used to pad the group by randomly drawing samples from other groups to meet min_group_size.
parameters: optional - group-specific imputation parameters.
First we simulate data from 2 groups. We then create group3 with only 1 feature to show how min_group_size pads it using the aux list column.

sim_obj <- sim_mat(n = 20, p = 50, n_col_groups = 2)

# Matrix to be imputed
obj <- sim_obj$input
obj[1:5, 1:4]

# Metadata, i.e., which features belong to which group
meta <- sim_obj$col_group
meta[1:5, ]

# We put feature 1 in `group3`
meta[1, 2] <- "group3"
meta[1:5, ]

Then we generate the group parameter for group_imp() up front. We can see that group3 has been padded to have 10 columns.
For each group, we also assign a different number of nearest neighbors as a demonstration.

set.seed(1234)
group_imp_df <- prep_groups(colnames(obj), group = meta, min_group_size = 10)
group_imp_df$parameters <- list(list(k = 3), list(k = 4), list(k = 5))
group_imp_df

Finally, we impute obj using the modified group_imp_df. The k = 10 passed to group_imp() is ignored since all groups have group-wise k specified.

knn_results <- group_imp(obj, group = group_imp_df, cores = 4, k = 10)
print(knn_results, p = 4)

Sliding Window Imputation for WGBS/EM-seq Data with `slide_imp()`

Select `window_size`, `overlap_size`, and PCA/K-NN Parameters

We simulate the output of the {methylKit} package.
Note: WGBS/EM-seq data should be grouped by chromosome before performing sliding window imputation.
Here, we simulate 1000 sites.
Importantly, we simulate the location vector of each feature (genomic position) with varying spacing of 50 to 500 bp apart from each other.
The locations vector contains the genomic position of each feature (column). It is used to determine which columns are grouped together given a window size.

set.seed(1234)
sample_names <- paste0("S", 1:10)
n_sites <- 1000

# Simulate positions with 50–500 bp between each site
distances_between <- sample(50:500, size = n_sites, replace = TRUE)
locations <- cumsum(distances_between) # <- important, location vector

methyl <- data.frame(
  chr = "chr1",
  start = locations,
  end = locations,
  strand = "+"
)

for (i in seq_along(sample_names)) {
  methyl[[paste0("numCs", i)]] <- sample.int(100, size = n_sites, replace = TRUE)
  methyl[[paste0("numTs", i)]] <- sample.int(100, size = n_sites, replace = TRUE)
  methyl[[paste0("coverage", i)]] <- methyl[[paste0("numCs", i)]] + methyl[[paste0("numTs", i)]]
}

methyl[1:5, 1:10]

First, we convert the data into a beta matrix with features in the columns.

numCs_matrix <- as.matrix(methyl[, paste0("numCs", seq_along(sample_names))])
cov_matrix <- as.matrix(methyl[, paste0("coverage", seq_along(sample_names))])
beta_matrix <- numCs_matrix / cov_matrix

colnames(beta_matrix) <- sample_names
rownames(beta_matrix) <- methyl$start

beta_matrix <- t(beta_matrix)
# Set 10% of the data to missing
set.seed(1234)
beta_matrix[sample.int(length(beta_matrix), floor(length(beta_matrix) * 0.1))] <- NA
beta_matrix[1:4, 1:4]

Then, in a real dataset, we would tune hyperparameters using chr22. Here, as a demonstration, we use the whole data since the size is small.
We use 2 repetitions of cross-validation (increase to 10–30 in real analyses). We are selecting between:
ncp (number of principal components) of 2 or 4, indicating that we are performing sliding PCA imputation. Pass k for sliding K-NN imputation.
window_size of 5,000 or 10,000 bp.
overlap_size fixed at 1,000 bp (does not affect results much in real analyses).

params <- expand.grid(ncp = c(2, 4), window_size = c(5000, 10000))
params$overlap_size <- 1000
params$min_window_n <- 20 # windows with less than 20 columns are dropped

# Increase n_reps from 2 in actual analyses and use another chromosome (i.e., chr22)
tune_slide_pca <- tune_imp(
  obj = beta_matrix,
  parameters = params,
  .f = "slide_imp",
  n_reps = 2,
  location = locations
)

metrics <- compute_metrics(tune_slide_pca)

aggregate(.estimate ~ .metric + ncp + window_size, data = metrics, FUN = mean)

Finally, we can use slide_imp() to impute the full beta_matrix. Use the best parameter combination from the cross-validation metrics.
First, we use dry_run = TRUE to examine the columns to be imputed.
start and end are window location vectors.
window_n is the number of features included in the window.

slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 5000,
  overlap_size = 1000,
  ncp = 2,
  min_window_n = 20,
  dry_run = TRUE # <- dry_run to inspect the windows
)

Turn off dry_run to impute the data

slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 5000,
  overlap_size = 1000,
  ncp = 2,
  min_window_n = 20,
  dry_run = FALSE,
  .progress = FALSE
)

Subset and Flanking Mode

Use this mode when you only need to impute specific target features (e.g., clock CpGs) rather than the entire dataset.
Pass the desired feature names to the subset argument. Only windows containing these features will be imputed.
Set flank = TRUE to build windows centered on each feature in the subset. Each window will extend window_size bp on either side of the target feature (flanking mode).
In this mode, the overlap_size argument is ignored.
In this example, we only want to impute the features "1323" and "33810" by creating 5,000 bp flanking windows around each feature:

slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 5000,
  ncp = 2,
  min_window_n = 20,
  subset = c("1323", "33810"),
  flank = TRUE,
  dry_run = TRUE
)