sample_na_loc: Sample Missing Value Locations with Constraints

View source: R/tune_imp.R

sample_na_loc    R Documentation

Description

Samples indices for NA injection into a matrix while maintaining row/column missing value budgets and avoiding zero-variance columns.

Usage

sample_na_loc(
  obj,
  n_cols = NULL,
  n_rows = 2L,
  num_na = NULL,
  n_reps = 1L,
  rowmax = 0.9,
  colmax = 0.9,
  na_col_subset = NULL,
  max_attempts = 100
)

Arguments

obj

A numeric matrix with samples in rows and features in columns.

n_cols

Integer. The number of columns that receive injected NA values per repetition. Required when num_na is NULL. Ignored when num_na is supplied, in which case n_cols is derived as num_na %/% n_rows. Ignored in tune_imp() when na_loc is supplied.

n_rows

Integer. The target number of NA values per column (default 2L).

  • When num_na is supplied: used as the base size. Most columns receive exactly n_rows missing values; num_na %% n_rows columns receive one extra. If there is only one column, it receives the entire remainder.

  • When num_na is NULL: every selected column receives exactly n_rows NA values.

Ignored in tune_imp() when na_loc is supplied.

num_na

Integer. Total number of missing values to inject per repetition. If supplied, n_cols is computed automatically and missing values are distributed as evenly as possible, using n_rows as the base size (num_na must be at least n_rows). If omitted but n_cols is supplied, the total injected is n_cols * n_rows. If num_na, n_cols, and na_loc are all NULL, tune_imp() defaults to roughly 5% of total cells, capped at 500. sample_na_loc() has no such default, so either num_na or n_cols must be supplied. Ignored in tune_imp() when na_loc is supplied.
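
The arithmetic described for num_na and n_rows can be sketched in base R. This is an illustration only, not the package's code: split_na_budget() is a hypothetical helper, it assumes the remainder does not exceed the number of derived columns, and giving the extra values to the first columns is an arbitrary choice.

```r
# Hypothetical helper: per-column NA counts implied by num_na and n_rows.
split_na_budget <- function(num_na, n_rows = 2L) {
  num_na <- as.integer(num_na)
  n_rows <- as.integer(n_rows)
  stopifnot(num_na >= n_rows)          # documented requirement
  n_cols <- num_na %/% n_rows          # derived number of columns
  extra  <- num_na %% n_rows           # values left after the base size
  sizes  <- rep(n_rows, n_cols)
  if (n_cols == 1L) {
    sizes <- sizes + extra             # a single column takes the whole remainder
  } else {
    stopifnot(extra <= n_cols)         # assumption made by this sketch
    idx <- seq_len(extra)
    sizes[idx] <- sizes[idx] + 1L      # one extra NA for `extra` columns
  }
  sizes
}

split_na_budget(7, n_rows = 2)  # 3 2 2: three columns, one of them gets an extra NA
split_na_budget(3, n_rows = 2)  # 3: a single column receives everything
```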

n_reps

Integer. Number of repetitions for random NA injection (default 1).

rowmax, colmax

Numbers between 0 and 1 (default 0.9 each). NA injection will not produce a row or column whose proportion of missing values exceeds the corresponding threshold.
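
To make the thresholds concrete, the following base R sketch checks whether a proposed set of locations would stay within both budgets. within_budget() is a hypothetical helper written for this illustration. It assumes the proportions count missing values already present in obj and that a proportion exactly equal to the threshold is allowed.

```r
# Hypothetical check: would masking `loc` keep every row and column
# at or below the rowmax / colmax proportions of missing values?
within_budget <- function(obj, loc, rowmax = 0.9, colmax = 0.9) {
  trial <- obj
  trial[loc] <- NA                      # `loc` is a two-column (row, col) matrix
  all(rowMeans(is.na(trial)) <= rowmax) &&
    all(colMeans(is.na(trial)) <= colmax)
}

obj <- matrix(seq_len(20) / 20, nrow = 4)   # 4 rows, 5 columns, no NA
loc <- cbind(row = 1L, col = 2L)            # mask one cell in row 1

within_budget(obj, loc)                 # TRUE: row 1 would be 20% missing
within_budget(obj, loc, rowmax = 0.1)   # FALSE: 20% exceeds a 10% row budget
```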

na_col_subset

Optional integer or character vector restricting which columns of obj are eligible for NA injection.

  • If NULL (default): all columns are eligible.

  • If character: values must exist in colnames(obj).

  • If integer/numeric: values must be valid 1-based column indices.

Values must be unique, and the vector must contain at least n_cols entries (or the number of columns derived from num_na). Ignored in tune_imp() when na_loc is supplied.
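
The rules for na_col_subset can be expressed as a small validation routine. The sketch below uses a hypothetical resolve_col_subset() helper with made-up error messages. It shows one way to map the argument to column indices and is not the package's internal check.

```r
# Hypothetical helper: turn na_col_subset into validated column indices.
resolve_col_subset <- function(obj, na_col_subset = NULL, n_cols = 1L) {
  if (is.null(na_col_subset)) {
    idx <- seq_len(ncol(obj))                      # all columns eligible
  } else if (is.character(na_col_subset)) {
    idx <- match(na_col_subset, colnames(obj))     # names must exist
    if (anyNA(idx)) stop("unknown column name in na_col_subset")
  } else {
    idx <- as.integer(na_col_subset)               # 1-based indices
    if (anyNA(idx) || any(idx < 1L | idx > ncol(obj))) {
      stop("column index out of range")
    }
  }
  if (anyDuplicated(idx) > 0L) stop("na_col_subset must be unique")
  if (length(idx) < n_cols) stop("na_col_subset has fewer than n_cols columns")
  idx
}

m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
resolve_col_subset(m, c("c", "a"), n_cols = 2L)  # 3 1
```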

max_attempts

Integer. Maximum number of resampling attempts per repetition before giving up due to row-budget exhaustion (default 100).

Details

The function uses a greedy stochastic search for valid NA locations. It ensures that:

  • Total missingness per row and column does not exceed rowmax and colmax.

  • At least two distinct observed values are preserved in every column to ensure the column maintains non-zero variance.
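
As a rough illustration of a search of this kind, the base R sketch below draws columns at random, picks rows that still have room in their row budget, and restarts the attempt whenever a constraint fails. sketch_sample_na() is a hypothetical, simplified stand-in covering one repetition with a fixed n_rows per column. The actual algorithm in sample_na_loc() may differ.

```r
# Simplified, hypothetical sketch of a constrained random search
# for NA locations (single repetition).
sketch_sample_na <- function(obj, n_cols, n_rows = 2L,
                             rowmax = 0.9, colmax = 0.9, max_attempts = 100) {
  for (attempt in seq_len(max_attempts)) {
    cols  <- sample.int(ncol(obj), n_cols)
    trial <- obj
    loc   <- NULL
    ok    <- TRUE
    for (j in cols) {
      # Rows observed in column j that can take one more NA within rowmax
      room   <- (rowSums(is.na(trial)) + 1) / ncol(trial) <= rowmax
      row_ok <- which(!is.na(trial[, j]) & room)
      if (length(row_ok) < n_rows) { ok <- FALSE; break }
      rows <- row_ok[sample.int(length(row_ok), n_rows)]
      trial[rows, j] <- NA
      left <- trial[!is.na(trial[, j]), j]
      # Column budget and the "two distinct observed values" rule
      if (mean(is.na(trial[, j])) > colmax || length(unique(left)) < 2L) {
        ok <- FALSE
        break
      }
      loc <- rbind(loc, cbind(row = rows, col = j))
    }
    if (ok) return(loc)
  }
  stop("no valid NA locations found within max_attempts")
}

set.seed(1)
mat <- matrix(runif(100), nrow = 10)
loc <- sketch_sample_na(mat, n_cols = 5, n_rows = 2)
dim(loc)  # 10 2
```

The sketch discards a whole attempt as soon as one column fails. That keeps the code short, but a real implementation might retry more selectively.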

Value

A list of length n_reps. Each element is a two-column integer matrix (row, col) representing the coordinates of the sampled NA locations.
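
Because each element is a two-column (row, col) matrix, it can be passed directly to R's matrix indexing. The sketch below builds such a list by hand, so it does not depend on the package. It shows how the original values might be set aside before masking. The object names are illustrative.

```r
# A hand-built stand-in for the return value: one repetition, two locations.
locs <- list(cbind(row = c(1L, 3L), col = c(2L, 2L)))

m <- matrix(1:9, nrow = 3)
held_out <- m[locs[[1]]]   # values at (1, 2) and (3, 2): 4 6
m[locs[[1]]] <- NA         # inject NA at exactly those coordinates
sum(is.na(m))              # 2
```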

Examples

set.seed(1)  # make the random example reproducible
mat <- matrix(runif(100), nrow = 10)

# Sample 5 `NA` across 5 columns (1 per column)
locs <- sample_na_loc(mat, n_cols = 5, n_rows = 1)
locs

# Inject the `NA` from the first repetition
mat[locs[[1]]] <- NA
mat


slideimp documentation built on April 17, 2026, 1:07 a.m.