impute_df: Impute missing values in a data frame by columns

View source: R/imputation.R

impute_df    R Documentation

Impute missing values in a data frame by columns

Description

Impute missing values in a data frame by columns

Usage

impute_df(
  x,
  imputation_type = c("none", "mean", "locf", "interp"),
  imputation_span = 5L,
  cyclic = FALSE,
  nmax_run = Inf
)

Arguments

x

A data.frame or matrix with numerical columns. Imputation works on each column separately.

imputation_type

A character string describing the imputation method; currently, one of four values:

  • "none": no imputation is carried out

  • "mean": missing values will be replaced by the average of imputation_span non-missing values before and imputation_span non-missing values after. Note: this may fail if there are fewer than 2 * imputation_span non-missing values.

  • "locf": missing values will be replaced using the "last-observation-carried-forward" approach

  • "interp": missing values will be replaced, separately for each run of missing values, by linear interpolation (or by extrapolation at the start or end of a sequence) using the two closest neighbors, assuming that rows represent equidistant steps
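The three imputing methods can be illustrated on a single numeric vector with base R. This is only a rough sketch of the ideas, not the code used by impute_df(): the helpers mean_sketch(), locf_sketch(), and interp_sketch() are hypothetical names introduced here. The sketch also makes two simplifying assumptions: mean_sketch() averages whatever neighbors are available instead of failing, and interp_sketch() does not extrapolate, so leading or trailing missing values stay NA.

```r
# Hypothetical helpers for illustration only; NOT the rSW2utils implementation.
v <- c(1, 2, NA, NA, 5, NA)

# "mean": average of up to `span` non-missing values before and after
# (assumption: use whatever neighbors exist rather than failing)
mean_sketch <- function(v, span) {
  ok <- which(!is.na(v))
  out <- v
  for (i in which(is.na(v))) {
    before <- utils::tail(ok[ok < i], span)
    after <- utils::head(ok[ok > i], span)
    out[i] <- mean(v[c(before, after)])
  }
  out
}

# "locf": carry the last non-missing value forward
locf_sketch <- function(v) {
  ok <- !is.na(v)
  # cumsum(ok) counts the observations seen so far; 0 means none yet
  idx <- cumsum(ok)
  out <- rep(NA_real_, length(v))
  out[idx > 0] <- v[ok][idx[idx > 0]]
  out
}

# "interp": linear interpolation, treating rows as equidistant steps
# (assumption: no extrapolation, so values outside the observed range stay NA)
interp_sketch <- function(v) {
  ok <- !is.na(v)
  stats::approx(x = which(ok), y = v[ok], xout = seq_along(v))$y
}

mean_sketch(v, span = 1L) # 1.0 2.0 3.5 3.5 5.0 5.0
locf_sketch(v)            # 1 2 2 2 5 5
interp_sketch(v)          # 1 2 3 4 5 NA
```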

imputation_span

An integer value. The number of non-missing values considered on each side of a missing value if imputation_type = "mean".

cyclic

A logical value. If TRUE, then the last row of x is considered to be a direct neighbor of the first row, e.g., when rows of x represent days of the year for an average year.
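The idea of cyclic neighbors can be sketched with modular indexing in base R. This is an illustration of the concept only, not the package's internal code; the variable names prev_cyclic and prev_linear are introduced here.

```r
# Hypothetical illustration of cyclic neighbors; NOT the rSW2utils code.
v <- c(NA, 2, 3, 4)
n <- length(v)

# Index of the preceding row for each row
prev_linear <- seq_len(n) - 1L             # 0 1 2 3: row 1 has no predecessor
prev_cyclic <- (seq_len(n) - 2L) %% n + 1L # 4 1 2 3: row 1 wraps back to row n

# With wrapping, the missing first row has a predecessor to borrow from
v[prev_cyclic][1] # 4
```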

nmax_run

An integer value. Runs (sets of consecutive missing values) that are no longer than nmax_run are imputed; longer runs remain unchanged. Any non-finite value is treated as infinity.
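One way to picture the run-length rule is with base R's rle(). The following is a minimal sketch of how short runs could be identified, not the implementation in impute_df(); the names short and imputable are introduced here for illustration.

```r
# Hypothetical illustration of the run-length rule; NOT the rSW2utils code.
v <- c(1, NA, NA, 4, NA, NA, NA, NA, 9)
nmax_run <- 3

# Treat any non-finite limit (NA, NaN, Inf, -Inf) as "no limit"
if (!is.finite(nmax_run)) nmax_run <- Inf

# Run-length encoding of the missingness pattern
runs <- rle(is.na(v))
# Flag runs of missing values that are short enough to impute
short <- runs$values & runs$lengths <= nmax_run
# Expand the run-level flags back to one flag per element
imputable <- rep(short, times = runs$lengths)

which(imputable) # 2 3: the run of length 4 (positions 5 to 8) is left alone
```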

Value

An updated version of x where missing values have been imputed for each column separately.

Examples

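n <- 30
# Rows to set to missing: runs of length 2, 4, 3, and 2
ids_missing <- c(1:2, 10:13, 20:22, (n-1):n)
# x0 keeps the complete data for comparison; x receives the missing values
x0 <- x <- data.frame(
  linear = seq_len(n),
  all_missing = NA,
  all_same = 1,
  cyclic = cos(2 * pi * seq_len(n) / n)
)
x[ids_missing, ] <- NA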

res <- list()
for (it in c("mean", "locf", "interp")) {
  res[[it]] <- impute_df(x, imputation_type = it, nmax_run = 3L)
  print(cbind(orig = x0[ids_missing, ], res[[it]][ids_missing, ]))
}

if (requireNamespace("graphics")) {
  par_prev <- graphics::par(mfrow = c(ncol(x) - 1L, 1L))
  for (k in seq_len(ncol(x))[-2L]) {
    graphics::plot(
      x[[k]],
      ylim = range(x0[[k]]),
      ylab = colnames(x)[[k]],
      type = "l"
    )
    graphics::points(
      ids_missing,
      x0[ids_missing, k],
      pch = 1L,
      col = 1L
    )
    for (it in seq_along(res)) {
      graphics::points(
        ids_missing,
        res[[it]][ids_missing, k],
        pch = 1L + it,
        col = 1L + it
      )
    }
  }
  graphics::par(par_prev)
}


DrylandEcology/rSW2utils documentation built on Dec. 9, 2023, 10:44 p.m.