correctHeaps: Correct Age Heaping

View source: R/correctHeap.R

correctHeaps    R Documentation

Description

Age heaping can cause substantial bias in important demographic measures and thus should be corrected. This function corrects heaping at regular intervals (every 5 or 10 years) by replacing a proportion of heaped observations with draws from fitted truncated distributions.

Usage

correctHeaps(
  x,
  heaps = "10year",
  method = "lnorm",
  start = 0,
  fixed = NULL,
  model = NULL,
  dataModel = NULL,
  seed = NULL,
  na.action = "omit",
  verbose = FALSE,
  sd = NULL
)

correctHeaps2(
  x,
  heaps = "10year",
  method = "lnorm",
  start = 0,
  fixed = NULL,
  model = NULL,
  dataModel = NULL,
  seed = NULL,
  na.action = "omit",
  verbose = FALSE,
  sd = NULL
)

Arguments

x

numeric vector of ages (typically integers).

heaps

character string specifying the heaping pattern:

"5year"

heaps are assumed every 5 years (0, 5, 10, 15, ...)

"10year"

heaps are assumed every 10 years (0, 10, 20, ...)

Alternatively, a numeric vector specifying custom heap positions.

method

character string specifying the distribution used for correction:

"lnorm"

truncated log-normal distribution (default). Parameters are estimated from the input data.

"norm"

truncated normal distribution. Parameters are estimated from the input data.

"unif"

uniform distribution within the truncation bounds.

"kernel"

kernel density estimation for nonparametric sampling.

start

numeric value for the starting point of the heap sequence (default 0). Use 5 if heaps occur at 5, 15, 25, ... instead of 0, 10, 20, ... Ignored if heaps is a numeric vector.

fixed

numeric vector of indices indicating observations that should not be changed. Useful for preserving known accurate values.

model

optional formula for model-based correction. When provided, a random forest model is fit to predict age from other variables, and the correction direction is adjusted to be consistent with this prediction. Requires packages ranger and VIM.

dataModel

data frame containing variables for the model formula. Required when model is specified. Missing values are imputed using k-nearest neighbors via kNN.

seed

optional integer used as the random seed to ensure reproducibility. If NULL (default), no seed is set.

na.action

character string specifying how to handle NA values:

"omit"

remove NA values before processing, then restore positions (default)

"fail"

stop with an error if NA values are present

verbose

logical. If TRUE, return a list with corrected values and diagnostic information. If FALSE (default), return only the corrected vector.

sd

optional numeric value for standard deviation when method = "norm". If NULL (default), estimated from the data using MAD (median absolute deviation) of non-heap ages, which is robust to the heaping.
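As a rough illustration of why the MAD of non-heap ages is a robust default, the sketch below compares it with the plain standard deviation on simulated data. The simulated ages and the variable names are assumptions made for this example; the function's internal estimator may differ in detail.

```r
set.seed(1)

# Simulated ages (assumption: roughly normal, sd = 12) plus artificial 10-year heaps
age <- round(rnorm(5000, mean = 40, sd = 12))
age <- c(age, rep(seq(10, 70, by = 10), each = 150))

# Drop the heap ages before estimating the spread
heap_ages <- seq(0, max(age), by = 10)
non_heap  <- age[!(age %in% heap_ages)]

# mad() applies the constant 1.4826 by default, so it estimates the
# standard deviation under normality
sd_mad <- mad(non_heap)
sd_raw <- sd(age)
```

Here `sd_mad` stays close to the value used in the simulation, while `sd_raw` is computed on data that still contain the heaps.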

Details

Correct for age heaping at regular intervals using truncated distributions.

For method “lnorm”, a truncated log-normal distribution is fit to the whole age distribution. Then, for each age heap (at 0, 5, 10, 15, ... or 0, 10, 20, ...), replacement values are drawn from a log-normal distribution truncated to lower and upper bounds around that heap.
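The draw from a truncated log-normal distribution can be sketched with inverse-CDF sampling in base R. The helper `rtrunc_lnorm`, the moment fit on the log scale, and the bounds of 2 years around heap 5 are assumptions chosen for illustration, not the package's actual implementation.

```r
# Draw n values from a log-normal distribution truncated to [lower, upper]
# via inverse-CDF sampling. Hypothetical helper, not part of the package.
rtrunc_lnorm <- function(n, meanlog, sdlog, lower, upper) {
  p_lo <- plnorm(lower, meanlog, sdlog)   # CDF at the lower bound
  p_hi <- plnorm(upper, meanlog, sdlog)   # CDF at the upper bound
  u <- runif(n, min = p_lo, max = p_hi)   # uniform on the truncated CDF range
  qlnorm(u, meanlog, sdlog)               # map back to the age scale
}

set.seed(1)
age <- round(rlnorm(5000, meanlog = 3, sdlog = 0.6))
age <- age[age > 0]                       # log() needs strictly positive ages

# Simple fit on the log scale (an assumption; the package may fit differently)
meanlog_hat <- mean(log(age))
sdlog_hat   <- sd(log(age))

# Candidate replacement values for the heap at age 5, bounded at 3 and 7
heap  <- 5
draws <- round(rtrunc_lnorm(100, meanlog_hat, sdlog_hat,
                            lower = heap - 2, upper = heap + 2))
```

All values in `draws` fall between 3 and 7, and their relative frequencies follow the shape of the fitted distribution inside that window.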

The correction range depends on the heap type:

  • For 5-year heaps: values are drawn from ±2 years around the heap

  • For 10-year heaps: values are drawn in two groups, ±4 and ±5 years around the heap

The ratio of observations to replace is calculated by comparing the count at each heap age to the arithmetic mean of the two neighboring ages. For example, for age heap 5, the ratio is: count(age=5) / mean(count(age=4), count(age=6)).
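The ratio formula above can be reproduced in a few lines of base R. The helper name `heap_ratio` and the toy data are assumptions for this sketch; how the function converts the ratio into a number of replaced observations is not shown here.

```r
# Heaping ratio at one heap age: count at the heap divided by the
# arithmetic mean of the counts at the two neighbouring ages.
# Hypothetical helper, not part of the package.
heap_ratio <- function(x, heap) {
  n_heap  <- sum(x == heap,     na.rm = TRUE)
  n_left  <- sum(x == heap - 1, na.rm = TRUE)
  n_right <- sum(x == heap + 1, na.rm = TRUE)
  n_heap / mean(c(n_left, n_right))
}

# Toy data: 10 people aged 4, 30 aged 5, 20 aged 6
ages <- c(rep(4, 10), rep(5, 30), rep(6, 20))
heap_ratio(ages, 5)   # 30 / mean(c(10, 20)) = 2
```

A ratio close to 1 indicates that the heap age is no more frequent than its neighbours; a ratio of 2 means it occurs twice as often as their average.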

Method “norm” uses truncated normal distributions instead. The choice between “lnorm” and “norm” depends on whether the age distribution is right-skewed (use “lnorm”) or more symmetric (use “norm”). Many distributions with heaping problems are right-skewed.
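One way to inform the choice between “lnorm” and “norm” is to look at the sample skewness of the ages. The helper `sample_skewness` and the simulated data below are assumptions for illustration; where to draw the line between "skewed" and "symmetric" remains a judgement call.

```r
# Moment-based sample skewness. Hypothetical helper, not part of the package.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
skewed    <- rlnorm(5000, meanlog = 3, sdlog = 0.6)   # right-skewed ages
symmetric <- rnorm(5000, mean = 40, sd = 12)          # roughly symmetric ages

skew_skewed    <- sample_skewness(skewed)      # clearly positive
skew_symmetric <- sample_skewness(symmetric)   # close to zero
```

A clearly positive value points towards “lnorm”, a value near zero towards “norm”.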

Method “unif” draws from truncated uniform distributions around the age heaps, providing a simpler baseline approach.

Method “kernel” uses kernel density estimation to sample replacement values, providing a nonparametric alternative that adapts to the local data distribution.
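The idea behind kernel-based sampling can be sketched with `density()` from base R: estimate the density on a window around the heap and sample from that grid in proportion to the estimate. The window of 2 years, the grid size, and the variable names are assumptions; the package's kernel method may be implemented differently.

```r
set.seed(1)
age <- round(rlnorm(5000, meanlog = 3, sdlog = 0.6))

heap  <- 20
lower <- heap - 2
upper <- heap + 2

# Kernel density estimate of all ages, evaluated only on the window
dens <- density(age, from = lower, to = upper, n = 512)

# Sample grid points with probability proportional to the estimated density,
# then round back to integer ages
draws <- round(sample(dens$x, size = 100, replace = TRUE, prob = dens$y))
```

Because the weights come from the data rather than from a parametric family, the replacement values follow whatever local shape the age distribution has inside the window.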

Repeated calls of this function mimic multiple imputation: repeating the procedure m times yields m corrected datasets that together reflect the uncertainty introduced by the correction. Use the seed parameter to make each call reproducible, with a different seed per repetition so that the m datasets differ.
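The repeated-call workflow can be sketched as follows. The wrapper `correct_m_times` is a hypothetical helper, and `toy_corrector` is only a stand-in so that the sketch runs without the package; it is not the package's correction method. In practice `correctHeaps` would be passed as `corrector`, together with arguments such as `heaps` and `method`.

```r
# Run a corrector m times, once per seed, and collect the corrected vectors.
# `corrector` is any function that accepts x and a `seed` argument, as
# correctHeaps() is documented to do. Hypothetical helper.
correct_m_times <- function(x, m, corrector, ...) {
  lapply(seq_len(m), function(i) corrector(x, ..., seed = i))
}

# Stand-in corrector for demonstration only: jitters values at multiples of 5.
toy_corrector <- function(x, seed = NULL, ...) {
  if (!is.null(seed)) set.seed(seed)
  on_heap <- x %% 5 == 0
  x[on_heap] <- x[on_heap] + sample(-2:2, sum(on_heap), replace = TRUE)
  x
}

age <- c(23, 25, 25, 25, 30, 30, 31, 35, 35, 40)
imputations <- correct_m_times(age, m = 5, corrector = toy_corrector)

# One estimate per corrected dataset, then pool across the m datasets
estimates   <- sapply(imputations, mean)
pooled_mean <- mean(estimates)
between_var <- var(estimates)   # variability due to the correction step
```

The spread of `estimates` across the m datasets gives a sense of how much the random correction affects the quantity of interest.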

Value

If verbose = FALSE, a numeric vector of the same length as x with heaping corrected. If verbose = TRUE, a list with:

corrected

the corrected numeric vector

n_changed

total number of values changed

changes_by_heap

named vector of changes per heap age

ratios

named vector of heaping ratios per heap age

method

method used

seed

seed used (if any)

Author(s)

Matthias Templ, Bernhard Meindl

References

Templ, M. (2026). Correction of heaping on individual level. Journal TBD.

Templ, M., Meindl, B., Kowarik, A., Alfons, A., Dupriez, O. (2017). Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Software, 79(10), 1-38. doi:10.18637/jss.v079.i10

See Also

correctSingleHeap for correcting a single specific heap.

Other heaping correction: correctSingleHeap()

Examples

# Create artificial age data with log-normal distribution
set.seed(123)
age <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age <- round(age[age < 93])

# Artificially introduce 5-year heaping
year5 <- seq(0, max(age), 5)
age5 <- sample(c(age, age[age %in% year5]))

# Correct with reproducible results
age5_corrected <- correctHeaps(age5, heaps = "5year", method = "lnorm", seed = 42)

# Get diagnostic information
result <- correctHeaps(age5, heaps = "5year", verbose = TRUE, seed = 42)
print(result$n_changed)
print(result$ratios)

# Use kernel method for nonparametric correction
age5_kernel <- correctHeaps(age5, heaps = "5year", method = "kernel", seed = 42)

# Custom heap positions (e.g., heaping at 12, 18, 21)
custom_heaps <- c(12, 18, 21)
age_custom <- correctHeaps(age5, heaps = custom_heaps, method = "lnorm", seed = 42)

heaping documentation built on Feb. 10, 2026, 1:08 a.m.