| correctHeaps | R Documentation |
Age heaping can cause substantial bias in important demographic measures and thus should be corrected. This function corrects heaping at regular intervals (every 5 or 10 years) by replacing a proportion of heaped observations with draws from fitted truncated distributions.
correctHeaps(
x,
heaps = "10year",
method = "lnorm",
start = 0,
fixed = NULL,
model = NULL,
dataModel = NULL,
seed = NULL,
na.action = "omit",
verbose = FALSE,
sd = NULL
)
correctHeaps2(
x,
heaps = "10year",
method = "lnorm",
start = 0,
fixed = NULL,
model = NULL,
dataModel = NULL,
seed = NULL,
na.action = "omit",
verbose = FALSE,
sd = NULL
)
x |
numeric vector of ages (typically integers). |
heaps |
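character string specifying the heaping pattern: "5year" for heaps every 5 years (0, 5, 10, ...) or "10year" (the default) for heaps every 10 years (0, 10, 20, ...). Alternatively, a numeric vector specifying custom heap positions. |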
method |
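character string specifying the distribution used for correction: "lnorm" (truncated log-normal, the default), "norm" (truncated normal), "unif" (uniform), or "kernel" (kernel density estimation). |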
start |
numeric value for the starting point of the heap sequence
(default 0). Use 5 if heaps occur at 5, 15, 25, ... instead of 0, 10, 20, ...
Ignored if |
fixed |
numeric vector of indices indicating observations that should not be changed. Useful for preserving known accurate values. |
model |
optional formula for model-based correction. When provided, a random forest model is fit to predict age from other variables, and the correction direction is adjusted to be consistent with this prediction. Requires packages ranger and VIM. |
dataModel |
data frame containing variables for the model formula.
Required when model is specified. |
seed |
optional integer for random seed to ensure reproducibility.
If |
na.action |
character string specifying how to handle
|
verbose |
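logical. If TRUE, a list with the corrected vector and diagnostic information is returned instead of the corrected vector alone. |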
sd |
optional numeric value for standard deviation when |
Correct for age heaping at regular intervals using truncated distributions.
For method “lnorm”, a truncated log-normal distribution is fit to the whole age distribution. Then for each age heap (at 0, 5, 10, 15, ... or 0, 10, 20, ...) random numbers from a truncated log-normal distribution (with lower and upper bounds) are drawn.
The correction range depends on the heap type:
For 5-year heaps: values are drawn from ±2 years around the heap
For 10-year heaps: values are drawn in two groups, ±4 and ±5 years around the heap
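To make the idea of bounded draws concrete, the following is a minimal base-R sketch of sampling from a log-normal distribution truncated to a window around a heap, using the inverse-CDF method. It is not taken from the package source: the helper name rtrunc_lnorm, the parameter values (meanlog = 3.5, sdlog = 0.6), and the heap at age 40 are assumptions chosen for illustration only. In correctHeaps the parameters are fitted to the whole age distribution.

```r
# Hypothetical helper (not part of the package): draw n values from a
# log-normal distribution truncated to [lower, upper] via the inverse CDF.
rtrunc_lnorm <- function(n, meanlog, sdlog, lower, upper) {
  # CDF values at the truncation bounds
  p_lo <- plnorm(lower, meanlog, sdlog)
  p_hi <- plnorm(upper, meanlog, sdlog)
  # Uniform draws restricted to [p_lo, p_hi], mapped back through the quantile function
  u <- runif(n, min = p_lo, max = p_hi)
  qlnorm(u, meanlog, sdlog)
}

set.seed(1)
heap <- 40  # illustrative heap age
# 5-year heap: replacement values within 2 years on either side of the heap
draws <- rtrunc_lnorm(1000, meanlog = 3.5, sdlog = 0.6,
                      lower = heap - 2, upper = heap + 2)
range(draws)
```

The draws are continuous. How the function maps them back to integer ages is not described on this page, so that step is left out of the sketch.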
The ratio of observations to replace is calculated by comparing the count at each heap age to the arithmetic mean of the two neighboring ages. For example, for age heap 5, the ratio is: count(age=5) / mean(count(age=4), count(age=6)).
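The ratio described above can be reproduced by hand on a small example. This is a sketch in base R with made-up counts. The helper heap_ratio is hypothetical and assumes the heap has a neighbor on both sides.

```r
# Toy data: 10 persons aged 4, 30 aged 5 (the heap), 14 aged 6
age <- c(rep(4, 10), rep(5, 30), rep(6, 14))

# counts[a + 1] holds the number of observations with age a (ages 0..100)
counts <- tabulate(as.integer(age) + 1L, nbins = 101)

# Hypothetical helper: count at the heap divided by the arithmetic mean
# of the counts at the two neighboring ages (heap - 1 and heap + 1)
heap_ratio <- function(counts, heap) {
  counts[heap + 1] / mean(counts[c(heap, heap + 2)])
}

heap_ratio(counts, 5)  # 30 / mean(c(10, 14)) = 2.5
```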
Method “norm” uses truncated normal distributions instead. The choice between “lnorm” and “norm” depends on whether the age distribution is right-skewed (use “lnorm”) or more symmetric (use “norm”). Many distributions with heaping problems are right-skewed.
Method “unif” draws from truncated uniform distributions around the age heaps, providing a simpler baseline approach.
Method “kernel” uses kernel density estimation to sample replacement values, providing a nonparametric alternative that adapts to the local data distribution.
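One common way to sample from a kernel density estimate within bounds is a smoothed bootstrap with rejection: resample observed values, add Gaussian noise with the kernel bandwidth, and keep only candidates inside the window. The sketch below illustrates that general technique in base R. It is an assumption that this resembles what method "kernel" does internally, and the helper name rkernel_trunc, the simulated data, and the window around age 40 are illustrative.

```r
# Hypothetical helper: sample n values from a Gaussian kernel density
# estimate of x, restricted to [lower, upper], by smoothed bootstrap
# (resample + noise) followed by rejection of out-of-range candidates.
rkernel_trunc <- function(n, x, lower, upper, bw = bw.nrd0(x)) {
  out <- numeric(0)
  while (length(out) < n) {
    cand <- sample(x, n, replace = TRUE) + rnorm(n, sd = bw)
    out <- c(out, cand[cand >= lower & cand <= upper])
  }
  out[seq_len(n)]
}

set.seed(1)
x <- round(rlnorm(5000, meanlog = 3.5, sdlog = 0.6))  # illustrative ages
draws <- rkernel_trunc(200, x, lower = 38, upper = 42)
summary(draws)
```

The loop only terminates if the estimated density has mass inside the window, which holds here because the simulated data contain many values near age 40.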
Repeated calls of this function mimic multiple imputation, i.e., repeating
this procedure m times provides m corrected datasets that properly reflect
the uncertainty from the correction process. Use the seed parameter
to ensure reproducibility.
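The repeated-correction workflow can be sketched as follows. The helper estimate_over_corrections is hypothetical and accepts any function(x, seed) that returns a corrected age vector. So that the sketch runs without the package installed, it uses a toy stand-in corrector that is not the method implemented by correctHeaps. The commented line shows how the documented function could be plugged in instead.

```r
# Hypothetical helper: apply a corrector m times with seeds 1..m and
# compute an estimate on each corrected dataset.
estimate_over_corrections <- function(x, m, corrector, estimator = mean) {
  vapply(seq_len(m),
         function(i) estimator(corrector(x, seed = i)),
         numeric(1))
}

# With the package attached, a corrector could be defined as:
#   corrector <- function(x, seed) correctHeaps(x, heaps = "5year", seed = seed)

# Toy stand-in (assumption, for illustration only): moves a random half of
# the values at multiples of 5 by up to 2 years in either direction.
toy_corrector <- function(x, seed) {
  set.seed(seed)
  at_heap <- which(x %% 5 == 0)
  move <- at_heap[runif(length(at_heap)) < 0.5]
  x[move] <- x[move] + sample(-2:2, length(move), replace = TRUE)
  x
}

set.seed(123)
age <- sample(20:60, 2000, replace = TRUE)

# Share of persons aged 40 or older, estimated on each of m = 10 corrections
est <- estimate_over_corrections(age, m = 10, corrector = toy_corrector,
                                 estimator = function(a) mean(a >= 40))
mean(est)  # point estimate averaged over the corrected datasets
var(est)   # variability between corrections
```

The spread of est across seeds gives a rough picture of how much the correction step itself contributes to the uncertainty of the estimate.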
If verbose = FALSE, a numeric vector of the same length as
x with heaping corrected. If verbose = TRUE, a list with:
the corrected numeric vector
total number of values changed
named vector of changes per heap age
named vector of heaping ratios per heap age
method used
seed used (if any)
Matthias Templ, Bernhard Meindl
Templ, M. (2026). Correction of heaping on individual level. Journal TBD.
Templ, M., Meindl, B., Kowarik, A., Alfons, A., Dupriez, O. (2017). Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Software, 79(10), 1-38. doi:10.18637/jss.v079.i10
correctSingleHeap for correcting a single specific heap.
Other heaping correction:
correctSingleHeap()
# Create artificial age data with log-normal distribution
set.seed(123)
age <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age <- round(age[age < 93])
# Artificially introduce 5-year heaping
year5 <- seq(0, max(age), 5)
age5 <- sample(c(age, age[age %in% year5]))
# Correct with reproducible results
age5_corrected <- correctHeaps(age5, heaps = "5year", method = "lnorm", seed = 42)
# Get diagnostic information
result <- correctHeaps(age5, heaps = "5year", verbose = TRUE, seed = 42)
print(result$n_changed)
print(result$ratios)
# Use kernel method for nonparametric correction
age5_kernel <- correctHeaps(age5, heaps = "5year", method = "kernel", seed = 42)
# Custom heap positions (e.g., heaping at 12, 18, 21)
custom_heaps <- c(12, 18, 21)
age_custom <- correctHeaps(age5, heaps = custom_heaps, method = "lnorm", seed = 42)