Introduction to the heaping Package

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)

Overview

The heaping package provides tools for detecting and correcting heaping (digit preference) in survey data at the individual record level. Heaping occurs when respondents disproportionately report values at round numbers---for example, ages ending in 0 or 5, incomes rounded to the nearest thousand, or weights rounded to the nearest 5 pounds.

The package provides:

  1. Heaping indices for detecting and measuring heaping severity
  2. Correction methods for removing heaping while preserving individual-level data
  3. Model-based correction that preserves relationships with covariates
library(heaping)

The Problem: What is Heaping?

Heaping is a common data quality issue where certain values appear more frequently than expected due to rounding or digit preference. This is especially common with self-reported data.

Let's create some example data to illustrate, using the same distribution parameters as in Templ (2024):

# Generate realistic age data using a log-normal distribution
# These parameters produce a realistic age distribution
set.seed(123)
age_true <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age_true <- round(age_true[age_true < 93])

# Create heaped version where 27% of respondents round their age to multiples of 5
# This heap_ratio is typical for survey data with moderate heaping
heap_ratio <- 0.27
n <- length(age_true)
indices_to_heap <- sample(n, round(heap_ratio * n))
age_heaped <- age_true
age_heaped[indices_to_heap] <- round(age_heaped[indices_to_heap] / 5) * 5

Let's visualize the difference:

oldpar <- par(mfrow = c(1, 2))

# Define breaks to cover the full range
age_breaks <- seq(0, max(age_true, age_heaped) + 1, by = 1)

# True ages
hist(age_true, breaks = age_breaks, col = "steelblue",
     main = "True Ages (No Heaping)", xlab = "Age",
     border = "white")

# Heaped ages
hist(age_heaped, breaks = age_breaks, col = "coral",
     main = "Heaped Ages", xlab = "Age",
     border = "white")
par(oldpar)

Notice the spikes at ages ending in 0 and 5 in the heaped data.

Detecting Heaping: Heaping Indices

Before correcting heaping, we need to detect and quantify it. The package provides several established indices.

Whipple's Index

The most commonly used index. It compares the frequency of ages ending in 0 or 5 to the expected frequency.

# Standard Whipple index (100 = no heaping, 500 = maximum heaping)
whipple(age_true, method = "standard")
whipple(age_heaped, method = "standard")

# Modified Whipple index (0 = no heaping, 1 = maximum heaping)
whipple(age_true, method = "modified")
whipple(age_heaped, method = "modified")

Myers' Index

Myers' blended index measures preferences for each terminal digit (0-9):

# Myers' index (0 = no heaping, 90 = maximum)
myers(age_true)
myers(age_heaped)

Bachi's Index

Similar to Myers' but uses a different blending technique:

# Bachi's index (0 = no heaping, 90 = maximum)
bachi(age_true)
bachi(age_heaped)

Noumbissi's Index

Evaluates heaping for a specific terminal digit:

# Noumbissi's index for digit 0 (1.0 = no heaping)
noumbissi(age_true, digit = 0)
noumbissi(age_heaped, digit = 0)

# Noumbissi's index for digit 5
noumbissi(age_true, digit = 5)
noumbissi(age_heaped, digit = 5)

Convenience Function: All Indices at Once

# Calculate all indices
indices_true <- heaping_indices(age_true)
indices_heaped <- heaping_indices(age_heaped)

# Compare
data.frame(
  Index = names(indices_true),
  True = round(unlist(indices_true), 3),
  Heaped = round(unlist(indices_heaped), 3)
)

Using Sampling Weights

All indices support sampling weights for survey data:

# Create example weights
weights <- runif(length(age_heaped), 0.5, 2)

# Weighted Whipple index
whipple(age_heaped, weight = weights)

Correcting Heaping

The key innovation of this package is correcting heaping at the individual level. Traditional methods only smooth aggregated frequency distributions, which destroys individual records needed for regression analysis.

Basic Correction

The correctHeaps() function corrects heaping by replacing a proportion of heaped values with draws from truncated distributions:

# Correct heaping using log-normal distribution
age_corrected <- correctHeaps(age_heaped,
                              heaps = "5year",  # heaps at 0, 5, 10, 15, ...
                              method = "lnorm",
                              seed = 42)        # for reproducibility

# Compare Whipple indices
c(Original = whipple(age_true),
  Heaped = whipple(age_heaped),
  Corrected = whipple(age_corrected))

Let's visualize the correction:

oldpar <- par(mfrow = c(1, 2))

hist(age_heaped, breaks = age_breaks, col = "coral",
     main = "Before Correction", xlab = "Age", border = "white")

hist(age_corrected, breaks = age_breaks, col = "forestgreen",
     main = "After Correction", xlab = "Age", border = "white")

par(oldpar)

Correction Methods

Four distribution methods are available:

# Log-normal (default) - good for right-skewed data like age
age_lnorm <- correctHeaps(age_heaped, method = "lnorm", seed = 42)

# Normal - good for symmetric data
age_norm <- correctHeaps(age_heaped, method = "norm", seed = 42)

# Uniform - simplest approach, baseline comparison
age_unif <- correctHeaps(age_heaped, method = "unif", seed = 42)

# Kernel - nonparametric, adapts to local data shape
age_kernel <- correctHeaps(age_heaped, method = "kernel", seed = 42)

# Compare results
data.frame(
  Method = c("Original", "Heaped", "Log-normal", "Normal", "Uniform", "Kernel"),
  Whipple = c(whipple(age_true), whipple(age_heaped),
              whipple(age_lnorm), whipple(age_norm),
              whipple(age_unif), whipple(age_kernel))
)

Verbose Output and Diagnostics

Use verbose = TRUE to get detailed information about the correction:

result <- correctHeaps(age_heaped, heaps = "5year",
                       method = "lnorm", seed = 42, verbose = TRUE)

# Number of values changed
result$n_changed

# Changes by heap age
head(result$changes_by_heap, 10)

# Heaping ratios (ratio > 1 indicates heaping)
head(result$ratios, 10)

10-Year Heaping

For data with heaping only at ages ending in 0:

# Create data with 10-year heaping
age_heap10 <- age_true
heapers10 <- sample(c(TRUE, FALSE), length(age_true),
                    replace = TRUE, prob = c(0.3, 0.7))
age_heap10[heapers10] <- round(age_heap10[heapers10] / 10) * 10

# Correct 10-year heaping
age_corrected10 <- correctHeaps(age_heap10, heaps = "10year",
                                method = "lnorm", seed = 42)

c(Heaped = whipple(age_heap10),
  Corrected = whipple(age_corrected10))

Custom Heap Positions

For non-standard heaping patterns, specify exact positions:

# Heaping at specific ages (e.g., 18, 21, 30, 40, 50, 65)
custom_positions <- c(18, 21, 30, 40, 50, 65)
age_custom <- correctHeaps(age_heaped,
                           heaps = custom_positions,
                           method = "lnorm", seed = 42)

Single Heap Correction

To correct heaping at just one specific age:

# Add artificial heap at age 40
age_with_40heap <- c(age_true, rep(40, 500))

# Correct only the heap at 40
age_fixed_40 <- correctSingleHeap(age_with_40heap,
                                   heap = 40,
                                   before = 3,  # range: 37-43
                                   after = 3,
                                   method = "lnorm",
                                   seed = 42)

# Check the counts at age 40
c(Before = sum(age_with_40heap == 40),
  After = sum(age_fixed_40 == 40))

Model-Based Correction

When age is correlated with other variables (income, education, etc.), simple correction can distort these relationships. Model-based correction uses covariates to make smarter corrections.

# Create a dataset with correlated variables following the paper's approach
set.seed(123)

# Generate age from log-normal distribution
age <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age <- round(age[age < 93 & age >= 18])
n <- length(age)

# Simulate covariates correlated with age
age_scaled <- scale(age)

# Income as a function of age with noise
income <- exp(3 + 0.5 * age_scaled + rnorm(n, 0, 0.5))

# Marital status influenced by age
marital_status <- ifelse(plogis(-3 + 0.05 * age_scaled + rnorm(n, 0, 0.5)) > 0.5,
                         "married", "single")

# Education level with age-dependent probabilities
get_education_prob <- function(a) {
  if (a < 25) return(c(0.5, 0.35, 0.15))
  else if (a < 40) return(c(0.3, 0.45, 0.25))
  else return(c(0.2, 0.4, 0.4))
}
education <- sapply(age, function(a) {
  sample(c("High School", "Bachelor", "Master"), 1, prob = get_education_prob(a))
})

data_example <- data.frame(
  age = age,
  income = income,
  marital_status = marital_status,
  education = education
)

# Introduce heaping with 27% heap ratio (as in the paper)
heap_ratio <- 0.27
indices_to_heap <- sample(n, round(heap_ratio * n))
data_example$age_heaped <- data_example$age
data_example$age_heaped[indices_to_heap] <- round(data_example$age_heaped[indices_to_heap] / 5) * 5

Apply model-based correction:

# Model-based correction using income, education and marital status
if (requireNamespace("ranger", quietly = TRUE) &&
    requireNamespace("VIM", quietly = TRUE)) {

  # Also apply simple correction for comparison
  data_example$age_simple <- correctHeaps(
    data_example$age_heaped,
    heaps = "5year",
    method = "lnorm",
    seed = 42
  )

  # Model-based correction using covariates
  data_example$age_corrected <- correctHeaps(
    data_example$age_heaped,
    heaps = "5year",
    method = "lnorm",
    model = age_heaped ~ income + marital_status + education,
    dataModel = data_example,
    seed = 42
  )

  # Compare correlations with log(income)
  log_income <- log(data_example$income)
  cor_original <- cor(data_example$age, log_income)
  cor_heaped <- cor(data_example$age_heaped, log_income)
  cor_simple <- cor(data_example$age_simple, log_income)
  cor_corrected <- cor(data_example$age_corrected, log_income)

  cat("Correlation of age with log(income):\n")
  cat("  Original (true):", round(cor_original, 4), "\n")
  cat("  Heaped:", round(cor_heaped, 4), "\n")
  cat("  Simple correction:", round(cor_simple, 4), "\n")
  cat("  Model-based correction:", round(cor_corrected, 4), "\n")
}

Let's visualize the model-based correction:

if (requireNamespace("ranger", quietly = TRUE) &&
    requireNamespace("VIM", quietly = TRUE)) {

  oldpar <- par(mfrow = c(2, 2))

  # Age distributions
  age_breaks <- seq(min(data_example$age) - 1, max(data_example$age) + 1, by = 1)
  hist(data_example$age_heaped, breaks = age_breaks, col = "coral",
       main = "Heaped Ages", xlab = "Age", border = "white")

  hist(data_example$age_corrected, breaks = age_breaks, col = "forestgreen",
       main = "Model-Based Corrected Ages", xlab = "Age", border = "white")

  # Age vs log(Income) relationships
  log_income <- log(data_example$income)
  plot(data_example$age_heaped, log_income,
       pch = 16, col = adjustcolor("coral", 0.3), cex = 0.5,
       main = "Heaped: Age vs log(Income)",
       xlab = "Age", ylab = "log(Income)")
  abline(lm(log_income ~ data_example$age_heaped), col = "darkred", lwd = 2)

  plot(data_example$age_corrected, log_income,
       pch = 16, col = adjustcolor("forestgreen", 0.3), cex = 0.5,
       main = "Corrected: Age vs log(Income)",
       xlab = "Age", ylab = "log(Income)")
  abline(lm(log_income ~ data_example$age_corrected), col = "darkgreen", lwd = 2)

  par(oldpar)
}

The model-based correction preserves the relationship between age and income better than simple correction because it uses the covariate information to determine the direction of adjustments.

Multiple Imputation Framework

Because correction involves random sampling, you can create multiple corrected datasets for proper uncertainty quantification:

# Create 5 corrected datasets with different seeds
m <- 5
corrected_datasets <- lapply(1:m, function(i) {
  correctHeaps(age_heaped, heaps = "5year",
               method = "lnorm", seed = i * 100)
})

# Calculate Whipple index for each
whipple_values <- sapply(corrected_datasets, whipple)

cat("Whipple indices across imputations:\n")
cat("  Mean:", round(mean(whipple_values), 2), "\n")
cat("  SD:", round(sd(whipple_values), 2), "\n")
cat("  Range:", round(min(whipple_values), 2), "-",
    round(max(whipple_values), 2), "\n")

Protecting Specific Observations

If some ages are known to be accurate (e.g., verified from administrative records), exclude them from correction:

# Assume first 100 observations are verified
verified_indices <- 1:100

age_protected <- correctHeaps(age_heaped,
                              heaps = "5year",
                              method = "lnorm",
                              fixed = verified_indices,  # don't change these
                              seed = 42)

# Verify protected observations unchanged
all(age_heaped[verified_indices] == age_protected[verified_indices])

Working with Aggregated Data: Sprague Multipliers

If you have 5-year age group data and need single-year ages, use the Sprague multipliers:

# Example: population counts in 5-year groups
pop_5year <- c(
  1971990, 2095820, 2157190, 2094110, 2116580,  # 0-4, 5-9, ..., 20-24
  2003840, 1785690, 1502990, 1214170, 796934,   # 25-29, ..., 45-49
  627551, 530305, 488014, 364498, 259029,       # 50-54, ..., 70-74
  158047, 125941                                 # 75-79, 80+
)

# Disaggregate to single years
pop_single <- sprague(pop_5year)

# First 15 ages
head(pop_single, 15)

# Total is preserved
c(Sum_5year = sum(pop_5year), Sum_single = sum(pop_single))

Indices for Old Ages

For data quality assessment at older ages, specialized indices are available:

# Create old-age data with heaping
set.seed(42)
old_ages <- c(sample(85:105, 3000, replace = TRUE),
              rep(c(90, 95, 100), each = 200))  # heaping at round ages

# Coale-Li index (designed for ages 60+)
coale_li(old_ages, digit = 0, ageMin = 85)

# Jdanov index (for very old ages like 95, 100, 105)
jdanov(old_ages, Agei = c(95, 100, 105))

# Kannisto index (for a single old age)
kannisto(old_ages, Agei = 90)
kannisto(old_ages, Agei = 95)

Summary

The heaping package provides a complete toolkit for handling heaping in survey data:

| Function | Purpose | |----------|---------| | whipple() | Standard and modified Whipple index | | myers() | Myers' blended index | | bachi() | Bachi's index | | noumbissi() | Single-digit Noumbissi index | | spoorenberg() | Total Modified Whipple index | | coale_li() | Coale-Li index for old ages | | jdanov() | Jdanov index for very old ages | | kannisto() | Kannisto index for single old age | | heaping_indices() | All common indices at once | | correctHeaps() | Correct regular heaping patterns | | correctSingleHeap() | Correct a single heap | | sprague() | Disaggregate 5-year to single-year ages |

Key advantages over traditional methods:

  1. Preserves individual records for regression and machine learning
  2. Multiple correction methods (log-normal, normal, uniform, kernel)
  3. Model-based correction preserves covariate relationships
  4. Supports sampling weights for survey data
  5. Multiple imputation compatible for uncertainty quantification

References



Try the heaping package in your browser

Any scripts or data that you put into this service are public.

heaping documentation built on Feb. 10, 2026, 1:08 a.m.