sae.projection: A Model-Assisted Projection Estimator for Combining Independent Surveys"

Introduction

The ma_projection() function implements a model-assisted projection estimator for combining information from two independent surveys. This method is especially useful in survey sampling scenarios where:

This vignette illustrates how to use ma_projection() for domain-level estimation using various supervised learning models, including machine learning techniques via the parsnip interface.

Method Overview

The approach follows the work of Kim & Rao (2012), where a working model is trained on Survey 2 to predict the outcome variable. Predictions are made for the auxiliary-only Survey 1 data. These predictions are then aggregated by domain to generate small area estimates.

Required Packages

library(sae.projection)
library(dplyr)
library(tidymodels)
library(bonsai)  # for modern tree-based models

Example: Income Estimation Using Linear Regression

# Filter non-missing values for income
svy22_income <- df_svy22 %>% filter(!is.na(income))
svy23_income <- df_svy23 %>% filter(!is.na(income))

# Fit projection model
lm_result <- ma_projection(
  income ~ age + sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = linear_reg(),
  data_model = svy22_income,
  data_proj = svy23_income,
  nest = TRUE
)

# View results
head(lm_result$df_result)

Example: Binary Outcome Using Logistic Regression

# Filter youth population for NEET classification
svy22_neet <- df_svy22 %>% filter(between(age, 15, 24))
svy23_neet <- df_svy23 %>% filter(between(age, 15, 24))

# Fit logistic regression model
lr_result <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = ~ PSU,
  weight = ~ WEIGHT,
  strata = ~ STRATA,
  domain = ~ PROV + REGENCY,
  working_model = logistic_reg(),
  data_model = svy22_neet,
  data_proj = svy23_neet,
  nest = TRUE
)

# View results
head(lr_result$df_result)

Example: LightGBM with Hyperparameter Tuning

# Define LightGBM model with tuning
lgbm_model <- boost_tree(
  mtry = tune(), trees = tune(), min_n = tune(),
  tree_depth = tune(), learn_rate = tune(),
  engine = "lightgbm"
)

# Fit with cross-validation
lgbm_result <- ma_projection(
  formula = neet ~ sex + edu + disability,
  cluster_ids = "PSU",
  weight = "WEIGHT",
  strata = "STRATA",
  domain = c("PROV", "REGENCY"),
  working_model = lgbm_model,
  data_model = svy22_neet,
  data_proj = svy23_neet,
  cv_folds = 3,
  tuning_grid = 5,
  nest = TRUE
)

# View results
head(lgbm_result$df_result)

Supported Models

ma_projection() supports many working models using the parsnip interface, including:

References

Kim, J. K., & Rao, J. N. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85–100. doi:10.1093/biomet/asr063

Conclusion

ma_projection() provides a flexible and robust way to combine survey data using modern modeling tools. It supports a wide range of use cases including socioeconomic indicators, health estimates, and more.



Try the sae.projection package in your browser

Any scripts or data that you put into this service are public.

sae.projection documentation built on Aug. 8, 2025, 7:32 p.m.