projection_randomforest: Projection Estimator with Random Forest Algorithm
In sae.projection: Small Area Estimation Using Model-Assisted Projection Method

Description

According to Kim and Rao (2012), the synthetic data obtained through the model-assisted projection method can provide a useful tool for efficient domain estimation when the sample in survey B is much larger than the sample in survey A.

The function projects estimated values from a small survey (survey A) onto an independent large survey (survey B) using the random forest classification algorithm. The two surveys are statistically independent, but the projection relies on auxiliary variables that both surveys share. The process includes data preprocessing, feature selection, model training, and domain-level estimation under a "two stages, one phase" survey design. The function applies either standard estimation or bias-corrected estimation, depending on the parameter bias_correction.
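The projection idea described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the names `survey_a`, `survey_b`, `y_hat`, and `proj_est` are hypothetical, and a logistic regression stands in for the random forest so that the sketch needs no extra packages. It does not reproduce the package's internal steps.

```r
set.seed(1)

# Hypothetical small survey A: observes the binary target y and an auxiliary x
survey_a <- data.frame(
  x = rnorm(200),
  w = runif(200, 1, 3)
)
survey_a$y <- rbinom(200, 1, plogis(0.5 + 1.2 * survey_a$x))

# Hypothetical large survey B: same auxiliary x plus a domain label, no target
survey_b <- data.frame(
  x = rnorm(2000),
  w = runif(2000, 5, 15),
  domain = sample(c("D1", "D2", "D3"), 2000, replace = TRUE)
)

# Stand-in working model (logistic regression, NOT a random forest)
fit <- glm(y ~ x, family = binomial(), data = survey_a)

# Synthetic values: predicted classes projected onto survey B
p_hat <- predict(fit, newdata = survey_b, type = "response")
survey_b$y_hat <- as.integer(p_hat > 0.5)

# Weighted domain proportions computed from the synthetic values
num <- tapply(survey_b$w * survey_b$y_hat, survey_b$domain, sum)
den <- tapply(survey_b$w, survey_b$domain, sum)
proj_est <- num / den
proj_est
```

The domain estimates use only survey B's weights and the projected values, which is why the approach benefits from survey B being much larger than survey A.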

bias_correction = TRUE can only be used if psu, ssu, and strata are all available in data_model. Otherwise, the function automatically falls back to bias_correction = FALSE.
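To make the role of bias correction concrete, the sketch below shows one common form of a bias-corrected projection estimate of a mean: the projected mean from survey B plus the weighted mean residual observed in survey A. All vectors and names here (`y_a`, `y_hat_a`, `w_a`, `y_hat_b`, `w_b`) are hypothetical toy values, and the package's own computation, which also accounts for the survey design, may differ.

```r
# Hypothetical survey A: observed target, model prediction, and weight
y_a     <- c(1, 0, 1, 1, 0)
y_hat_a <- c(1, 0, 0, 1, 1)
w_a     <- c(2, 2, 1, 3, 2)

# Hypothetical survey B: only predictions and weights are available
y_hat_b <- c(1, 1, 0, 0, 1, 0, 1, 1)
w_b     <- rep(10, 8)

# Projection estimate of the mean, using survey B only
proj_mean <- sum(w_b * y_hat_b) / sum(w_b)

# Estimated prediction bias: weighted mean residual in survey A
bias_hat <- sum(w_a * (y_a - y_hat_a)) / sum(w_a)

# Bias-corrected estimate
corrected_mean <- proj_mean + bias_hat
c(projection = proj_mean, bias = bias_hat, corrected = corrected_mean)
```

The residual term needs both observed and predicted values, so it can only be computed from survey A.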

Usage

projection_randomforest(
  data_model,
  target_column,
  predictor_cols,
  data_proj,
  domain1,
  domain2,
  psu,
  ssu = NULL,
  strata = NULL,
  weights,
  split_ratio = 0.8,
  feature_selection = TRUE,
  bias_correction = FALSE
)

Arguments

`data_model`	The training dataset, consisting of auxiliary variables and the target variable.
`target_column`	The name of the target column in the `data_model`.
`predictor_cols`	A vector of predictor column names.
`data_proj`	The data for projection (prediction), onto which the trained model's predictions are projected. It must contain the same auxiliary variables as the `data_model`.
`domain1`	Name of the first domain variable for survey estimation (e.g., "province").
`domain2`	Name of the second domain variable for survey estimation (e.g., "regency").
`psu`	Primary sampling units, representing the structure of the sampling frame.
`ssu`	Secondary sampling units, representing the structure of the sampling frame (default is NULL).
`strata`	Stratification variable, ensuring that specific subgroups are represented (default is NULL).
`weights`	Weights used for the direct estimation from `data_model` and indirect estimation from `data_proj`.
`split_ratio`	Proportion of data used for training (default is 0.8, meaning 80 percent for training and 20 percent for validation).
`feature_selection`	Logical; if `TRUE`, feature selection is applied to the predictor variables. Default is `TRUE`.
`bias_correction`	Logical; if `TRUE`, then bias correction is applied, if `FALSE`, then bias correction is not applied. Default is `FALSE`.
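As an illustration of what `split_ratio` controls, the following base R sketch splits a toy data frame into training and validation parts. The data and the names `train_idx`, `train_set`, and `valid_set` are hypothetical, and a simple random split is assumed here; the function itself may partition the data differently (for example, stratified by the target).

```r
set.seed(42)

# Hypothetical modelling data with 100 rows
data_model <- data.frame(id = 1:100, y = rbinom(100, 1, 0.4))
split_ratio <- 0.8

# Draw row indices for the training part
n <- nrow(data_model)
train_idx <- sample(seq_len(n), size = floor(split_ratio * n))

train_set <- data_model[train_idx, ]   # used to fit the model
valid_set <- data_model[-train_idx, ]  # held out for validation

c(train = nrow(train_set), validation = nrow(valid_set))
```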

Value

A list containing the following elements:

`model`	The trained Random Forest model.
`importance`	Feature importance, showing which features contributed most to the model's predictions.
`train_accuracy`	Accuracy of the model on the training set.
`validation_accuracy`	Accuracy of the model on the validation set.
`validation_performance`	Confusion matrix for the validation set, with performance metrics such as accuracy, precision, and recall.
`data_proj`	The projection data with predicted values.

If `bias_correction = FALSE`:

`Domain1`	Estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
`Domain2`	Estimations for Domain 2, including estimated values, variance, and relative standard error (RSE).

If `bias_correction = TRUE`:

`Direct`	Direct estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
`Domain1_corrected_bias`	Bias-corrected estimations for Domain 1, including estimated values, variance, and relative standard error (RSE).
`Domain2_corrected_bias`	Bias-corrected estimations for Domain 2, including estimated values, variance, and relative standard error (RSE).
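For readers unfamiliar with RSE, the sketch below computes it from an estimate and its variance using the conventional definition, standard error divided by the estimate, expressed in percent. The values and the names `estimate`, `variance`, and `rse` are hypothetical, and it is an assumption that the returned tables follow this same convention and scale.

```r
# Hypothetical domain estimates and their estimated variances
estimate <- c(D1 = 0.40, D2 = 0.25)
variance <- c(D1 = 0.0004, D2 = 0.0009)

# Relative standard error in percent: 100 * SE / estimate
rse <- sqrt(variance) / estimate * 100
rse
```

A lower RSE indicates a more precise domain estimate relative to its size.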

References

Kim, J. K., & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99(1), 85-100.

Examples

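library(sae.projection)
library(survey)
library(caret)
library(dplyr)

# Example survey data: survey A (small, has the target) and survey B (large)
data_A <- df_svy_A
data_B <- df_svy_B

# Get predictor variable names (columns 5 to 19) from data_model
x_predictors <- data_A %>% select(5:19) %>% names()

# Run projection_randomforest with bias correction requested.
# Per the Description, bias correction needs psu, ssu, and strata in
# data_model; with ssu and strata left NULL, the function may fall back
# to bias_correction = FALSE.
rf_proj_corrected <- projection_randomforest(
  data_model = data_A,
  target_column = "Y",
  predictor_cols = x_predictors,
  data_proj = data_B,
  domain1 = "province",
  domain2 = "regency",
  psu = "num",
  ssu = NULL,
  strata = NULL,
  weights = "weight",
  feature_selection = TRUE,
  bias_correction = TRUE
)

# Inspect the direct and bias-corrected domain estimates
rf_proj_corrected$Direct
rf_proj_corrected$Domain1_corrected_bias
rf_proj_corrected$Domain2_corrected_bias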

sae.projection documentation built on Aug. 8, 2025, 7:32 p.m.