method_nn: Mass imputation using nearest neighbours matching method

View source: R/method_nn.R

method_nn    R Documentation

Mass imputation using nearest neighbours matching method

Description

Mass imputation using the nearest neighbours approach described in Yang et al. (2021). The implementation is currently based on the RANN::nn2 function, so it uses the Euclidean distance to match units from S_A (non-probability sample) to S_B (probability sample). The mean is estimated using the S_B sample.
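The matching step can be illustrated with a minimal base-R sketch. This is not the package's implementation (which relies on RANN::nn2); the helper name nn_impute and the toy data are hypothetical, and a brute-force distance computation is used only to keep the example free of dependencies.

```r
# Hypothetical helper: for each row of X_rand (S_B), find the k nearest rows
# of X_nons (S_A) in Euclidean distance and average their y values.
nn_impute <- function(y_nons, X_nons, X_rand, k = 1) {
  X_nons <- as.matrix(X_nons)
  X_rand <- as.matrix(X_rand)
  vapply(seq_len(nrow(X_rand)), function(i) {
    # squared Euclidean distances from unit i in S_B to every unit in S_A
    d2 <- rowSums(sweep(X_nons, 2, X_rand[i, ])^2)
    mean(y_nons[order(d2)[seq_len(k)]])
  }, numeric(1))
}

# Toy data (made up for illustration)
X_nons <- cbind(x1 = c(0, 1, 2, 3), x2 = c(0, 1, 0, 1))
y_nons <- c(10, 20, 30, 40)
X_rand <- cbind(x1 = c(0.1, 2.9), x2 = c(0.1, 0.9))
d_rand <- c(50, 50)  # assumed design weights of S_B

# Imputed values for S_B and the weighted mean estimated on S_B
y_rand_pred <- nn_impute(y_nons, X_nons, X_rand, k = 1)
mu_hat <- sum(d_rand * y_rand_pred) / sum(d_rand)
```

In this toy case each S_B unit sits next to exactly one S_A unit, so the imputed values are simply the y values of those donors.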

Usage

method_nn(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = NULL,
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)

Arguments

y_nons

target variable from non-probability sample

X_nons

a model.matrix with auxiliary variables from non-probability sample

X_rand

a model.matrix with auxiliary variables from probability sample

svydesign

a svydesign object

weights

case / frequency weights from non-probability sample

family_outcome

a placeholder (not used in method_nn)

start_outcome

a placeholder (not used in method_nn)

vars_selection

whether variable selection should be conducted

pop_totals

a placeholder (not used in method_nn)

pop_size

population size from the nonprob function

control_outcome

controls passed by the control_out function

control_inference

controls passed by the control_inf function

verbose

parameter passed from the main nonprob function

se

whether standard errors should be calculated

Details

Analytical variance

The variance of the mean is estimated based on the following approach

(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)

This may be estimated using

\hat{V}_1 = \frac{1}{N^2}\sum_{i \in S_A}\frac{1-\hat{\pi}_B(\boldsymbol{x}_i)}{\hat{\pi}_B(\boldsymbol{x}_i)}\hat{\sigma}^2(\boldsymbol{x}_i),

where \hat{\pi}_B(\boldsymbol{x}_i) is an estimator of the propensity scores, currently estimated by the constant n_A/N, and \hat{\sigma}^2(\boldsymbol{x}_i) is estimated by the average of (y_i - y_i^*)^2.
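As a rough numerical illustration of this formula, the sketch below plugs in the constant propensity n_A/N and a single \hat{\sigma}^2 shared by all units. The helper name v1_analytic and the input values are hypothetical, and how the matched values y_i^* are obtained for units in S_A is assumed rather than taken from the package code.

```r
# Hypothetical helper: analytical V_1 with a constant propensity pi_B = n_A / N
# and a common sigma^2 estimated by the mean of (y_i - y_i^*)^2 over S_A.
v1_analytic <- function(y_nons, y_nons_pred, N) {
  n_A    <- length(y_nons)
  pi_B   <- n_A / N
  sigma2 <- mean((y_nons - y_nons_pred)^2)
  # sum over S_A of (1 - pi_B) / pi_B * sigma2, divided by N^2
  n_A * (1 - pi_B) / pi_B * sigma2 / N^2
}

y_nons      <- c(10, 20, 30, 40)
y_nons_pred <- c(12, 18, 33, 39)  # made-up matched values y_i^*
v1 <- v1_analytic(y_nons, y_nons_pred, N = 100)
```

Because both ingredients are constants here, the sum over S_A collapses to n_A times a single term.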

Chlebicki et al. (2025, Algorithm 2) proposed a non-parametric mini-bootstrap estimator (without assuming that it is consistent) with good finite population properties. This bootstrap can be applied by setting control_inference = control_inf(nn_exact_se = TRUE) and can be summarized as follows:

  1. Sample n_A units from S_A with replacement to create S_A' (if pseudo-weights are present, inclusion probabilities should be proportional to their inverses).

  2. Match units from S_B to S_A' to obtain predictions y^*={k}^{-1}\sum_{k}y_k.

  3. Estimate \hat{\mu}=\frac{1}{N} \sum_{i \in S_B} d_i y_i^*.

  4. Repeat steps 1-3 M times (we set M=50 in our simulations; this is hard-coded).

  5. Estimate \hat{V}_1=\text{var}({\hat{\boldsymbol{\mu}}}) obtained from simulations and save it as var_nonprob.
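The five steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the function name nn_mini_boot, its arguments, and the simulated data are hypothetical, and matching is done by brute force instead of RANN::nn2.

```r
# Hypothetical sketch of the mini-bootstrap in steps 1-5 (not the package code).
nn_mini_boot <- function(y_nons, X_nons, X_rand, d_rand, N = sum(d_rand),
                         k = 1, M = 50, pseudo_weights = NULL) {
  X_nons <- as.matrix(X_nons)
  X_rand <- as.matrix(X_rand)
  n_A <- length(y_nons)
  # With pseudo-weights, draw with probability proportional to their inverses
  prob <- if (is.null(pseudo_weights)) NULL else 1 / pseudo_weights
  mu_b <- replicate(M, {                                       # step 4: M repeats
    idx <- sample.int(n_A, n_A, replace = TRUE, prob = prob)   # step 1: S_A'
    y_b <- y_nons[idx]
    X_b <- X_nons[idx, , drop = FALSE]
    y_star <- vapply(seq_len(nrow(X_rand)), function(i) {      # step 2: match
      d2 <- rowSums(sweep(X_b, 2, X_rand[i, ])^2)
      mean(y_b[order(d2)[seq_len(k)]])
    }, numeric(1))
    sum(d_rand * y_star) / N                                   # step 3: mu_hat
  })
  var(mu_b)                                                    # step 5: V_1
}

# Simulated data (made up for illustration)
set.seed(123)
X_nons <- matrix(rnorm(200), ncol = 2)
y_nons <- 1 + X_nons[, 1] + rnorm(100, sd = 0.5)
X_rand <- matrix(rnorm(60), ncol = 2)
d_rand <- rep(1000 / 30, 30)

var_nonprob <- nn_mini_boot(y_nons, X_nons, X_rand, d_rand)
```

M defaults to 50 here only to mirror the value mentioned in step 4; the sketch exposes it as an argument for experimentation.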

(b) probability part (S_B with size n_B; denoted as var_prob in the result)

This part uses functionalities of the {survey} package and the variance is estimated using the following equation:

\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j},

where y^*_i and y_j^* are values imputed as the average of the k nearest neighbours, i.e. {k}^{-1}\sum_{k}y_k. Note that \hat{V}_2 can in principle be estimated in various ways depending on the type of the design and whether the population size is known.
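To make the double sum concrete, the sketch below evaluates it directly for a small made-up sample. It does not reproduce what the {survey} package does internally; the helper name v2_double_sum, the inclusion probabilities, and the choice of a Poisson design (where \pi_{ij} = \pi_i \pi_j for i \neq j) are assumptions made for illustration.

```r
# Hypothetical helper: evaluate the double sum for V_2 given imputed values
# y_star, first-order probabilities pi_i, and the joint-probability matrix pi_ij.
v2_double_sum <- function(y_star, pi_i, pi_ij, N) {
  z     <- y_star / pi_i
  delta <- (pi_ij - outer(pi_i, pi_i)) / pi_ij
  sum(delta * outer(z, z)) / N^2
}

# Assumed Poisson design: pi_ij = pi_i * pi_j for i != j and pi_ii = pi_i
y_star <- c(10, 40, 25)
pi_i   <- c(0.1, 0.2, 0.5)
pi_ij  <- outer(pi_i, pi_i)
diag(pi_ij) <- pi_i
N <- 17

v2 <- v2_double_sum(y_star, pi_i, pi_ij, N)

# Under this design only the diagonal terms survive
v2_diag <- sum((1 - pi_i) * (y_star / pi_i)^2) / N^2
```

For designs with dependent selections (for example stratified sampling without replacement) the off-diagonal terms do not vanish, which is one reason the estimator can take different forms.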

Value

an object of class nonprob_method, which is a list with the following entries

model_fitted

RANN::nn2 object

y_nons_pred

predicted values for the non-probability sample (query to itself)

y_rand_pred

predicted values for the probability sample

coefficients

coefficients for the model (if available)

svydesign

an updated survey.design2 object (new column y_hat_MI is added)

y_mi_hat

estimated population mean for the target variable

vars_selection

whether variable selection was performed (not implemented, for further development)

var_prob

variance for the probability sample component (if available)

var_nonprob

variance for the non-probability sample component

var_tot

total variance; where possible it is var_prob + var_nonprob, otherwise just a scalar

model

model type (character "nn")

family

placeholder for the NN approach information

References

Yang, S., Kim, J. K., & Hwang, Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, 47(1), 29-58.

Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.

Examples


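library(survey)
library(nonprobsvy)

# Load the example data sets
data(admin)
data(jvs)

# Stratified design for the probability sample (S_B)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)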

res_nn <- method_nn(y_nons = admin$single_shift,
                    X_nons = model.matrix(~ region + private + nace + size, admin),
                    X_rand = model.matrix(~ region + private + nace + size, jvs),
                    svydesign = jvs_svy)

res_nn


nonprobsvy documentation built on April 3, 2025, 7:08 p.m.