method_npar: Mass imputation using non-parametric model method
In nonprobsvy: Inference Based on Non-Probability Samples

method_npar

R Documentation

Mass imputation using non-parametric model method

Description

Fits the outcome model for the mass imputation estimator using local regression (loess) via stats::loess. The population mean is then estimated from the model predictions on the probability sample S_B.
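The idea can be illustrated with a minimal base R sketch. It is not the package's internal code: the toy population, the object names (`idx_a`, `idx_b`, `d_b`, `mu_mi`) and the choice of `surface = "direct"` are assumptions made for illustration only.

```r
set.seed(1)

# Hypothetical toy population (all names are illustrative).
N <- 5000
x <- runif(N, 0, 4)
y <- 1 + sin(x) + rnorm(N, sd = 0.3)

# Non-probability sample S_A: y and x observed, selection depends on x.
idx_a <- sample(N, 400, prob = 0.2 + x / 4)

# Probability sample S_B: only x is used; simple random sample with
# design weights d_i = N / n_B.
idx_b <- sample(N, 800)
d_b <- rep(N / length(idx_b), length(idx_b))

# Step 1: fit the outcome model m(x) on S_A with loess.
# surface = "direct" lets predict() evaluate points outside the x-range of S_A.
fit <- loess(y ~ x,
             data = data.frame(x = x[idx_a], y = y[idx_a]),
             family = "gaussian",
             control = loess.control(surface = "direct"))

# Step 2: impute y for every unit in S_B.
y_b_hat <- predict(fit, newdata = data.frame(x = x[idx_b]))

# Step 3: design-weighted mean of the imputed values.
mu_mi <- sum(d_b * y_b_hat) / sum(d_b)
mu_mi
```

In this sketch the naive mean of y in S_A would be biased because selection depends on x, while the imputed mean only relies on the fitted relationship between y and x.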

Usage

method_npar(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = "gaussian",
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)

Arguments

`y_nons`	target variable from non-probability sample
`X_nons`	a `model.matrix` with auxiliary variables from non-probability sample
`X_rand`	a `model.matrix` with auxiliary variables from probability sample
`svydesign`	a svydesign object
`weights`	case / frequency weights from non-probability sample (default NULL)
`family_outcome`	family for the glm model
`start_outcome`	a place holder (not used in `method_npar`)
`vars_selection`	whether variable selection should be conducted
`pop_totals`	a place holder (not used in `method_npar`)
`pop_size`	population size from the `nonprob` function
`control_outcome`	controls passed by the `control_out` function
`control_inference`	controls passed by the `control_inf` function
`verbose`	parameter passed from the main `nonprob` function
`se`	whether standard errors should be calculated
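
In typical use these arguments are prepared by the main `nonprob` function. As a rough illustration of what the `X_nons` and `X_rand` inputs look like, the following sketch builds two design matrices with `stats::model.matrix`. The data frames and column names are hypothetical, and whether `method_npar` expects an intercept column is not stated here, so treat that detail as an assumption.

```r
# Hypothetical toy data; column names are illustrative only.
nons <- data.frame(y  = c(2.1, 3.4, 1.9),
                   x1 = c(0.5, 1.2, 0.3),
                   x2 = c(1, 0, 1))
rand <- data.frame(x1 = c(0.7, 0.9, 1.5, 0.2),
                   x2 = c(0, 1, 1, 0))

# Use the same right-hand side for both samples so the columns line up.
rhs <- ~ x1 + x2
X_nons <- model.matrix(rhs, data = nons)  # auxiliary variables, non-probability sample
X_rand <- model.matrix(rhs, data = rand)  # auxiliary variables, probability sample
y_nons <- nons$y                          # target variable, non-probability sample

colnames(X_nons)
```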

Details

Analytical variance

The variance of the estimated mean is the sum of two components, estimated as follows:

(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)

\hat{V}_1 = \frac{1}{N^2} \sum_{i=1}^{n_A} \left\lbrace\hat{g}_B(\boldsymbol{x}_i)\right\rbrace^{2} \hat{e}_i^2,

where \hat{e}_i = y_i - \hat{m}(x_i) is the residual and \hat{g}_B(\boldsymbol{x}_i) = \left\lbrace \pi_B(\boldsymbol{x}_i) \right\rbrace^{-1} can be estimated in various ways. In the package, \hat{g}_B(\boldsymbol{x}_i) is estimated using \pi_B(\boldsymbol{x}_i) = E(R | \boldsymbol{x}_i), as suggested by Chen et al. (2022, p. 6). In particular, this is currently supported using stats::loess with the "gaussian" family.
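
Once the residuals and the values of \hat{g}_B are available, the formula for \hat{V}_1 is a plain weighted sum of squares. The sketch below evaluates it for made-up numbers. The helper `v1_hat` is hypothetical (it is not a nonprobsvy function), and the `g_hat` values are assumed to be given rather than estimated with loess.

```r
# Hypothetical helper (not part of nonprobsvy): plug-in estimate of V_1.
v1_hat <- function(y, m_hat, g_hat, N) {
  stopifnot(length(y) == length(m_hat), length(y) == length(g_hat))
  e_hat <- y - m_hat                # residuals e_i = y_i - m_hat(x_i)
  sum(g_hat^2 * e_hat^2) / N^2      # (1 / N^2) * sum of g_i^2 * e_i^2
}

# Made-up inputs for four units of S_A.
y     <- c(3, 5, 4, 6)
m_hat <- c(2.5, 5.5, 4, 5)
g_hat <- c(10, 20, 10, 20)          # assumed values of 1 / pi_B(x_i)
N     <- 100

v1 <- v1_hat(y, m_hat, g_hat, N)
v1  # 0.0525
```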

(b) probability part (S_B with size n_B; denoted as var_prob in the result)

This part uses functionalities of the {survey} package and the variance is estimated using the following equation:

\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{\hat{m}(x_i)}{\pi_i} \frac{\hat{m}(x_j)}{\pi_j}.

Note that \hat{V}_2 in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
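
As an illustration of one such case, the double sum can be evaluated directly when S_B is a simple random sample without replacement, where \pi_i = n_B / N and \pi_{ij} = n_B (n_B - 1) / \{N (N - 1)\} for i \neq j. The sketch below is an assumption-laden example in base R, not the package's implementation (which relies on {survey}); the helper `v2_hat_srs` is hypothetical. Under this design the double sum reduces to the familiar (1 - n_B / N) s^2 / n_B, which the code checks numerically.

```r
# Hypothetical helper: direct evaluation of V_2, assuming simple random
# sampling without replacement of n units from a population of size N.
v2_hat_srs <- function(m_hat, N) {
  n <- length(m_hat)
  pi_i <- n / N                                        # first-order probabilities
  pi_ij <- matrix(n * (n - 1) / (N * (N - 1)), n, n)   # joint probabilities, i != j
  diag(pi_ij) <- pi_i                                  # pi_ii = pi_i
  z <- m_hat / pi_i                                    # m_hat(x_i) / pi_i
  delta <- (pi_ij - pi_i^2) / pi_ij
  sum(delta * outer(z, z)) / N^2
}

set.seed(42)
m_hat <- rnorm(50, mean = 10, sd = 2)   # made-up predictions on S_B
N <- 1000

v2 <- v2_hat_srs(m_hat, N)
closed_form <- (1 - length(m_hat) / N) * var(m_hat) / length(m_hat)
all.equal(v2, closed_form)
```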

Value

an object of class nonprob_method, which is a list with the following entries:

model_fitted: fitted model object returned by stats::loess
y_nons_pred: predicted values for the non-probability sample
y_rand_pred: predicted values for the probability sample or population totals
coefficients: coefficients for the model (if available)
svydesign: an updated survey.design2 object (new column y_hat_MI is added)
y_mi_hat: estimated population mean for the target variable
vars_selection: whether variable selection was performed
var_prob: variance for the probability sample component (if available)
var_nonprob: variance for the non-probability sample component
model: model type (character "npar")

References

Chen, S., Yang, S., & Kim, J. K. (2022). Nonparametric mass imputation for data integration. Journal of Survey Statistics and Methodology, 10(1), 1-24.

Examples


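library(survey)      # provides svydesign()
library(nonprobsvy)  # provides nonprob()

set.seed(123123123)

# Population of size N with two auxiliary variables and two outcomes
N <- 10000
n_a <- 500
n_b <- 1000
n_b1 <- 0.7*n_b
n_b2 <- 0.3*n_b
x1 <- rnorm(N, 2, 1)
x2 <- rnorm(N, 2, 1)
y1 <- rnorm(N, 0.3 + 2*x1+ 2*x2, 1)
y2 <- rnorm(N, 0.3 + 0.5*x1^2+ 0.5*x2^2, 1)
strata <- x1 <= 2
pop <- data.frame(x1, x2, y1, y2, strata)

# Probability sample: simple random sample with design weights N / n_a
sample_a <- pop[sample(1:N, n_a),]
sample_a$w_a <- N/n_a
sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a)

# Non-probability sample: units with x1 <= 2 are over-represented (70% vs 30%)
pop1 <- subset(pop, strata == TRUE)
pop2 <- subset(pop, strata == FALSE)
sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ],
                  pop2[sample(1:nrow(pop2), n_b2), ])

# Mass imputation with the non-parametric (loess) outcome model for y1 and y2
res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2,
                      data = sample_b,
                      svydesign = sample_a_svy,
                      method_outcome = "npar")
res_y_npar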

nonprobsvy documentation built on June 8, 2025, 12:36 p.m.