method_glm: Mass imputation using the generalized linear model method

View source: R/method_glm.R

method_glm    R Documentation

Mass imputation using the generalized linear model method

Description

Outcome model for the mass imputation estimator, fitted as a generalized linear model via the stats::glm function. The population mean is estimated using either the probability sample S_B or known population totals.

Usage

method_glm(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = "gaussian",
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)

Arguments

y_nons

target variable from non-probability sample

X_nons

a model.matrix with auxiliary variables from non-probability sample

X_rand

a model.matrix with auxiliary variables from probability sample

svydesign

a svydesign object

weights

case / frequency weights from non-probability sample

family_outcome

family for the glm model

start_outcome

start parameters (default NULL)

vars_selection

whether variable selection should be conducted

pop_totals

population totals from the nonprob function

pop_size

population size from the nonprob function

control_outcome

controls passed by the control_out function

control_inference

controls passed by the control_inf function (currently not used, for further development)

verbose

parameter passed from the main nonprob function

se

whether standard errors should be calculated

Details

Analytical variance

The variance of the estimated mean has two components, which are estimated as follows:

(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)

\hat{V}_1 = \frac{1}{n_A^2}\sum_{i=1}^{n_A} \hat{e}_i^2 \left\lbrace \boldsymbol{h}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^\prime\hat{\boldsymbol{c}}\right\rbrace^2,

where \hat{e}_i = y_i - m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) and

\widehat{\boldsymbol{c}}=\left\lbrace n_A^{-1} \sum_{i \in A} \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \hat{\boldsymbol{\beta}}\right) \boldsymbol{h}\left(\boldsymbol{x}_i ; \hat{\boldsymbol{\beta}}\right)^{\prime}\right\rbrace^{-1} N^{-1} \sum_{i \in B} w_i \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \hat{\boldsymbol{\beta}}\right).

Under the linear regression model \boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)=\boldsymbol{x}_i and \widehat{\boldsymbol{c}}=\left(n_A^{-1} \sum_{i \in A} \boldsymbol{x}_i \boldsymbol{x}_i^{\prime}\right)^{-1} N^{-1} \sum_{i \in B} w_i \boldsymbol{x}_i .
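The linear regression case can be sketched in a few lines of base R. This is only an illustration on simulated data, not the package's internal code: the object names (x_A, x_B, w_B, c_hat, V_1) are hypothetical, the design weights are assumed equal to N / n_B, and the variance uses the squared-residual form of Kim et al. (2021), i.e. the sum over S_A of \hat{e}_i^2 (x_i' \hat{c})^2 divided by n_A^2.

```r
set.seed(123)

# Hypothetical toy setup; none of these objects come from the package
N   <- 10000   # population size (assumed known)
n_A <- 500     # size of the non-probability sample S_A
n_B <- 200     # size of the probability sample S_B

# S_A: covariates and outcome are observed
x_A <- cbind(1, rnorm(n_A, mean = 1))
y_A <- drop(x_A %*% c(2, 0.5)) + rnorm(n_A)

# S_B: only covariates and design weights are observed
x_B <- cbind(1, rnorm(n_B))
w_B <- rep(N / n_B, n_B)

# Linear outcome model fitted on S_A, so h(x; beta) = x
fit   <- stats::lm.fit(x_A, y_A)
e_hat <- fit$residuals

# c_hat = (n_A^{-1} sum_A x x')^{-1} N^{-1} sum_B w x
c_hat <- solve(crossprod(x_A) / n_A, colSums(w_B * x_B) / N)

# V_1 = n_A^{-2} sum_A e_i^2 (x_i' c_hat)^2
hc  <- drop(x_A %*% c_hat)
V_1 <- sum(e_hat^2 * hc^2) / n_A^2
V_1
```

Because every summand is a squared quantity, V_1 is non-negative by construction, as a variance component should be.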

(b) probability part (S_B with size n_B; denoted as var_prob in the result)

This part relies on the {survey} package; the variance is estimated using the following equation:

\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}.
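To make the double sum concrete, the sketch below evaluates it by hand for one simple design, assuming simple random sampling without replacement, where \pi_i = n_B / N and \pi_{ij} = n_B (n_B - 1) / \{N (N - 1)\} for i \neq j. The pairwise terms combine m(x_i; \hat{\beta}) / \pi_i with m(x_j; \hat{\beta}) / \pi_j. The predictions are simulated and all object names are hypothetical; the package itself delegates this step to {survey}.

```r
set.seed(42)

N   <- 1000   # population size (assumed known)
n_B <- 50     # size of the probability sample S_B

# Hypothetical predictions m(x_i; beta_hat) for the units in S_B
m_hat <- rnorm(n_B, mean = 10, sd = 2)

# First- and second-order inclusion probabilities under SRS without replacement
pi_i  <- rep(n_B / N, n_B)
pi_ij <- matrix(n_B * (n_B - 1) / (N * (N - 1)), n_B, n_B)
diag(pi_ij) <- pi_i   # pi_ii = pi_i

# Double sum over all pairs (i, j) in S_B
z   <- m_hat / pi_i
V_2 <- sum((pi_ij - outer(pi_i, pi_i)) / pi_ij * outer(z, z)) / N^2

# Textbook closed form for this design: (1 - n_B / N) * s^2 / n_B
V_2_srs <- (1 - n_B / N) * stats::var(m_hat) / n_B

all.equal(V_2, V_2_srs)
```

For this design the general formula reduces algebraically to the familiar finite-population-corrected variance of a sample mean, which is why the two values agree up to floating-point error.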

Note that \hat{V}_2 in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.

Furthermore, if only population totals/means are known and they are assumed to be fixed, we set \hat{V}_2=0.
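The two sources of auxiliary information named in the Description (a probability sample or known population totals) lead to the same kind of point estimate. The sketch below illustrates this for a Gaussian model with identity link only; it is a simplified illustration on simulated data, not the package's implementation. The objects x_A, x_B, w_B and pop_totals are hypothetical, and the "known" totals are simply set to the weighted totals from S_B so that both routes can be compared.

```r
set.seed(1)

N   <- 5000   # population size (assumed known)
n_A <- 300    # non-probability sample size
n_B <- 100    # probability sample size

# Non-probability sample: outcome and covariate observed
x_A <- rnorm(n_A, mean = 1)
y_A <- 1 + 2 * x_A + rnorm(n_A)

# Probability sample: covariate and design weights only
x_B <- rnorm(n_B)
w_B <- rep(N / n_B, n_B)

# Outcome model fitted on the non-probability sample
fit <- stats::glm(y_A ~ x_A, family = gaussian())

# Route 1: impute predictions for S_B and take the weighted mean
y_B_pred <- predict(fit, newdata = data.frame(x_A = x_B), type = "response")
mu_hat_B <- sum(w_B * y_B_pred) / sum(w_B)

# Route 2: plug population totals (intercept total N, total of x) into the model
pop_totals <- c(N, sum(w_B * x_B))   # hypothetical "known" totals
mu_hat_tot <- sum(pop_totals / N * coef(fit))

c(mu_hat_B = mu_hat_B, mu_hat_tot = mu_hat_tot)
```

With an identity link the prediction is linear in the covariates, so both routes coincide whenever the totals match the weighted sample totals. For other link functions this equality does not hold in general.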

Value

An object of class nonprob_method, which is a list with the following entries:

model_fitted

fitted model, either a glm.fit or a cv.ncvreg object

y_nons_pred

predicted values for the non-probability sample

y_rand_pred

predicted values for the probability sample or population totals

coefficients

coefficients for the model (if available)

svydesign

an updated survey.design2 object (a new column y_hat_MI is added)

y_mi_hat

estimated population mean for the target variable

vars_selection

whether variable selection was performed

var_prob

variance for the probability sample component (if available)

var_nonprob

variance for the non-probability sample component

var_total

total variance: var_prob + var_nonprob when both components are available, otherwise a single scalar

model

model type (character "glm")

family

family type (character, e.g. "gaussian")

References

Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(3), 941-963.

Examples


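library(survey)     # provides svydesign()
library(nonprobsvy) # provides method_glm() and the admin and jvs datasets

data(admin)
data(jvs)

# stratified design for the probability sample
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)

# fit the outcome model on the non-probability sample (admin)
# and predict for the probability sample (jvs)
res_glm <- method_glm(y_nons = admin$single_shift,
                      X_nons = model.matrix(~ region + private + nace + size, admin),
                      X_rand = model.matrix(~ region + private + nace + size, jvs),
                      svydesign = jvs_svy)

res_glm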


nonprobsvy documentation built on April 3, 2025, 7:08 p.m.