nonprobsvy
: an R package for modern statistical inference methods based on non-probability samples The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:
The package allows for:
ncvreg
,
Rcpp
, RcppArmadillo
packages),survey
and srvyr
packages when probability
sample is available (Lumley 2004,
2023; Freedman Ellis and Schneider
2024),logit
, probit
and cloglog
) and
outcome (gaussian
, binomial
and poisson
) variables.Details on the use of the package can be found:
You can install the recent version of nonprobsvy
package from main
branch Github with:
remotes::install_github("ncn-foreigners/nonprobsvy")
or install the stable version from CRAN
install.packages("nonprobsvy")
or development version from the dev
branch
remotes::install_github("ncn-foreigners/nonprobsvy@dev")
Consider the following setting where two samples are available: non-probability (denoted as $S_A$ ) and probability (denoted as $S_B$) where set of auxiliary variables (denoted as $\boldsymbol{X}$) is available for both sources while $Y$ and $\boldsymbol{d}$ (or $\boldsymbol{w}$) is present only in probability sample.
| Sample | | Auxiliary variables $\boldsymbol{X}$ | Target variable $Y$ | Design ($\boldsymbol{d}$) or calibrated ($\boldsymbol{w}$) weights | |----|---:|:--:|:--:|:--:| | $S_A$ (non-probability) | 1 | $\checkmark$ | $\checkmark$ | ? | | | … | $\checkmark$ | $\checkmark$ | ? | | | $n_A$ | $\checkmark$ | $\checkmark$ | ? | | $S_B$ (probability) | $n_A+1$ | $\checkmark$ | ? | $\checkmark$ | | | … | $\checkmark$ | ? | $\checkmark$ | | | $n_A+n_B$ | $\checkmark$ | ? | $\checkmark$ |
Suppose $Y$ is the target variable, $\boldsymbol{X}$ is a matrix of auxiliary variables, $R$ is the inclusion indicator. Then, if we are interested in estimating the mean $\bar{\tau}_Y$ or the sum $\tau_Y$ of the of the target variable given the observed data set $(y_k, \boldsymbol{x}_k, R_k)$, we can approach this problem with the possible scenarios:
Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]
library(survey)
library(nonprobsvy)
set.seed(1234567890)
N <- 1e6 ## 1000000
n <- 1000
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs, pop_size = N)
base_w_bd <- N/sum(population$flag_bd1)
Declare svydesign
object with survey
package
sample_prob <- svydesign(ids= ~1, weights = ~ base_w_srs,
data = subset(population, flag_srs == 1),
fpc = ~ pop_size)
sample_prob
#> Independent Sampling design
#> svydesign(ids = ~1, weights = ~base_w_srs, data = subset(population,
#> flag_srs == 1), fpc = ~pop_size)
or with the srvyr
package
sample_prob <- srvyr::as_survey_design(.data = subset(population, flag_srs == 1),
weights = base_w_srs)
sample_prob
Independent Sampling design (with replacement)
Called via srvyr
Sampling variables:
Data variables:
- x1 (dbl), x2 (dbl), y1 (dbl), y2 (dbl), p1 (dbl), p2 (dbl), base_w_srs (dbl), flag_bd1 (int), flag_srs (dbl)
Estimate population mean of y1
based on doubly robust estimator using
IPW with calibration constraints and we specify that auxiliary variables
should not be combined for the inference.
result_dr <- nonprob(
selection = ~ x2,
outcome = y1 + y2 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)
Results
result_dr
#> A nonprob object
#> - estimator type: doubly robust
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9500 (se=0.0414, ci=(2.8689, 3.0312))
#> - variable y2: 1.5762 (se=0.0498, ci=(1.4786, 1.6739))
Mass imputation estimator
result_mi <- nonprob(
outcome = y1 + y2 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)
Results
result_mi
#> A nonprob object
#> - estimator type: mass imputation
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9498 (se=0.0420, ci=(2.8675, 3.0321))
#> - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6398))
Inverse probability weighting estimator
result_ipw <- nonprob(
selection = ~ x2,
target = ~y1+y2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob)
Results
result_ipw
#> A nonprob object
#> - estimator type: inverse probability weighting
#> - method: logit (mle)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9981 (se=0.0137, ci=(2.9713, 3.0249))
#> - variable y2: 1.5906 (se=0.0137, ci=(1.5639, 1.6174))
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.