variance-estimators | R Documentation |
This help page describes variance estimators
which are commonly used for survey samples. These variance estimators
can be used as the basis of the generalized replication methods, implemented
with the functions as_fays_gen_rep_design()
,
as_gen_boot_design()
,
make_fays_gen_rep_factors()
,
or make_gen_boot_factors()
Let s
denote the selected sample of size n
, with elements i=1,\dots,n
.
Element i
in the sample had probability \pi_i
of being included in the sample.
The pair of elements ij
was sampled with probability \pi_{ij}
.
The population total for a variable is denoted Y = \sum_{i \in U}y_i
,
and the Horvitz-Thompson estimator for \hat{Y}
is
denoted \hat{Y} = \sum_{i \in s} y_i/\pi_i
. For convenience,
we denote \breve{y}_i = y_i/\pi_i
.
The true sampling variance of \hat{Y}
is denoted V(\hat{Y})
,
while an estimator of this sampling variance is denoted v(\hat{Y})
.
The Horvitz-Thompson variance estimator:
v(\hat{Y}) = \sum_{i \in s}\sum_{j \in s} (1 - \frac{\pi_i \pi_j}{\pi_{ij}}) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}
The Yates-Grundy variance estimator:
v(\hat{Y}) = -\frac{1}{2}\sum_{i \in s}\sum_{j \in s} (1 - \frac{\pi_i \pi_j}{\pi_{ij}}) (\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j})^2
The Poisson Horvitz-Thompson variance estimator
is simply the Horvitz-Thompson variance estimator, but
where \pi_{ij}=\pi_i \times \pi_j
, which is the case for Poisson sampling.
The Stratified Multistage SRS variance estimator is the recursive variance estimator proposed by Bellhouse (1985) and used in the 'survey' package's function svyrecvar. In the case of simple random sampling without replacement (with one or more stages), this estimator exactly matches the Horvitz-Thompson estimator.
The estimator can be used for any number of sampling stages. For illustration, we describe its use for two sampling stages.
v(\hat{Y}) = \hat{V}_1 + \hat{V}_2
where
\hat{V}_1 = \sum_{h=1}^{H} (1 - \frac{n_h}{N_h})\frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi.} - \bar{y}_{hi.})^2
and
\hat{V}_2 = \sum_{h=1}^{H} \frac{n_h}{N_h} \sum_{i=1}^{n_h}v_{hi}(y_{hi.})
where n_h
is the number of sampled clusters in stratum h
,
N_h
is the number of population clusters in stratum h
,
y_{hi.}
is the weighted cluster total in cluster i
of stratum h
,
\bar{y}_{hi.}
is the mean weighted cluster total of stratum h
,
(\bar{y}_{hi.} = \frac{1}{n_h}\sum_{i=1}^{n_h}y_{hi.}
), and
v_{hi}(y_{hi.})
is the estimated sampling variance of y_{hi.}
.
The Ultimate Cluster variance estimator is simply the stratified multistage SRS variance estimator, but ignoring variances from later stages of sampling.
v(\hat{Y}) = \hat{V}_1
This is the variance estimator used in the 'survey' package when the user specifies
option(survey.ultimate.cluster = TRUE)
or uses svyrecvar(..., one.stage = TRUE)
.
When the first-stage sampling fractions are small, analysts often omit the finite population corrections (1-\frac{n_h}{N_h})
when using the ultimate cluster estimator.
The SD1 and SD2 variance estimators are "successive difference" estimators sometimes used for systematic sampling designs. Ash (2014) describes each estimator as follows:
\hat{v}_{S D 1}(\hat{Y}) = \left(1-\frac{n}{N}\right) \frac{n}{2(n-1)} \sum_{k=2}^n\left(\breve{y}_k-\breve{y}_{k-1}\right)^2
\hat{v}_{S D 2}(\hat{Y}) = \left(1-\frac{n}{N}\right) \frac{1}{2}\left[\sum_{k=2}^n\left(\breve{y}_k-\breve{y}_{k-1}\right)^2+\left(\breve{y}_n-\breve{y}_1\right)^2\right]
where \breve{y}_k = y_k/\pi_k
is the weighted value of unit k
with selection probability \pi_k
. The SD1 estimator is recommended by Wolter (1984).
The SD2 estimator is the basis of the successive difference replication estimator commonly
used for systematic sampling designs and is more conservative. See Ash (2014) for details.
For multistage samples, SD1 and SD2 are applied to the clusters at each stage, separately by stratum.
For later stages of sampling, the variance estimate from a stratum is multiplied by the product
of sampling fractions from earlier stages of sampling. For example, at a third stage of sampling,
the variance estimate from a third-stage stratum is multiplied by \frac{n_1}{N_1}\frac{n_2}{N_2}
,
which is the product of sampling fractions from the first-stage stratum and second-stage stratum.
The "Beaumont-Emond" variance estimator was proposed by Beaumont and Emond (2022), intended for designs that use fixed-size, unequal-probability random sampling without replacement. The variance estimator is simply the Horvitz-Thompson variance estimator with the following approximation for the joint inclusion probabilities.
\pi_{kl} \approx \pi_k \pi_l \frac{n - 1}{(n-1) + \sqrt{(1-\pi_k)(1-\pi_l)}}
In the case of cluster sampling, this approximation is
applied to the clusters rather than the units within clusters,
with n
denoting the number of sampled clusters. and the probabilities \pi
referring to the cluster's sampling probability. For stratified samples,
the joint probability for units k
and l
in different strata
is simply the product of \pi_k
and \pi_l
.
For multistage samples, this approximation is applied to the clusters at each stage, separately by stratum.
For later stages of sampling, the variance estimate from a stratum is multiplied by the product
of sampling probabilities from earlier stages of sampling. For example, at a third stage of sampling,
the variance estimate from a third-stage stratum is multiplied by \pi_1 \times \pi_{(2 | 1)}
,
where \pi_1
is the sampling probability of the first-stage unit
and \pi_{(2|1)}
is the sampling probability of the second-stage unit
within the first-stage unit.
The "Deville-1" and "Deville-2" variance estimators are clearly described in Matei and Tillé (2005), and are intended for designs that use fixed-size, unequal-probability random sampling without replacement. These variance estimators have been shown to be effective for designs that use a fixed sample size with a high-entropy sampling method. This includes most PPSWOR sampling methods, but unequal-probability systematic sampling is an important exception.
These variance estimators take the following form:
\hat{v}(\hat{Y}) = \sum_{i=1}^{n} c_i (\breve{y}_i - \frac{1}{\sum_{i=k}^{n}c_k}\sum_{k=1}^{n}c_k \breve{y}_k)^2
where \breve{y}_i = y_i/\pi_i
is the weighted value of the the variable of interest,
and c_i
depend on the method used:
"Deville-1":
c_i=\left(1-\pi_i\right) \frac{n}{n-1}
"Deville-2":
c_i = (1-\pi_i) \left[1 - \sum_{k=1}^{n} \left(\frac{1-\pi_k}{\sum_{k=1}^{n}(1-\pi_k)}\right)^2 \right]^{-1}
In the case of simple random sampling without replacement (SRSWOR), these estimators are both identical to the usual stratified multistage SRS estimator (which is itself a special case of the Horvitz-Thompson estimator).
For multistage samples, "Deville-1" and "Deville-2" are applied to the clusters at each stage, separately by stratum.
For later stages of sampling, the variance estimate from a stratum is multiplied by the product
of sampling probabilities from earlier stages of sampling. For example, at a third stage of sampling,
the variance estimate from a third-stage stratum is multiplied by \pi_1 \times \pi_{(2 | 1)}
,
where \pi_1
is the sampling probability of the first-stage unit
and \pi_{(2|1)}
is the sampling probability of the second-stage unit
within the first-stage unit.
This kernel-based variance estimator was proposed by Breidt, Opsomer, and Sanchez-Borrego (2016), for use with samples selected using systematic sampling or where only a single sampling unit is selected from each stratum (sometimes referred to as "fine stratification").
Suppose there are n
sampled units, and
for each unit i
there is a numeric population characteristic x_i
and there is a weighted total \hat{Y}_i
, where
\hat{Y}_i
is only observed in the selected sample but x_i
is known prior to sampling.
The variance estimator has the following form:
\hat{V}_{ker}=\frac{1}{C_d} \sum_{i=1}^n (\hat{Y}_i-\sum_{j=1}^n d_j(i) \hat{Y}_j)^2
The terms d_j(i)
are kernel weights given by
d_j(i)=\frac{K(\frac{x_i-x_j}{h})}{\sum_{j=1}^n K(\frac{x_i-x_j}{h})}
where K(\cdot)
is a symmetric, bounded kernel function
and h
is a bandwidth parameter. The normalizing constant C_d
is computed as:
C_d=\frac{1}{n} \sum_{i=1}^n(1-2 d_i(i)+\sum_{j=1}^H d_j^2(i))
For most functions in the 'svrep' package, the kernel function
is the Epanechnikov kernel and the bandwidth is automatically selected
to yield the smallest possible nonempty kernel window, as was recommended
by Breidt, Opsomer, and Sanchez-Borrego (2016). That's the case for
the functions as_fays_gen_rep_design()
, as_gen_boot_design()
,
make_quad_form_matrix()
, etc. However, users can construct the quadratic
form matrix of this variance estimator using a different kernel and a different bandwidth
by directly working with the function make_kernel_var_matrix()
.
See Section 6.8 of Tillé (2020) for more detail on this estimator, including an explanation of its quadratic form. See Deville and Tillé (2005) for the results of a simulation study comparing this and other alternative estimators for balanced sampling.
The estimator can be written as follows:
v(\hat{Y})=\sum_{k \in S} \frac{c_k}{\pi_k^2}\left(y_k-\hat{y}_k^*\right)^2,
where
\hat{y}_k^*=\mathbf{z}_k^{\top}\left(\sum_{\ell \in S} c_{\ell} \frac{\mathbf{z}_{\ell} \mathbf{z}_{\ell}^{\prime}}{\pi_{\ell}^2}\right)^{-1} \sum_{\ell \in S} c_{\ell} \frac{\mathbf{z}_{\ell} y_{\ell}}{\pi_{\ell}^2}
and \mathbf{z}_k
denotes the vector of auxiliary variables for observation k
included in sample S
, with inclusion probability \pi_k
. The value c_k
is set to \frac{n}{n-q}(1-\pi_k)
,
where n
is the number of observations and q
is the number of auxiliary variables.
Ash, S. (2014). "Using successive difference replication for estimating variances." Survey Methodology, Statistics Canada, 40(1), 47–59.
Beaumont, J.-F.; Émond, N. (2022). "A Bootstrap Variance Estimation Method for Multistage Sampling and Two-Phase Sampling When Poisson Sampling Is Used at the Second Phase." Stats, 5: 339–357. https://doi.org/10.3390/stats5020019
Bellhouse, D.R. (1985). "Computing Methods for Variance Estimation in Complex Surveys." Journal of Official Statistics, Vol.1, No.3.
Breidt, F. J., Opsomer, J. D., & Sanchez-Borrego, I. (2016). "Nonparametric Variance Estimation Under Fine Stratification: An Alternative to Collapsed Strata." Journal of the American Statistical Association, 111(514), 822–833. https://doi.org/10.1080/01621459.2015.1058264
Deville, J.‐C., and Tillé, Y. (2005). "Variance approximation under balanced sampling." Journal of Statistical Planning and Inference, 128, 569–591.
Tillé, Y. (2020). "Sampling and estimation from finite populations." (I. Hekimi, Trans.). Wiley.
Matei, Alina, and Yves Tillé. (2005). “Evaluation of Variance Approximations and Estimators in Maximum Entropy Sampling with Unequal Probability and Fixed Sample Size.” Journal of Official Statistics, 21(4):543–70.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.