varDT: Variance approximation with Deville-Tillé (2005) formula

Description Usage Arguments Details Value Difference with varest from package sampling Author(s) References See Also Examples

View source: R/variance_function.R

Description

varDT estimates the variance of the estimator of a total in the case of a balanced sampling design with equal or unequal probabilities using Deville-Tillé (2005) formula. Without balancing variables, it falls back to Deville's (1993) classical approximation. Without balancing variables and with equal probabilities, it falls back to the classical Horvitz-Thompson variance estimator for the total in the case of simple random sampling. Stratification is natively supported.

var_srs is a convenience wrapper for the (stratified) simple random sampling case.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
varDT(
  y = NULL,
  pik,
  x = NULL,
  strata = NULL,
  w = NULL,
  precalc = NULL,
  id = NULL
)

var_srs(y, pik, strata = NULL, w = NULL, precalc = NULL)

Arguments

y

A (sparse) numerical matrix of the variable(s) whose variance of their total is to be estimated.

pik

A numerical vector of first-order inclusion probabilities.

x

An optional (sparse) numerical matrix of balancing variable(s).

strata

An optional categorical vector (factor or character) when variance estimation is to be conducted within strata.

w

An optional numerical vector of row weights (see Details).

precalc

A list of pre-calculated results (see Details).

id

A vector of identifiers of the units used in the calculation. Useful when precalc = TRUE in order to assess whether the ordering of the y data matrix matches the one used at the pre-calculation step.

Details

varDT aims at being the workhorse of most variance estimation conducted with the gustave package. It may be used to estimate the variance of the estimator of a total in the case of (stratified) simple random sampling, (stratified) unequal probability sampling and (stratified) balanced sampling. The native integration of stratification based on Matrix::TsparseMatrix allows for significant performance gains compared to higher level vectorizations (*apply especially).

Several time-consuming operations (e.g. collinearity-check, matrix inversion) can be pre-calculated in order to speed up the estimation at execution time. This is determined by the value of the parameters y and precalc:

w is a row weight used at the final summation step. It is useful when varDT or var_srs are used on the second stage of a two-stage sampling design applying the Rao (1975) formula.

Value

Difference with varest from package sampling

varDT differs from sampling::varest in several ways:

Author(s)

Martin Chevalier

References

Caron N. (1998), "Le logiciel Poulpe : aspects méthodologiques", Actes des Journées de méthodologie statistique http://jms-insee.fr/jms1998s03_1/ Deville, J.-C. (1993), Estimation de la variance pour les enquêtes en deux phases, Manuscript, INSEE, Paris.

Deville, J.-C., Tillé, Y. (2005), "Variance approximation under balanced sampling", Journal of Statistical Planning and Inference, 128, issue 2 569-591

Rao, J.N.K (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37

See Also

res_cal

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
library(sampling)
set.seed(1)

# Simple random sampling case
N <- 1000
n <- 100
y <- rnorm(N)[as.logical(srswor(n, N))]
pik <- rep(n/N, n)
varDT(y, pik)
sampling::varest(y, pik = pik)
N^2 * (1 - n/N) * var(y) / n

# Unequal probability sampling case
N <- 1000
n <- 100
pik <- runif(N)
s <- as.logical(UPsystematic(pik))
y <- rnorm(N)[s]
pik <- pik[s]
varDT(y, pik)
varest(y, pik = pik)
# The small difference is expected (see Details).

# Balanced sampling case
N <- 1000
n <- 100
pik <- runif(N)
x <- matrix(rnorm(N*3), ncol = 3)
s <- as.logical(samplecube(x, pik))
y <- rnorm(N)[s]
pik <- pik[s]
x <- x[s, ]
varDT(y, pik, x)

# Balanced sampling case (variable of interest
# among the balancing variables)
N <- 1000
n <- 100
pik <- runif(N)
y <- rnorm(N)
x <- cbind(matrix(rnorm(N*3), ncol = 3), y)
s <- as.logical(samplecube(x, pik))
y <- y[s]
pik <- pik[s]
x <- x[s, ]
varDT(y, pik, x)
# As expected, the total of the variable of interest is perfectly estimated.

# strata argument
n <- 100
H <- 2
pik <- runif(n)
y <- rnorm(n)
strata <- letters[sample.int(H, n, replace = TRUE)]
all.equal(
 varDT(y, pik, strata = strata),
 varDT(y[strata == "a"], pik[strata == "a"]) + varDT(y[strata == "b"], pik[strata == "b"])
)

# precalc argument
n <- 1000
H <- 50
pik <- runif(n)
y <- rnorm(n)
strata <- sample.int(H, n, replace = TRUE)
precalc <- varDT(y = NULL, pik, strata = strata)
identical(
 varDT(y, precalc = precalc),
 varDT(y, pik, strata = strata)
)

gustave documentation built on Nov. 10, 2021, 5:08 p.m.