# varDT: Variance approximation with Deville-Tillé (2005) formula In gustave: A User-Oriented Statistical Toolkit for Analytical Variance Estimation

## Description

`varDT` estimates the variance of the estimator of a total in the case of a balanced sampling design with equal or unequal probabilities using Deville-Tillé (2005) formula. Without balancing variables, it falls back to Deville's (1993) classical approximation. Without balancing variables and with equal probabilities, it falls back to the classical Horvitz-Thompson variance estimator for the total in the case of simple random sampling. Stratification is natively supported.

`var_srs` is a convenience wrapper for the (stratified) simple random sampling case.

## Usage

 ``` 1 2 3 4 5 6 7 8 9 10 11``` ```varDT( y = NULL, pik, x = NULL, strata = NULL, w = NULL, precalc = NULL, id = NULL ) var_srs(y, pik, strata = NULL, w = NULL, precalc = NULL) ```

## Arguments

 `y` A (sparse) numerical matrix of the variable(s) whose variance of their total is to be estimated. `pik` A numerical vector of first-order inclusion probabilities. `x` An optional (sparse) numerical matrix of balancing variable(s). `strata` An optional categorical vector (factor or character) when variance estimation is to be conducted within strata. `w` An optional numerical vector of row weights (see Details). `precalc` A list of pre-calculated results (see Details). `id` A vector of identifiers of the units used in the calculation. Useful when `precalc = TRUE` in order to assess whether the ordering of the `y` data matrix matches the one used at the pre-calculation step.

## Details

`varDT` aims at being the workhorse of most variance estimation conducted with the `gustave` package. It may be used to estimate the variance of the estimator of a total in the case of (stratified) simple random sampling, (stratified) unequal probability sampling and (stratified) balanced sampling. The native integration of stratification based on Matrix::TsparseMatrix allows for significant performance gains compared to higher level vectorizations (`*apply` especially).

Several time-consuming operations (e.g. collinearity-check, matrix inversion) can be pre-calculated in order to speed up the estimation at execution time. This is determined by the value of the parameters `y` and `precalc`:

• if `y` not `NULL` and `precalc` `NULL` : on-the-fly calculation (no pre-calculation).

• if `y` `NULL` and `precalc` `NULL` : pre-calculation whose results are stored in a list of pre-calculated data.

• if `y` not `NULL` and `precalc` not `NULL` : calculation using the list of pre-calculated data.

`w` is a row weight used at the final summation step. It is useful when `varDT` or `var_srs` are used on the second stage of a two-stage sampling design applying the Rao (1975) formula.

## Value

• if `y` is not `NULL` (calculation step) : the estimated variances as a numerical vector of size the number of columns of y.

• if `y` is `NULL` (pre-calculation step) : a list containing pre-calculated data.

## Difference with `varest` from package `sampling`

`varDT` differs from `sampling::varest` in several ways:

• The formula implemented in `varDT` is more general and encompasses balanced sampling.

• Even in its reduced form (without balancing variables), the formula implemented in `varDT` slightly differs from the one implemented in `sampling::varest`. Caron (1998, pp. 178-179) compares the two estimators (`sampling::varest` implements V_2, `varDT` implements V_1).

• `varDT` introduces several optimizations:

• matrixwise operations allow to estimate variance on several interest variables at once

• Matrix::TsparseMatrix capability and the native integration of stratification yield significant performance gains.

• the ability to pre-calculate some time-consuming operations speeds up the estimation at execution time.

• `varDT` does not natively implements the calibration estimator (i.e. the sampling variance estimator that takes into account the effect of calibration). In the context of the `gustave` package, `res_cal` should be called before `varDT` in order to achieve the same result.

Martin Chevalier

## References

Caron N. (1998), "Le logiciel Poulpe : aspects méthodologiques", Actes des Journées de méthodologie statistique http://jms-insee.fr/jms1998s03_1/ Deville, J.-C. (1993), Estimation de la variance pour les enquêtes en deux phases, Manuscript, INSEE, Paris.

Deville, J.-C., Tillé, Y. (2005), "Variance approximation under balanced sampling", Journal of Statistical Planning and Inference, 128, issue 2 569-591

Rao, J.N.K (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37

`res_cal`
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70``` ```library(sampling) set.seed(1) # Simple random sampling case N <- 1000 n <- 100 y <- rnorm(N)[as.logical(srswor(n, N))] pik <- rep(n/N, n) varDT(y, pik) sampling::varest(y, pik = pik) N^2 * (1 - n/N) * var(y) / n # Unequal probability sampling case N <- 1000 n <- 100 pik <- runif(N) s <- as.logical(UPsystematic(pik)) y <- rnorm(N)[s] pik <- pik[s] varDT(y, pik) varest(y, pik = pik) # The small difference is expected (see Details). # Balanced sampling case N <- 1000 n <- 100 pik <- runif(N) x <- matrix(rnorm(N*3), ncol = 3) s <- as.logical(samplecube(x, pik)) y <- rnorm(N)[s] pik <- pik[s] x <- x[s, ] varDT(y, pik, x) # Balanced sampling case (variable of interest # among the balancing variables) N <- 1000 n <- 100 pik <- runif(N) y <- rnorm(N) x <- cbind(matrix(rnorm(N*3), ncol = 3), y) s <- as.logical(samplecube(x, pik)) y <- y[s] pik <- pik[s] x <- x[s, ] varDT(y, pik, x) # As expected, the total of the variable of interest is perfectly estimated. # strata argument n <- 100 H <- 2 pik <- runif(n) y <- rnorm(n) strata <- letters[sample.int(H, n, replace = TRUE)] all.equal( varDT(y, pik, strata = strata), varDT(y[strata == "a"], pik[strata == "a"]) + varDT(y[strata == "b"], pik[strata == "b"]) ) # precalc argument n <- 1000 H <- 50 pik <- runif(n) y <- rnorm(n) strata <- sample.int(H, n, replace = TRUE) precalc <- varDT(y = NULL, pik, strata = strata) identical( varDT(y, precalc = precalc), varDT(y, pik, strata = strata) ) ```