if(!exists(".initOK")) source("fonctions.R")
gustave
package {.unnumbered}\vspace{-0.5cm} Variance estimation is becoming a \strong{more and more important part} of the survey production process:
\pause \bigskip It is also a \strong{complex issue} from a \strong{technical} and \strong{organizational} point of view.
\pause \bigskip This talk:
gustave
package\large \tableofcontents[sectionstyle=show/show, subsectionstyle=hide/hide/hide]
Sampling design is the first source of complexity for exact analytical variance estimation:
\pause \bigskip \strong{Remarks}
In order to be exact, variance estimation should also take into account the estimation methodology implemented:
\strong{non-response correction}: the unit non-response process can be described as an additional Poisson sampling phase
\strong{calibration on margins}: calibration on margins yields significant variance reduction provided the auxiliary variables are well correlated with the variables of interest
\pause \bigskip \strong{Remark} Depending on the imputation methodology, \strong{item non-response} can also be taken into account in the variance estimation process.
A final source of complexity comes from the indicators whose variance is to be estimated:
\strong{non-linear indicators}: mean, ratio, quantiles, Gini coefficient, At-risk of poverty rate, etc. $\rightarrow$ \strong{linearization techniques}
\strong{domain estimations}: regional estimates
\pause \strong{Remark} Variance estimation with small area estimation methodology is beyond the scope of this talk.
\pause \bigskip \strong{New challenges}: in order to be used in practice, variance estimation programs should be
gustave
package\strong{G}ustave: a \strong{U}ser-oriented \strong{S}tatistical \strong{T}oolkit for \strong{A}nalytical \strong{V}ariance \strong{E}stimation
\pause \bigskip R package currently available on GitHub: \url{https://github.com/martinchevalier/gustave}
\bigskip \pause \strong{Goal} Make variance estimation as easy as possible in providing the methodologist with dedicated tools.
The French EU-SILC:
\bigskip \pause Once processed by the gustave
package: \strong{one single standalone data file per year}.
load("output_for_beamer.RData")
\footnotesize
# Load the standalone .RData file load("silc_bpw/precisionSilc14.RData") # Note : the 4 data files of the survey are loaded # together with the precisionSilc() function # Mean equivalised disposable income per household precisionSilc(h, HX090)
\pause
o1
# The variance estimation takes about 1 second.
\footnotesize
# Share of households below a given (absolute) threshold # (see below for relative threshold e.g. arpr) precisionSilc(h, HX090 < 20000) # Work intensity precisionSilc(r, as.factor(RX050))
\pause
options(width = 70) o3[, 1:4] options(width = 55)
\pause
# Number of people with severe material deprivation precisionSilc(r, total(RX060 == 1)) # Ratio of equivalised disposable income # per person in the household precisionSilc(h, ratio(HX090, HX040))
\pause \footnotesize
# Domain estimation with where (FR10 : Paris area) precisionSilc(h, HX090, where = DB040 == "FR10")
o6
\pause
# Domain estimation with by precisionSilc(h, HX090, by = DB040)
o7[1:5, 1:4]
vardpoor
package\footnotesize
# Installation of the vardpoor package install.packages("vardpoor") # Gini coefficient precisionSilc(r, gini(HX090))
\vspace{-0.5cm}
o8
\pause
# At-risk of poverty rate precisionSilc(r, arpr(HX090))
\vspace{-0.5cm}
o9
gustave
package starter's guideThe aim of the gustave
package is to hide the complexity of the variance estimation process from the end user.
\pause \bigskip \strong{Basic idea} "Wrapping" the complex variance estimation function within a variance estimation wrapper.
library(gustave) N <- 5000000 n <- 2000 H <- 10 size <- rep(n/H, H) # Randomly generating a sampling frame and first-order # probabilities of inclusion set.seed(1) frame <- data.frame( id = 1:N , stratum = sample.int(H, size = N, replace = TRUE) , auxiliary_var = 10 + rnorm(N) ) # frame$pik <- as.vector(frame$pik * (size / tapply(frame$pik, frame$stratum, sum))[frame$stratum]) # Drawing the sample with sampling::UPmaxentropy() library(sampling) sample <- frame[sampling::strata( data = frame, size = size, stratanames = "stratum" , method = "srswor" )$ID_unit, ] sample$pik <- 1 / rep(N/n, n) sample$d <- 1 / sample$pik sample$w <- sample$d * calib(sample$auxiliary_var, d = sample$d, sum(frame$auxiliary_var), method = "raking") # Randomly generating the data that would have been collected # on the field and storing it in the "survey" data.frame survey <- cbind( sample[, c("id", "w", "auxiliary_var")] , data.frame( var1 = sample$auxiliary_var + rnorm(n)*2 , var2 = letters[sample.int(3, n, replace = TRUE)] )) # Sorting the "sample" and "survey" objects by id sample <- sample[order(sample$id), ] survey <- survey[order(survey$id), ]
\strong{Remark} No unit non-response for this example.
\pause \footnotesize
names(sample) names(survey)
\footnotesize
# Install the gustave package from GitHub install.packages("devtools") devtools::install_github("martinchevalier/gustave") library(gustave)
\pause
devtools::install_github("martinchevalier/gustave", ref = "Consolidation-and-documentation")
library(gustave)
# Define the variance function variance_function <- function(y){ # Taking calibration into account e <- rescal(y, sample$auxiliary_var) # Variance estimation var <- varDT( y = e, pik = sample$pik, strata = sample$stratum ) return(var) } variance_function(survey$var1)
\footnotesize
# Define the variance wrapper variance_wrapper <- define_variance_wrapper( variance_function = variance_function , reference_id = sample$id , default = list(id = "id", weight = "w") , objects_to_include = "sample" ) # Export the survey data and the variance wrapper # for later use or dissemination save( survey, variance_wrapper , file = "precisionSurvey.RData" )
# Removing all files except the survey data # and the variance wrapper rm(list = ls())
\footnotesize
# Load the .RData (for instance in an empty environment) load("precisionSurvey.RData") ls() # Use the variance wrapper variance_wrapper(survey, var1)[, 1:3] variance_wrapper(survey, mean(var1), by = var2)
gustave
package at InseeThe gustave
package is currently \strong{used in production} at Insee for the variance estimation of: EU-SILC, LFS, the French victimation survey (and AES by the end of 2017).
\pause \bigskip The programs it produces are used by survey specialists with no expertise in variance estimation nor in the R software.
\pause \bigskip The Department of statistical methods sees the gustave
package as an \strong{investment}:
vardpoor
packageThe vardpoor
package is developed and maintained by the Central Statistical Bureau of Latvia (presented in the 2015 EU-SILC Best Practices Workshop in London).
\pause It is also dedicated to variance estimation, but with a different focus:
\pause The \strong{linearization functions provided by the \texttt{vardpoor} package can be used within a variance wrapper} produced by the gustave
package through a \strong{linearization wrapper}.
\bigskip \pause A version will be available on CRAN at the latest in august 2018 (available on GitHub as of now)
\bigskip \pause \strong{Key features before the first release}:
gustave
package {.unnumbered}Variance estimation in household surveys is a complex topic, which implies a \strong{strong collaboration between survey specialists and methodologists}.
\bigskip \pause The development of a unifying R package is a source of \strong{efficiency} and improves the \strong{quality} of the whole process.
\bigskip \pause It is an investment that also relies on other \strong{open-source initiatives}, e.g. the vardpoor
package.
\bigskip \pause It stimulates the use of softwares \strong{alternative to SAS} at Insee.
\begin{center} \large \strong{Thank your for your attention!} \ \bigskip \strong{Questions?}
\normalsize \bigskip \bigskip Martin Chevalier \ martin.chevalier@insee.fr \\url{https://github.com/martinchevalier/gustave} \end{center}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.