options(width=400)
knitr::opts_chunk$set(echo = TRUE, fig.width=6, fig.height=4)
# Stop with installation instructions if ReGenesees is not available
if (requireNamespace("ReGenesees", quietly = TRUE)) {
  svystat <- ReGenesees::svystat
} else {
  stop("The package ReGenesees is needed. \nInstall it by executing the following: \ndevtools::install_github('DiegoZardetto/ReGenesees')")
}
library(ReGenesees)
library(R2BEAT)
#library(plyr)
library(sampling)
options(warn=-1)
options(scipen=9999)
This vignette describes a general workflow based on the methods implemented in R2BEAT ("Multistage Sampling Allocation and PSU selection"), an R package developed at the Italian National Institute of Statistics (Istat).
The package determines the optimal allocation of both Primary Stage Units (PSUs) and Secondary Stage Units (SSUs), and selects the PSUs so that the final sample of SSUs is self-weighting, i.e. the total inclusion probabilities (the product of the inclusion probabilities of the PSUs and those of the SSUs) are nearly equal for all SSUs, or at least have minimum variability.
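To make the self-weighting idea concrete, here is a minimal base-R sketch, not taken from R2BEAT. It assumes PSUs are selected with probability proportional to size and that a fixed number of SSUs is then drawn by simple random sampling in each selected PSU. All sizes and sample counts are hypothetical.

```r
# Toy frame: five PSUs with different numbers of SSUs (hypothetical values)
psu_size <- c(A = 200, B = 500, C = 300, D = 800, E = 200)
n_psu <- 2    # number of PSUs to select (assumed)
m_ssu <- 50   # number of SSUs to select in each sampled PSU (assumed)

# First stage: inclusion probability proportional to PSU size
pi_psu <- n_psu * psu_size / sum(psu_size)

# Second stage: simple random sampling of m_ssu units inside the PSU
pi_ssu <- m_ssu / psu_size

# Total inclusion probability of an SSU is the product of the two stages
pi_tot <- pi_psu * pi_ssu

# The design weight is the inverse of the total inclusion probability
weight <- 1 / pi_tot
print(weight)
```

Because the PSU size cancels out in the product, every SSU ends up with the same weight, whichever PSU it belongs to.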
This workflow assumes that at least one previous round of the survey whose sampling design is to be optimized is available. It consists of the following steps:
Define the sample design externally, and possibly the calibration step, using the R package ReGenesees, and make the design object and the calibrated object available.
The workspace to be loaded (R2BEAT_ReGenesees.RData) is available at the link:
https://github.com/barcaroli/R2BEAT/tree/master/data
load("R2BEAT_ReGenesees.RData") # ReGenesees design object
This is the 'design' object:
des
and this is the calibrated object:
cal
It is advisable to check the presence of lonely strata:
# Check for strata with fewer than two units
ls <- find.lon.strata(des)
If any lonely strata are found, collapse them with other strata and redo the calibration.
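As an illustration of what a lonely-stratum check and a collapse amount to, the following base-R sketch counts distinct PSUs per stratum on a toy data frame and merges the lonely stratum into a neighbouring one. It is not the implementation of 'find.lon.strata'; the data, the column names and the choice of "S3" as the receiving stratum are all hypothetical.

```r
# Toy sample: one row per sampled unit (hypothetical data)
toy <- data.frame(
  stratum = c("S1", "S1", "S1", "S2", "S3", "S3"),
  psu     = c("p1", "p1", "p2", "p3", "p4", "p5"),
  stringsAsFactors = FALSE
)

# Number of distinct PSUs in each stratum
psu_per_stratum <- tapply(toy$psu, toy$stratum, function(x) length(unique(x)))

# A stratum is "lonely" when it contains fewer than two PSUs
lonely <- names(psu_per_stratum)[psu_per_stratum < 2]
print(lonely)

# Collapse: recode each lonely stratum into a neighbouring one (assumed "S3")
toy$stratum_collapsed <- ifelse(toy$stratum %in% lonely, "S3", toy$stratum)
psu_after <- tapply(toy$psu, toy$stratum_collapsed,
                    function(x) length(unique(x)))
print(psu_after)
```

In a real survey the receiving stratum would be chosen on substantive grounds, and the calibration would then be rerun on the collapsed design.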
In this example, in the ReGenesees objects there are the following variables:
str(des$variables)
where there are three potential target variables:
summary(des$variables$income_hh)
table(des$variables$work)
table(des$variables$unemployed)
Great attention must be paid to the nature of the target variables, especially those of 'factor' type. The procedure illustrated here is suitable only when categorical variables are binary with values 0 and 1, on the assumption that we want to estimate the proportion of '1's in the population. If a factor variable has any other form, an error message is printed.
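A check of this kind can be run before calling the package functions. The sketch below is a base-R illustration on toy data; the helper 'is_binary01' and the level names of 'work' are hypothetical and not part of R2BEAT or ReGenesees.

```r
# Hypothetical helper: TRUE when all non-missing values are 0 or 1
is_binary01 <- function(x) {
  # Compare as character so that factors and numerics are treated alike
  vals <- unique(as.character(x[!is.na(x)]))
  length(vals) > 0 && all(vals %in% c("0", "1"))
}

# Toy data with one continuous and two categorical variables
toy <- data.frame(
  income_hh  = c(1200.5, 800, 2300),
  unemployed = factor(c(0, 1, 0)),
  work       = factor(c("employed", "inactive", "employed"))
)

# Test every factor column
categorical <- names(toy)[sapply(toy, is.factor)]
ok <- sapply(toy[categorical], is_binary01)
print(ok)

# A multi-level factor can be recoded into a 0/1 indicator first
toy$work_01 <- factor(as.integer(toy$work == "employed"))
print(is_binary01(toy$work_01))
```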
Using the ReGenesees objects as input, produce the following dataframes (function 'prepareInputToAllocation2', as called in the code below):
a) the 'stratif' dataframe containing:
b) the 'deff' (design effect) dataframe, containing the following information:
c) the 'effst' (estimator effect) dataframe, containing the following information:
d) the 'rho' (intraclass coefficient of correlation) dataframe, containing the following information:
The 'deff' dataframe is not used in the following steps; it is kept for documentation purposes only.
Here is the way we can produce the above items:
load("pop.RData")
samp_frame <- pop
RGdes <- des
RGcal <- cal
strata_var <- c("stratum")
target_vars <- c("income_hh", "active", "inactive", "unemployed")
weight_var <- "weight"
deff_var <- "stratum"
id_PSU <- c("municipality")
id_SSU <- c("id_hh")
domain_var <- c("region")
delta <- 1
minimum <- 25
inp <- prepareInputToAllocation2(
  samp_frame,   # sampling frame
  RGdes,        # ReGenesees design object
  RGcal,        # ReGenesees calibrated object
  id_PSU,       # identification variable of PSUs
  id_SSU,       # identification variable of SSUs
  strata_var,   # strata variables
  target_vars,  # target variables
  deff_var,     # deff variables
  domain_var,   # domain variables
  delta,        # Average number of SSUs for each selection unit
  minimum       # Minimum number of SSUs to be selected in each PSU
)
and these are the results:
head(inp$strata)
head(inp$deff)
head(inp$effst)
head(inp$rho)
head(inp$psu_file)
head(inp$des_file)
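The design effect and the intraclass correlation coefficient are closely related. As a rough guide to reading the 'deff' and 'rho' tables, the sketch below uses the textbook Kish approximation deff = 1 + rho (b - 1), where b is the average number of SSUs sampled per PSU. This is an assumption made for illustration and not necessarily the exact computation performed by the package; the function names and all numeric values are hypothetical.

```r
# Kish approximation of the design effect of a cluster sample
kish_deff <- function(rho, b) 1 + rho * (b - 1)

# Inverse relation: intraclass correlation implied by a design effect
rho_from_deff <- function(deff, b) (deff - 1) / (b - 1)

b   <- 25                                        # average SSUs per PSU (assumed)
rho <- c(income_hh = 0.05, unemployed = 0.01)    # hypothetical values

deff <- kish_deff(rho, b)
print(deff)

# Going back from deff to rho recovers the starting values
print(rho_from_deff(deff, b))
```

Even a small intraclass correlation inflates the variance noticeably when many SSUs are taken from the same PSU, which is why the allocation needs both quantities.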
It may happen that the population in the strata (variable 'N' in the 'inp$strata' dataset) and the one derived from the PSU dataset (variable 'STRAT_MOS' in the 'inp$des_file' dataset) are not the same.
We can check it by applying the function 'check_input' in this way:
newstrata <- check_input(strata=inp$strata, des=inp$des_file, strata_var_strata="STRATUM", strata_var_des="STRATUM")
Besides printing the differences between the two populations, the function produces a new version of the strata dataset, in which the population has been replaced by the one derived from the PSU dataset.
It is preferable to use this new version:
inp$strata <- newstrata
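The kind of comparison involved can be sketched in base R on two toy tables. This is only an illustration of the idea, not the implementation of 'check_input'; the tables are hypothetical and reuse the column names 'STRATUM', 'N' and 'STRAT_MOS' mentioned above.

```r
# Toy stratum populations from the strata dataset (hypothetical values)
strata <- data.frame(STRATUM = c("1", "2", "3"),
                     N = c(1000, 2500, 400),
                     stringsAsFactors = FALSE)

# Toy stratum populations derived from the PSU dataset (hypothetical values)
des_file <- data.frame(STRATUM = c("1", "2", "3"),
                       STRAT_MOS = c(1000, 2480, 400),
                       stringsAsFactors = FALSE)

# Put the two populations side by side and compute the difference
cmp <- merge(strata, des_file, by = "STRATUM")
cmp$diff <- cmp$N - cmp$STRAT_MOS
print(cmp[cmp$diff != 0, ])

# New strata table whose population is the one derived from the PSUs
strata_fixed <- strata
strata_fixed$N <- cmp$STRAT_MOS[match(strata_fixed$STRATUM, cmp$STRATUM)]
print(strata_fixed)
```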
Using the function 'beat.2st' in the 'R2BEAT' package, optimize the allocation of PSUs and SSUs in the strata:
cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
                         CV1=c(0.02,0.03),
                         CV2=c(0.03,0.05),
                         CV3=c(0.03,0.05),
                         CV4=c(0.05,0.08)))
cv
set.seed(1234)
minPSUstrat <- 2
inp$des_file$MINIMUM <- 25
alloc <- beat.2st(stratif = inp$strata,
                  errors = cv,
                  des_file = inp$des_file,
                  psu_file = inp$psu_file,
                  rho = inp$rho,
                  deft_start = NULL,
                  effst = inp$effst,
                  minnumstrat = 2,
                  minPSUstrat)
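The constraints table has one row per domain level and one CV column per target variable. The sketch below rebuilds the same table from a named list so that each maximum CV is visibly tied to a variable. It assumes that the columns CV1 to CV4 follow the order of 'target_vars', which the text does not state explicitly; the list 'cv_by_var' is a hypothetical helper object.

```r
target_vars <- c("income_hh", "active", "inactive", "unemployed")

# Maximum CV for each variable, one value per domain level (DOM1, DOM2)
cv_by_var <- list(
  income_hh  = c(0.02, 0.03),
  active     = c(0.03, 0.05),
  inactive   = c(0.03, 0.05),
  unemployed = c(0.05, 0.08)
)

# Build the table column by column, in the order of target_vars (assumed)
cv <- data.frame(DOM = c("DOM1", "DOM2"), stringsAsFactors = FALSE)
for (i in seq_along(target_vars)) {
  cv[[paste0("CV", i)]] <- cv_by_var[[target_vars[i]]]
}
print(cv)

# Basic shape check: one CV column per target variable
stopifnot(ncol(cv) - 1 == length(target_vars))
```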
This is the sensitivity of the solution:
alloc$sensitivity
i.e., for each domain value and each variable, it reports the reduction in sample size that would be obtained if the corresponding precision constraint were relaxed by 10%.
These are the expected values of the coefficients of variation:
alloc$expected
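To give a feel for what an expected coefficient of variation measures, the sketch below computes the CV of an estimated mean under simple random sampling, inflated by a design effect. It uses the textbook formula and is not necessarily the computation carried out by 'beat.2st'; the function 'expected_cv' and all input values are hypothetical.

```r
# CV of an estimated mean: SRS variance with finite population correction,
# multiplied by a design effect, divided by the mean
expected_cv <- function(mean_y, sd_y, n, N, deff = 1) {
  var_srs <- (1 - n / N) * sd_y^2 / n
  sqrt(deff * var_srs) / mean_y
}

# Same population and design effect, two different sample sizes
cv_small <- expected_cv(mean_y = 20000, sd_y = 15000, n = 1000, N = 1e6, deff = 2)
cv_large <- expected_cv(mean_y = 20000, sd_y = 15000, n = 4000, N = 1e6, deff = 2)
print(c(n_1000 = cv_small, n_4000 = cv_large))
```

Quadrupling the sample size roughly halves the CV, which is the trade-off the allocation balances against the precision constraints.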
Using the function 'select_PSU', select the PSUs in the strata:
set.seed(1234)
sample_1st <- select_PSU(alloc, type="ALLOC", pps=TRUE, plot=TRUE)
This is the overall sample design:
sample_1st$PSU_stats
Finally, we can select the Secondary Stage Units (the individuals) from the PSUs already selected (the municipalities). We select the sample in this way:
samp <- select_SSU(df=pop, PSU_code="municipality", SSU_code="id_ind", PSU_sampled=sample_1st$sample_PSU)
We can compare the total number of selected units with the initial allocation:
nrow(samp)
sum(alloc$alloc$ALLOC[-nrow(alloc$alloc)])
The difference arises because the constraint on the minimum number of SSUs to be selected per PSU has been enforced, which increases the number of SSUs with respect to the optimal allocation.
We also check that the sum of the weights equals the population size:
nrow(pop)
sum(samp$weight)
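The effect of a minimum per PSU can be illustrated with a few numbers. The sketch below is a simplification that raises every PSU-level sample size to the minimum; it is not the selection algorithm used by the package, and the allocation values are hypothetical.

```r
# Hypothetical optimal number of SSUs in four sampled PSUs
optimal_ssu <- c(psu1 = 40, psu2 = 12, psu3 = 30, psu4 = 8)
minimum <- 25   # minimum number of SSUs per PSU, as set earlier

# Enforce the minimum: PSUs below it are raised, the others are unchanged
enforced_ssu <- pmax(optimal_ssu, minimum)

print(sum(optimal_ssu))    # total under the optimal allocation
print(sum(enforced_ssu))   # total after enforcing the minimum
```

Only the PSUs below the minimum contribute to the increase, so the gap grows with the number of small PSUs in the sample.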
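The same check can be reproduced on a toy design to see why it holds. The sketch below assumes a stratified sample in which every unit of a stratum carries the weight N_h / n_h; the stratum sizes are hypothetical.

```r
# Hypothetical stratum populations and sample sizes
N_h <- c(s1 = 1000, s2 = 2500, s3 = 400)
n_h <- c(s1 = 50,   s2 = 100,  s3 = 40)

# Design weight of each sampled unit: inverse of the sampling fraction
w_h <- N_h / n_h

# One weight per sampled unit
weights <- rep(w_h, times = n_h)

# The weights add up to the population size
print(sum(weights))
print(sum(N_h))
```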
This is the distribution of weights:
par(mfrow=c(1, 2))
boxplot(samp$weight,col="orange")
title("Weights distribution (total sample)",cex.main=0.7)
boxplot(weight ~ region, data=samp,col="orange")
title("Weights distribution by region",cex.main=0.7)
boxplot(weight ~ province, data=samp,col="orange")
title("Weights distribution by province",cex.main=0.7)
boxplot(weight ~ stratum, data=samp,col="orange")
title("Weights distribution by stratum",cex.main=0.7)
It can be seen that the sample is fully self-weighting within strata, and approximately self-weighting in aggregations of strata, which is the result we wanted to obtain.
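A numerical counterpart to the boxplots is the coefficient of variation of the weights within each group: zero means the group is exactly self-weighting. The sketch below shows the computation on toy data; the data frame, its values and the helper 'weight_cv' are hypothetical.

```r
# Toy sample: weights are constant within strata, not within regions
toy <- data.frame(
  stratum = rep(c("s1", "s2", "s3"), times = c(4, 3, 3)),
  region  = rep(c("north", "north", "south"), times = c(4, 3, 3)),
  weight  = c(rep(20, 4), rep(22, 3), rep(10, 3)),
  stringsAsFactors = FALSE
)

# Relative variability of the weights inside a group
weight_cv <- function(w) if (length(w) > 1) sd(w) / mean(w) else 0

cv_by_stratum <- tapply(toy$weight, toy$stratum, weight_cv)
cv_by_region  <- tapply(toy$weight, toy$region, weight_cv)

print(cv_by_stratum)   # zero in every stratum
print(cv_by_region)    # small but positive where strata are aggregated
```

The same two 'tapply' calls could be applied to a real sample, assuming it has columns named 'weight', 'stratum' and 'region' as in the plots above.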