In barcaroli/R2BEATold: Multistage Samplign Design and PSUs selection

options(width=400)
knitr::opts_chunk$set(echo = TRUE, fig.width=6, fig.height=4)
# Install ReGenesees if not already installed
  if (requireNamespace("ReGenesees", quietly = TRUE)) {
    svystat    <- ReGenesees::svystat
  } else {
    stop("The package ReGenesees is needed. \nInstall it by executing the following: \ndevtools::install_github('DiegoZardetto/ReGenesees')")
  }
library(ReGenesees)
library(R2BEAT)
library(plyr)
library(sampling)
options(warn=-1)
options(scipen=9999)

This vignette describes a generalized procedure making use of the methods implemented in the R package developed in the Italian National Institute, namely R2BEAT ("Multistage Sampling Allocation and PSU selection").

This package allows to determine the optimal allocation of both Primary Stage Units (PSUs) and Secondary Stage Units (SSU), and also to perform a selection of the PSUs such that the final sample of SSU is of the self-weighting type, i.e. the total inclusion probabilities (as resulting from the product between the inclusion probabilities of the PSUs and those of the SSUs) are near equal for all SSUs.

This general flow assumes that at least a previous round of the survey, whose sampling design has to be optimized, is available, and is characterized by the following steps:

Use of ReGenesees

Perform externally the definition of the sample design, and possibly of the calibration step, using the R package ReGenesees, and make the design object and the calibrated object available.

load("R2BEAT_ReGenesees.RData")   # ReGenesees design object

This is the 'design' object:

des

and this is the calibrated object:

cal

It is advisable to check the presence of lonely strata:

# Control the presence of strata with less than two units
ls <- find.lon.strata(des)

In case, provide to collapse and re-do the calibration.

In this example, in the ReGenesees objects there are the following variables:

str(des$variables)

where there are three potential target variables:

income_hh (numeric);
work (factor with values 0, 1, 2);
unemployed (factor with values 0, 1).

summary(des$variables$income_hh)

table(des$variables$work)

table(des$variables$unemployed)

Great attention must be paid to the nature of the target variables, especially of the 'factor' type. In fact, the procedure here illustrated is suitable only when categorical variables are binary with values 0 and 1, supposing we are willing to estimate proportions of '1' in the population. If factor variables are of other nature, then an error message is printed.

Therefore, we have to handle the 'work' variable in this way: as values 0, 1 and 2 indicate respectively non labour force, active and inactive people, we can decide to derive from 'work' two binary variables, 'active' and 'inactive':

des<-des.addvars(des,active=factor(ifelse(work==1,1,0)))
des<-des.addvars(des,inactive=factor(ifelse(work==2,1,0)))
cal<-des.addvars(cal,active=factor(ifelse(work==1,1,0)))
cal<-des.addvars(cal,inactive=factor(ifelse(work==2,1,0)))

Now, all the categorical target variables are compliant to the binary constraint:

table(cal$variables$active)

table(cal$variables$inactive)

table(cal$variables$unemployed)

Build 'strata', 'deff', 'effst' and 'rho' dataframes

Using ReGenesees objects as input, produce the following dataframes (function 'input_to_beat.2st_1'):

a) the 'stratif' dataframe containing:

STRATUM: identifier of the single stratum
N: total population in terms of final sampling units
Mi,Si: mean and standard deviation of target variables (i=1,2,..,P)
DOMk: domain(s) to which the stratum belongs

b) the 'deff' (design effect) dataframe, containing the following information:

STRATUM: the stratum identifier
DEFFi: the design effect for each target variable i (i=1,2,...,P)

c) the 'effst' (estimator effect) dataframe, containing the following information:

STRATUM: the stratum identifier
EFFSTi: the estimator effect for each target variable i (i=1,2,...,P)

d) the 'rho' (intraclass coefficient of correlation) dataframe, containing the following information:

STRATUM: the stratum identifier
RHO_ARi: the intraclass coefficient of correlation in self-representative PSUs for each target variable i (i=1,2,...,P)
RHO_NARi: the intraclass coefficient of correlation in non self-representative PSUs for each target variable i (i=1,2,...,P)

Actually, the 'deff' dataframe is not used in the following steps, it just remains for documentation purposes.

Here is the way we can produce the above items:

RGdes <- des                           # ReGenesees design object
RGcal <- cal                           # ReGenesees calibrated object

strata_vars <- c("stratum")            # variables of stratification
target_vars <- c("income_hh",
                 "active",
                 "inactive",
                 "unemployed")         # target variables
deff_vars <- "stratum"                 # stratification variables to be used when calculating deff and effst 
                                       #    (n.b: must coincide or be a subset of variables of stratification)
id_PSU <- c("municipality")            # identification variable of PSUs
id_SSU <- c("id_hh")                   # identification variable of SSUs
domain_vars <- c("region")             # domain variables
inp1 <- input_to_beat.2st_1(RGdes,
                            RGcal,
                            id_PSU,
                            id_SSU,
                            strata_vars,
                            target_vars,
                            deff_vars,
                            domain_vars)

and these are the results:

head(inp1$strata)

head(inp1$deff)

head(inp1$effst)

head(inp1$rho)

Build 'PSU' and 'design' dataframes

Prepare the inputs related to the PSUs (function 'input_to_strat.2d_2'), that are

a) the 'des_file' dataframe, containing the following information:

STRATUM: stratum identifier
MOS: measure of size of the stratum (in terms of number of contained selection units)
DELTA: factor that report the average number of SSUs for each selection unit
MINIMUM: minimum number of units to be selected in each PSU

b) the 'PSU_file' dataframe, containing the following information:

stratum identifier
PSU id
PSU_MOS: number of selection units contained in a given PSU

# psu <- read.csv2("psu.csv") # Read the external file containing PSU information
head(psu)
psu_id="municipality"        # Identifier of the PSU
stratum_var="stratum"        # Identifier of the stratum
mos_var="ind"                # Variable to be used as 'measure of size'
delta=1                      # Average number of SSUs for each selection unit
minimum <- 50                # Minimum number of SSUs to be selected in each PSU
inp2 <- input_to_beat.2st_2(psu,
                            psu_id,
                            stratum_var,
                            mos_var,
                            delta,
                            minimum)
head(inp2$psu_file)
head(inp2$des_file)

Check the coherence of populations in strata and PSUs

It may happen that the population in strata (variable 'N' in 'inp1\$strata' dataset) and the one derived by the PSU dataset (variable 'STRAT_MOS' in 'inp2\$des_file' dataset) are not the same.

We can check it by applying the function 'check_input' in this way:

newstrata <- check_input(strata=inp1$strata,
                         des=inp2$des_file,
                         strata_var_strata="STRATUM",
                         strata_var_des="STRATUM")

Together with the print of the differences between the two populations, the function produces a new version of the strata dataset, where the population has been changed to the one derived by the PSUs dataset.

It is preferable to use this new version:

inp1$strata <- newstrata

Optimal allocation of units in each stratum

Using the function 'beat.2st' in 'R2BEAT' package execute the optimization of PSU and SSU allocation in strata:

cv <- as.data.frame(list(DOM=c("DOM1","DOM2"),
                         CV1=c(0.03,0.04),
                         CV2=c(0.06,0.08),
                         CV3=c(0.06,0.08),
                         CV4=c(0.06,0.08)))
cv

stratif = inp1$strata 
errors = cv 
des_file = inp2$des_file 
psu_file = inp2$psu_file 
rho = inp1$rho 
effst = inp1$effst

alloc <- beat.2st(stratif, 
                  errors, 
                  des_file, 
                  psu_file, 
                  rho, 
                  deft_start = NULL, 
                  effst,
                  epsilon1 = 5, 
                  mmdiff_deft = 1,maxi = 15, 
                  epsilon = 10^(-11), minnumstrat = 2, maxiter = 200, maxiter1 = 25)

This is the sensitivity of the solution:

alloc$sensitivity

i.e., for each domain value and for each variable it is reported the gain in terms of reduction in the sample size if the corresponding precision constraint is reduced of 10%.

This are the expected values of the coefficients of variation:

alloc$expected

Selection of PSUs

Using the function 'StratSel' execute the selection of PSU in strata:

allocat <- alloc$alloc[-nrow(alloc$alloc),]
sample_2st <- StratSel(dataPop= inp2$psu_file,
                       idpsu= ~ PSU_ID, 
                       dom= ~ STRATUM, 
                       final_pop= ~ PSU_MOS, 
                       size= ~ PSU_MOS, 
                       PSUsamplestratum= 1, 
                       min_sample= minimum, 
                       min_sample_index= FALSE, 
                       dataAll=allocat,
                       domAll= ~ factor(STRATUM), 
                       f_sample= ~ ALLOC, 
                       planned_min_sample= NULL, 
                       launch= F)

This is the overall sample design:

sample_2st[[2]]

des <- sample_2st[[2]]
des <- des[1:(nrow(des)-1),]
strat <- c(as.character(as.numeric(des$Domain[1:(nrow(des)-1)])),"Tot")
barplot(t(des[1:(nrow(des)),2:3]), names=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, cex.names=0.7)
legend("topleft", 
       legend = c("Self Representative","Non Self Representative"),
       fill = c("darkblue", "red"))
title("Distribution of allocated PSUs by domain")

barplot(t(des[1:(nrow(des)),5:6]), names=strat,
        col=c("darkblue","red"), las=2, xlab = "Stratum", cex.axis=0.7, 
        cex.names=0.7)
legend("topleft", 
       legend = c("Self Representative","Non Self Representative"),
       fill = c("darkblue", "red"))
title("Distribution of allocated SSUs by domain")

and these are the selected PSUs:

selected_PSU <- sample_2st[[4]]
selected_PSU <- selected_PSU[selected_PSU$PSU_final_sample_unit > 0,]
write.table(sample_2st[[4]],"Selected_PSUs.csv",sep=";",row.names=F,quote=F)
head(selected_PSU)