fill.template: Fill the Known Totals Template for a Calibration Task
In DiegoZardetto/ReGenesees: R Evolved Generalized Software for Sampling Estimates and Errors in Surveys

Description

Given a template prepared to store the totals of the auxiliary variables for a specific calibration task, computes the actual values of such totals from a sampling frame.

Usage

fill.template(universe, template, mem.frac = 10)

Arguments

`universe`	Data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables (the sampling frame).
`template`	The template for the calibration task, an object of class `pop.totals`.
`mem.frac`	A non-negative `numeric` value (the default is `10`). It triggers a memory-efficient algorithm when `universe` is very large (see ‘Details’ and ‘Performance’).

Details

Recall that a template object returned by function pop.template has a structure that complies with the standard required by e.calibrate, but is empty, in the sense that all the known totals it must be able to store are missing (NA). Whenever these totals are available to the user only as already computed aggregated values (e.g. because they come from an external source, such as a Population Census), the ReGenesees package cannot fill the template automatically: the user is responsible for putting the right values in the right slots of the prepared template data frame. For this purpose, function pop.desc can be very helpful.

A more convenient alternative arises when a “sampling frame” (that is, a data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables) is available. In such cases, the fill.template function is able to: (i) automatically compute the totals of the auxiliary variables from the universe data frame, and (ii) safely arrange and format these values according to the template structure.
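The idea behind step (i) can be illustrated with base R alone. The following is a minimal sketch, not the package's actual implementation: it assumes a small hypothetical frame (`toy.frame`, with made-up values) and computes the totals as column sums of the calibration model matrix.

```r
# Hypothetical toy sampling frame (values invented for illustration only)
toy.frame <- data.frame(
  area    = factor(c("A", "A", "B", "C", "C", "C")),
  emp.num = c(10, 20, 5, 7, 3, 15)
)

# Unit counts by area: one indicator column per level of 'area'
counts <- colSums(model.matrix(~ area - 1, data = toy.frame))

# Totals of emp.num by area: emp.num interacted with the 'area' indicators
emp.totals <- colSums(model.matrix(~ emp.num:area - 1, data = toy.frame))

counts
emp.totals
```

In the real function these totals are additionally checked against, and arranged according to, the structure of the template.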

Note that fill.template performs a complete coherence check between universe and template. If this check fails, the program stops and prints an error message that should help the user diagnose the cause of the problem. Empty levels of any factor variable belonging to universe are dropped.
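What an "empty level" means can be shown with base R. This is only an illustration using the standard droplevels function on a hypothetical factor `f`; it makes no claim about how fill.template handles such levels internally.

```r
# Hypothetical factor with a declared level ("east") that no unit takes
f <- factor(c("north", "south", "north"),
            levels = c("north", "south", "east"))

levels(f)            # the empty level "east" is still listed
table(f)             # ...with a count of zero

# Dropping empty levels keeps only the levels actually observed
f.clean <- droplevels(f)
levels(f.clean)
```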

Argument mem.frac (whose value must be numeric and non-negative) triggers a memory-efficient algorithm when universe is very large. The only sound reason to change the value of this argument from its default (mem.frac=10) is that an invocation of fill.template caused a memory failure (i.e. an error message beginning with cannot allocate vector of size ...) on your machine. In such a case, increasing the value of mem.frac (e.g. mem.frac=20) will provide a better chance of succeeding (for more details, see the ‘Performance’ section below).

Value

An object of class pop.totals storing the actual values of the population totals for the specified calibration task, ready to be safely passed to e.calibrate.

Performance

Real-world calibration tasks (e.g. in the field of Official Statistics) can simultaneously involve hundreds of auxiliary variables and refer to target populations of several million units. In such circumstances, the naive aggregation of the calibration model.matrix of universe may turn out to be too memory-demanding (at least in ordinary PC environments) and cause a memory-failure error.

The alternative implemented in fill.template is to: (i) split universe into chunks, (ii) compute partial sums of the auxiliary variables chunk by chunk, (iii) update template by progressively adding such partial sums. This alternative is triggered by parameter mem.frac, which also implicitly controls the number of chunks. The function estimates the memory that would be used to store the full model.matrix of universe and compares it to 4 GB: if the resulting ratio is bigger than 1/mem.frac, the memory-efficient algorithm starts; the number of chunks into which universe is then split is determined in such a way that the memory needed to store the model.matrix of each chunk does not exceed a fraction 1/mem.frac of 4 GB.
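The chunking logic described above can be sketched in base R. The code below is an illustration under stated assumptions, not the package's source: the helpers `n.chunks` and `chunked.totals` are hypothetical names, the memory estimate assumes a dense numeric matrix at 8 bytes per cell, and 4 GB is taken as 4 * 1024^3 bytes.

```r
# Hypothetical helper: smallest number of chunks such that each chunk's
# model matrix stays within a fraction 1/mem.frac of 4 GB (assumption:
# dense numeric storage, 8 bytes per cell).
n.chunks <- function(n.rows, n.cols, mem.frac = 10) {
  ratio <- (n.rows * n.cols * 8) / (4 * 1024^3)
  if (ratio <= 1 / mem.frac) return(1L)   # no chunking needed
  as.integer(ceiling(ratio * mem.frac))
}

# Hypothetical helper: totals computed chunk by chunk and accumulated
chunked.totals <- function(frame, calmodel, k) {
  n <- nrow(frame)
  chunk.id <- ceiling(seq_len(n) * k / n)  # contiguous chunks 1..k
  totals <- NULL
  for (i in seq_len(k)) {
    # Factor levels are kept when subsetting rows, so every chunk
    # yields a model matrix with the same columns
    X <- model.matrix(calmodel, data = frame[chunk.id == i, , drop = FALSE])
    part <- colSums(X)
    totals <- if (is.null(totals)) part else totals + part
  }
  totals
}

# 5 million rows and 200 model-matrix columns (made-up sizes)
n.chunks(5e6, 200)                 # 19 chunks with the default mem.frac
n.chunks(5e6, 200, mem.frac = 20)  # a larger mem.frac gives more, smaller chunks

# Chunked and naive aggregation agree on a small simulated frame
set.seed(1)
toy.frame <- data.frame(
  area    = factor(sample(c("A", "B", "C"), 1000, replace = TRUE)),
  emp.num = rpois(1000, 20)
)
full    <- colSums(model.matrix(~ emp.num:area - 1, data = toy.frame))
chunked <- chunked.totals(toy.frame, ~ emp.num:area - 1, k = 7)
all.equal(full, chunked)
```

Only one chunk's model matrix is held in memory at a time, which is why raising mem.frac lowers the peak memory requirement at the cost of more iterations.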

Whenever fill.template switches to the memory-efficient "chunking" algorithm, a warning message signals the switch and reports the number of chunks being processed.

Author(s)

Diego Zardetto

References

Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi: https://doi.org/10.1515/jos-2015-0013.

Examples

# Load sbs data:
data(sbs)

# Build a design object:
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,fpc=~fpc)


###########################
# A simple example first. #
###########################

# Suppose you want to calibrate on the enterprise counts inside areas
  # 1) Build the population totals template:
pop<-pop.template(sbsdes, calmodel=~area-1)

 # Note: given the dimension of the obtained template...
dim(pop)

 # ...the number of known totals to be stored is 24 (one for each area).
 
 # 2) Use the fill.template function to (i) automatically compute
 #    such 24 totals from the universe (sbs.frame) and (ii) safely fill
 #    the template:
pop<-fill.template(universe=sbs.frame,template=pop) 
pop

 # 3) Lastly calibrate, e.g. with the unbounded linear distance and
 #    heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf)) 


########################################
# A more involved (two-sided) example. #
########################################

# Now suppose you have to perform a calibration process which
# exploits as auxiliary information the total number of employees (emp.num)
# and enterprises (ent) inside the domains obtained by:
#  i) crossing nace2 and region;
# ii) crossing emp.cl, region and nace.macro;

# Due to the fact that nace2 is nested into nace.macro,
# the calibration model can be efficiently factorized as follows:
## 1) Add to the design object and universe the new compressed
 #    factor variable involving nested factors, namely:
sbsdes<-des.addvars(sbsdes,nace2.in.nace.macro=nace2 %into% nace.macro)
sbs.frame$nace2.in.nace.macro<-sbs.frame$nace2 %into% sbs.frame$nace.macro

  # 2) Build the template exploiting the new variable:
pop<-pop.template(sbsdes,
     calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
     partition=~nace.macro:region)

 # Note: given the dimension of the obtained template...
dim(pop)

 # ...the number of known totals to be stored is 792.
 
 # 3) Use the fill.template function to (i) automatically compute
 #    such 792 totals from the universe (sbs.frame) and (ii) safely fill
 #    the template:
pop<-fill.template(universe=sbs.frame,template=pop)

 # Note: out of the 792 known totals in pop, only non-zero entries are actually
 # relevant

 # 4) Lastly calibrate, e.g. with the unbounded linear distance and
 #    heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))

# Note: a global calibration task would have led to identical calibrated
# weights, but in a more memory-hungry and time-consuming way, as you can
# verify:
  # 1) Build template:
pop.g<-pop.template(sbsdes,
       calmodel=~(emp.num+ent):(nace2:region + emp.cl:nace.macro:region)-1)
dim(pop.g)

  # 2) Fill template:
pop.g <- fill.template(sbs.frame,pop.g)

  # 3) Calibrate globally:
## Not run: 
sbscal.g<-e.calibrate(sbsdes,pop.g,sigma2=~emp.num,bounds=c(-1E6,1E6))

  # 4) Compare calibrated weights (factorized vs. global solution):
range(weights(sbscal)/weights(sbscal.g))

  # ... they are equal.

## End(Not run)


###########################################################
# Just a single example of the memory-efficient algorithm #
# triggered by argument 'mem.frac'.                       #
###########################################################
## Not run: 
 # First artificially increase the size of the sampling frame (e.g.
 # up to 5 million rows):
sbs.frame.HUGE<-sbs.frame[sample(seq_len(nrow(sbs.frame)),5000000,replace=TRUE),]
dim(sbs.frame.HUGE)
 
 # Build the template:
pop<-pop.template(sbsdes,
     calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
     partition=~nace.macro:region)
dim(pop)

 # Fill the template by using the HUGE universe:
pop<-fill.template(universe=sbs.frame.HUGE,template=pop)

## End(Not run)

DiegoZardetto/ReGenesees documentation built on Dec. 16, 2024, 2:03 p.m.