fill.template | R Documentation |
Given a template prepared to store the totals of the auxiliary variables for a specific calibration task, computes the actual values of such totals from a sampling frame.
fill.template(universe, template, mem.frac = 10)
universe |
Data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables (the sampling frame). |
template |
The template for the calibration task, an object of class |
mem.frac |
A |
Recall that a template
object returned by function pop.template
has a structure that complies with the standard required by e.calibrate
, but is empty, in the sense that all the known totals it must be able to store are missing (NA
). Whenever these totals are available to the user as such, that is in the form of already computed aggregated values (e.g. because they come from an external source, like a Population Census), the ReGenesees package cannot automatically fill the template. Stated more explicitly: the user himself has to bear the responsibility of putting the right values in the right slots of the prepared template
data frame. To this end, function pop.desc
could be very helpful.
A lucky alternative arises when a “sampling frame” (that is a data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables) is available. In such cases, indeed, the fill.template
function is able to: (i) automatically compute the totals of the auxiliary variables from the universe
data frame, (ii) safely arrange and format these values according to the template
structure.
Notice that fill.template
will perform a complete coherence check between universe
and template
. If this check fails, the program stops and prints an error message: the meaning of the message should help the user diagnose the cause of the problem. Should empty levels be present in any factor variable belonging to universe
, they would be dropped.
Argument mem.frac
(whose value must be numeric and non-negative) triggers a memory-efficient algorithm when universe is really huge. The only sound reason to ever change the value of this argument from its default (mem.frac=10
) is that an invocation of fill.template
caused a memory-failure (i.e. a messages beginning cannot allocate vector of size ...
) on your machine. In such a case, increasing the value of mem.frac
(e.g. mem.frac=20
) will provide a better chance of succeeding (for more details, see ‘Performance’ section below).
An object of class pop.totals
storing the actual values of the population totals for the specified calibration task, ready to be safely passed to e.calibrate
.
Real-world calibration tasks (e.g. in the field of Official Statistics) can simultaneously involve hundreds of auxiliary variables and refer to target populations of several million units. In such circumstances, the naive aggregation of the calibration model.matrix
of universe
may turn out to be too memory-demanding (at least in ordinary PC environments) and determine a memory-failure error.
The alternative implemented in fill.template
is to: (i) split universe
in chunks, (ii) compute partial sums of auxiliary variables chunk-by-chunk, (iii) update template
by adding progressively such partial sums. This alternative is triggered by parameter mem.frac
, which also implicitly controls the number of chunks. The function estimates the memory that would be used to store the full model.matrix
of universe
and compares it to 4 GB: if the resulting ratio is bigger than 1/mem.frac
, the memory-efficient algorithm starts; the number of chunks in which universe
will then be split is determined in such a way that the memory needed to store the model.matrix
of each chunk does not exceed a fraction 1/mem.frac
of 4 GB.
Whenever fill.template
switches to the memory-efficient "chunking" algorithm, a warning message will signal it and will specify as well the number of chunks that are being processed.
Diego Zardetto
Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi: https://doi.org/10.1515/jos-2015-0013.
e.calibrate
for calibrating weights, pop.template
for the definition of the class pop.totals
and to build a "template" data frame for known population totals, pop.desc
to provide a natural language description of the template structure, and %into%
for the compression operator for nested factors.
# Load sbs data:
data(sbs)
# Build a design object:
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,fpc=~fpc)
###########################
# A simple example first. #
###########################
# Suppose you want to calibrate on the enterprise counts inside areas
# 1) Build the population totals template:
pop<-pop.template(sbsdes, calmodel=~area-1)
# Note: given the dimension of the obtained template...
dim(pop)
# ...the number of known totals to be stored is 24 (one for each area).
# 2) Use the fill.template function to (i) automatically compute
# such 24 totals from the universe (sbs.frame) and (ii) safely fill
# the template:
pop<-fill.template(universe=sbs.frame,template=pop)
pop
# 3) Lastly calibrate, e.g. with the unbounded linear distance and
# heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))
########################################
# A more involved (two-sided) example. #
########################################
# Now suppose you have to perform a calibration process which
# exploits as auxiliary information the total number of employees (emp.num)
# and enterprises (ent) inside the domains obtained by:
# i) crossing nace2 and region;
# ii) crossing emp.cl, region and nace.macro;
# Due to the fact that nace2 is nested into nace.macro,
# the calibration model can be efficiently factorized as follows:
## 1) Add to the design object and universe the new compressed
# factor variable involving nested factors, namely:
sbsdes<-des.addvars(sbsdes,nace2.in.nace.macro=nace2 %into% nace.macro)
sbs.frame$nace2.in.nace.macro<-sbs.frame$nace2 %into% sbs.frame$nace.macro
# 2) Build the template exploiting the new variable:
pop<-pop.template(sbsdes,
calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
partition=~nace.macro:region)
# Note: given the dimension of the obtained template...
dim(pop)
# ...the number of known totals to be stored is 792.
# 3) Use the fill.template function to (i) automatically compute
# such 792 totals from the universe (sbs.frame) and (ii) safely fill
# the template:
pop<-fill.template(universe=sbs.frame,template=pop)
# Note: out of the 792 known totals in pop, only non-zero entries are actually
# relevant
# 4) Lastly calibrate, e.g. with the unbounded linear distance and
# heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))
# Note: a global calibration task would have led to identical calibrated
# weights, but in a more memory-hungry and time-consuming way, as you can
# verify:
# 1) Build template:
pop.g<-pop.template(sbsdes,
calmodel=~(emp.num+ent):(nace2:region + emp.cl:nace.macro:region)-1)
dim(pop.g)
# 2) Fill template:
pop.g <- fill.template(sbs.frame,pop.g)
# 3) Calibrate globally:
## Not run:
sbscal.g<-e.calibrate(sbsdes,pop.g,sigma2=~emp.num,bounds=c(-1E6,1E6))
# 4) Compare calibrated weights (factorized vs. global solution):
range(weights(sbscal)/weights(sbscal.g))
# ... they are equal.
## End(Not run)
###########################################################
# Just a single example of the memory-efficient algorithm #
# triggered by argument 'mem.frac'. #
###########################################################
## Not run:
# First artificially increase the size of the sampling frame (e.g.
# up to 5 million rows):
sbs.frame.HUGE<-sbs.frame[sample(1:nrow(sbs.frame),5000000,rep=TRUE),]
dim(sbs.frame.HUGE)
# Build the template:
pop<-pop.template(sbsdes,
calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
partition=~nace.macro:region)
dim(pop)
# Fill the template by using the HUGE universe:
pop<-fill.template(universe=sbs.frame.HUGE,template=pop)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.