prep.calBeta | R Documentation |
Prepare survey data and control totals to run a calibration task on multiple regression coefficients.
prep.calBeta(design, model, Beta,
by = NULL, partition = FALSE, drop.z = TRUE)
pop.calBeta(design)
design |
Object of class |
model |
Formula (or list of formulas) specifying the linear model(s) whose regression coefficients will be used as calibration benchmarks (see ‘Details’). |
Beta |
Numeric vector (or list of numeric vectors) of regression coefficients to be used as calibration benchmarks (see ‘Details’). |
by |
If regression coefficients are known at domain level (i.e. within subpopulations), a formula specifying the domains (see ‘Details’). Specify |
partition |
In case domains have been specified through argument |
drop.z |
Can the prepared calibration variables (see ‘Details’) be dropped upon completion of the calibration task? Specify |
Special Purpose Calibration tasks
ReGenesees 2.1 introduced support for ‘special purpose calibration’ tasks, i.e. facilities to calibrate survey weights so as to match complex population parameters, instead of ordinary population totals.
When calibration benchmarks come in the form of population parameters that are complex, non-linear functions of auxiliary variables (like, in the present case, multiple regression coefficients), calibration constraints have to be linearized first. This generates synthetic linearized variables z, which are the actual calibration variables the calibration algorithm will eventually work on. Typically, control totals for these synthetic linearized variables z will all be zero. See, e.g. [Lesage, 2011].
Put briefly:
Function prep.calBeta
generates and binds to design
the synthetic linearized variables z needed for the calibration task.
Then, given the prepared design
object returned by prep.calBeta
, function pop.calBeta
generates the control totals data frame for those z variables.
Of course, once prepared, survey data and control totals returned by the above functions will be fed in input to function e.calibrate
, which will run the calibration task.
Function prep.calBeta()
Function prep.calBeta
makes it possible to:
Calibrate on regression coefficients of different linear models, each known at the overall population level.
Calibrate on regression coefficients of a single linear model, known possibly for different subpopulations.
Note that, as detailed below, the key argument to switch from case 1. to case 2. is by
. For extensive illustration of both the above cases 1. and 2., please see the ‘Examples’ section.
The mandatory argument model
specifies the linear model(s) whose regression coefficients will be used as calibration benchmarks. Use a formula
object to specify a single linear model, or a list
of formula
objects to specify several different linear models. For details on model specification, see e.g. lm
.
The design
variables referenced by model
formula(s) must be numeric
or factor
and must not contain any missing value (NA
).
The mandatory argument Beta
specifies the vector(s) of regression coefficients that will be used as calibration benchmarks. Use a numeric
vector to specify regression coefficients of a single linear model, or a list
of numeric
vectors to specify regression coefficients of several different linear models. The Beta
vector(s) must not contain any missing value (NA
).
If argument model
is passed as a list, argument Beta
must be passed as a list too, and the length
of both must be the same. If this is the case, Beta
vectors will be positionally tied to linear model formulas contained in model
.
Each vector of regression coefficients specified through Beta
must be consistent with the corresponding model
, namely match its model matrix columns (as generated using actual design
data). Function prep.calBeta
will check for this consistency and raise an informative error message in case of failure.
Note that, in order to ensure consistency with model
, not only the length
, but also the order of Beta
elements matters. If in doubt, you can easily learn about the right ordering of Beta
coefficients, given model
, by calling function svystatB
as follows: svystatB(design, model)
. This will, of course, return the estimated regression coefficients of model
before calibration.
Note that each Beta
vector can have names
or not. If a Beta
vector has names that match the expected ones (given the corresponding model
formula), but appear in a different order, then function prep.calBeta
will suitably re-order them and inform you with a warning
message.
As anticipated, argument by
can be used to enable calibration on regression coefficients known at domain level (i.e. within subpopulations).
If passed, by
must be a formula: for example, the statement by=~B1:B2
defines as domains the subpopulations determined by crossing the modalities of variables B1
and B2
. Notice that a formula like by=~B1+B2
will be automatically translated into the factor-crossing formula by=~B1:B2
. The design
variables referenced by by
(if any) should be of type factor
, otherwise they will be coerced. Note that, to prevent obvious collinearity issues, the variables referenced by argument by
must not appear in the input model
formula: otherwise, the program will stop and print an error message.
If you specify domains through argument by
, you will be allowed to specify just a single linear model through argument model
. Instead, through argument Beta
, you will have to specify domain-level vectors of regression coefficients. Therefore, Beta
will be necessarily passed as a list. Note that, in this case, the Beta
list must have as many components as the domains defined through argument by
, and the two will be matched positionally by function prep.calBeta
. Therefore, the order of elements in the Beta
list matters. Specifically, Beta
vectors must appear in the same order as the domains specified by argument by
. You can easily learn about the right ordering of by
domains (and hence Beta
elements) by calling function svystatB
as follows: svystatB(design, model, by)
. This will, of course, return the estimated regression coefficients of model
within by
domains before calibration.
Lastly, if you specify domains through argument by
, you will be allowed to decide whether design
should be prepared for a global or a partitioned calibration task (see e.calibrate
). Recall that partitioned calibration tasks are computationally more efficient, but produce exactly the same results. Note that, if you select partition = TRUE
, than the calibration domains used to split the global calibration task will be the same domains specified via argument by
. More explicitly: function e.calibrate
will eventually be called on prep.calBeta
's output object silently assuming partition = by
.
The synthetic linearized variables z prepared by prep.calBeta
will eventually be used by function e.calibrate
to solve the calibration task. Argument drop.z
allows you to instruct e.calibrate
to drop these variables from its output object - or rather keep them within it - upon completion of the calibration task. The default has been set to TRUE
to reduce memory usage. In case you want to be able to consistently trim weights after calibration via function trimcal
, you must specify drop.z = FALSE
.
Function pop.calBeta()
Given the prepared design
object returned by prep.calBeta
, function pop.calBeta
generates the control totals data frame needed to run the calibration task. This dataframe suitably accomodates the control totals of the synthetic linearized variables z prepared by prep.calBeta
.
Note that the data frame object returned by pop.calBeta
is already filled, with control totals that are all zero, and it is ready to be directly passed to function e.calibrate
(see the ‘Examples’ section).
Note, lastly, that printing this control totals data frame might not be very telling: to better understand its structure you should instead leverage function pop.desc
, for which a method dedicated to ‘special purpose calibration’ tasks is available (see the ‘Examples’ section).
For function pop.calBeta
, an object of the same class as design
, storing the freshly created synthetic linearized variables z as columns of its $variables
slot (see the ‘Examples’ section).
For function pop.calBeta
, a control totals data frame for those z variables, with class spc.pop
(see the ‘Examples’ section).
Contrary to ordinary calibration tasks, the control totals of a ‘special purpose calibration’ task are typically all zero. Therefore, calibration constraints of such tasks - unlike ordinary ones - define a system of linear equations that is homogeneous. Homogeneous systems always admit the trivial zero solution, which, in calibration terms, would mean output weights that are identically zero (and thus statistically unacceptable). For this reason, the possibility that function e.calibrate
ends up with a false convergence of the special purpose calibration task cannot, in general, be ruled out. Note that functions e.calibrate
and check.cal
are able to detect such false convergence events and warn the user about it.
A satisfactory countermeasure to this issue is to set calibration bounds to any interval that does not include zero (see argument bounds
of e.calibrate
).
A second, very attractive alternative would be to run the special purpose calibration task jointly with some ordinary calibration task, as the joint calibration constraints would then define a non-homogeneous system. This solution can be obtained straightforwardly using function pop.fuse
.
Diego Zardetto
Lesage, E. (2011). “The use of estimating equations to perform a calibration on complex parameters”. Survey Methodology. 37.
e.calibrate
to calibrate weights, svystatB
to compute estimates and sampling errors of Multiple Regression Coefficients, pop.desc
to obtain a natural language description of control totals (including those for special purpose calibration tasks), pop.fuse
to fuse population totals prepared for ordinary and special purpose calibration tasks.
# Load sbs data:
data(sbs)
# Create a design object:
sbsdes <- e.svydesign(data = sbs, ids = ~id, strata = ~strata,
weights = ~weight, fpc = ~fpc)
#############################################################################
# Calibrate on the regression coefficients of a *single model* known at the #
# *overall population level* #
#############################################################################
# Suppose you know the coefficients of the following linear model, which you
# obtained fitting the model on some external source (e.g. Census or register
# data):
model0 <- y ~ emp.num*emp.cl
# Here, use the sbs sampling frame available in ReGenesees to simulate the
# external source and compute the values of the regression coefficients:
B0 <- coef(lm(model0, data = sbs.frame))
B0
# Now, prepare the survey design for calibration:
sbsdes0 <- prep.calBeta(design = sbsdes, model = model0, Beta = B0)
# Have a look at the freshly created *synthetic* auxiliary variables:
head(sbsdes0$variables)
# Then, prepare the control totals dataframe for the calibration task:
pop0 <- pop.calBeta(sbsdes0)
# Have a look...
class(pop0)
pop.desc(pop0)
# ...and note that all the control totals are zero!
pop0
# Lastly, calibrate:
sbscal0 <- e.calibrate(sbsdes0, pop0)
# Check that calibration estimates of regression coefficients now match the
# input B0 values derived from the external source:
svystatB(sbscal0, model0)
B0
# OK
#########################################################################
# Calibrate simultaneously on the regression coefficients of *different #
# models* known at the *overall population level* #
#########################################################################
# Suppose you know the coefficients of the following linear models, which you
# obtained fitting the models on some external sources
model1 <- va.imp2 ~ emp.num:emp.cl + nace.macro - 1
model2 <- y ~ dom3 - 1
# Here, use the sbs sampling frame available in ReGenesees to simulate the
# external sources and compute the values of the regression coefficients:
B1 <- coef(lm(model1, data = sbs.frame))
B2 <- coef(lm(model2, data = sbs.frame))
## First, just for illustration, calibrate only on B1 regression coefficients:
sbsdes1 <- prep.calBeta(sbsdes, model = model1, Beta = B1)
pop1 <- pop.calBeta(sbsdes1)
sbscal1 <- e.calibrate(sbsdes1, pop1)
# Check that calibration estimates of regression coefficients now match the
# input B1 values derived from the external source...
svystatB(sbscal1, model1)
B1
# ...but, of course, do *not* match those of B2:
svystatB(sbscal1, model2)
B2
# OK
## Second, just for illustration, calibrate only on B2 regression coefficients:
sbsdes2 <- prep.calBeta(sbsdes, model = model2, Beta = B2)
pop2 <- pop.calBeta(sbsdes2)
sbscal2 <- e.calibrate(sbsdes2, pop2)
# Check that calibration estimates of regression coefficients now match the
# input B2 values derived from the external source...
svystatB(sbscal2, model2)
B2
# ...but, of course, do *not* match those of B1:
svystatB(sbscal2, model1)
B1
# OK
## Now, calibrate *simultaneously* on *B1 and B2* regression coefficients
# Prepare the survey design for the joint calibration task:
sbsdes1_2 <- prep.calBeta(sbsdes, model = list(model1, model2), Beta = list(B1, B2))
# Prepare the control totals dataframe for the joint calibration task:
pop1_2 <- pop.calBeta(sbsdes1_2)
# Have a look to the control totals (note the presence of two models and
# Beta vectors):
pop.desc(pop1_2)
pop1_2
# Lastly, run the calibration:
sbscal1_2 <- e.calibrate(sbsdes1_2, pop1_2)
# Check that calibration estimates of regression coefficients now match *both*
# the B1 and B2 values derived from the external sources:
svystatB(sbscal1_2, model1)
B1
svystatB(sbscal1_2, model2)
B2
# OK
###############################################################################
# Calibrate simultaneously on the regression coefficients of a *single model* #
# known for *different subpopulations* #
###############################################################################
# NOTE: In this case, both *global* and *partitioned* calibration tasks are
# possible, and both will be illustrated below.
# Suppose you know the coefficients of the following linear model, which you
# obtained fitting the model on some external source (e.g. Census or register
# data) *within subpopulations* defined by some factor variable(s):
model <- va.imp2 ~ emp.num:emp.cl + nace.macro - 1
# Here, use the sbs sampling frame available in ReGenesees to simulate the
# external source and suppose the subpopulations are defined by variable 'dom3'.
# Thus, compute the values of the regression coefficients as follows:
B <- by(sbs.frame, sbs.frame$dom3, function(df) coef(lm(model, data = df)))
B
## Let's start with the *global* solution
# Prepare the survey design for the calibration task (note that the 'by'
# argument is used):
sbsdes.g <- prep.calBeta(sbsdes, model, Beta = B, by = ~dom3)
# Prepare the control totals for the calibration task:
pop.g <- pop.calBeta(sbsdes.g)
# Have a look to the control totals (note the presence of one Beta vector *for
# each domain*):
pop.desc(pop.g)
pop.g
# Run the calibration:
sbscal.g <- e.calibrate(sbsdes.g, pop.g)
# Check that calibration estimates of regression coefficients now match the
# input B values derived from the external source *for each domain*:
svystatB(sbscal.g, model, ~dom3)
B
# OK
## Let's proceed with the *partitioned* solution
# Prepare the survey design for the calibration task (note that 'by' and
# 'partition' arguments are used):
sbsdes.p <- prep.calBeta(sbsdes, model, Beta = B, by = ~dom3, partition = TRUE)
# Prepare the control totals for the calibration task:
pop.p <- pop.calBeta(sbsdes.p)
# Have a look to the control totals (note the presence of one Beta vector *for
# each domain*):
pop.desc(pop.p)
pop.p
# Run the calibration:
sbscal.p <- e.calibrate(sbsdes.p, pop.p)
# Check that calibration estimates of regression coefficients now match the
# input B values derived from the external source *for each domain*:
svystatB(sbscal.p, model, ~dom3)
B
# OK
## Lastly, check that calibration weights obtained using the *global* and
## *partitioned* solution are the same:
g.range(sbscal.g)
g.range(sbscal.p)
all.equal(weights(sbscal.g), weights(sbscal.p))
# OK
###########################################################
# BONUS TIP: Calibration on the mean (or on domain means) #
# of one variable or multiple variables. #
###########################################################
# Since the domain mean of a numeric variable can be thought as a
# regression coefficient (see the 'Examples' section of ?svystatB),
# you can use the ReGenesees facilities documented above to
# *calibrate on the mean (or on domain means)* of *one variable
# or multiple variables*.
# NOTE: The examples below cover the following cases of calibration on:
# (i) The overall mean of a single variable.
# (ii) The means of a single variable within domains of just
# one type.
# (iii) The domain means of several variables, with multiple
# and different domain types.
# Load artificial household survey data and define a survey design:
data(data.examples)
exdes <- e.svydesign(data = example, ids = ~towcod + famcod,
strata = ~SUPERSTRATUM, weights = ~weight)
exdes <- des.addvars(exdes, ones = 1)
## CASE (i): Calibrate on the overall mean of a single variable
# Suppose you know with satisfactory accuracy the *average income* of your
# target population (but you do *not* have reliable information on the
# *total income*, nor on the *total number of individuals*):
income.AVG <- 1270
# You can calibrate on *average income* as follows:
exdes.new <- prep.calBeta(exdes, income ~ 1, Beta = income.AVG)
pop.new <- pop.calBeta(exdes.new)
excal.new <- e.calibrate(exdes.new, pop.new)
# Now, check that calibration estimate of average income now match the known
# value derived from the external source *without residual uncertainty*:
svystatTM(excal.new, ~income, estimator = "Mean")
income.AVG
# ...while there is *residual uncertainty* in the estimates of the numerator and
# denominator totals:
svystatTM(excal.new, ~income + ones, estimator = "Total")
# OK
## CASE (ii): Calibrate on the means of a single variable within domains
# of just one kind
# You can calibrate on *domain means* along the lines illustrated above (note,
# however, that argument 'by' would provide an alternative way to achieve
# the same result).
# Suppose you know with satisfactory accuracy the *average income* by the
# crossclassification of sex and marital status:
income.AVG.sex.marstat <- c(f.married = 1310, m.married = 1260,
f.unmarried = 1150, m.unmarried = 1200,
f.widowed = 1380, m.widowed = 1300)
# Run the calibration on *average income by sex:marstat* as follows:
exdes.new <- prep.calBeta(exdes, income ~ sex:marstat -1,
Beta = income.AVG.sex.marstat)
pop.new <- pop.calBeta(exdes.new)
excal.new <- e.calibrate(exdes.new, pop.new)
# Now, check that calibration estimates of average income by domains now match
# the known values derived from the external source *without residual
# uncertainty*:
svystatTM(excal.new, ~income, ~sex:marstat, estimator = "Mean")
income.AVG.sex.marstat
# ...while there is *residual uncertainty* in the estimates of the numerator and
# denominator totals:
svystatTM(excal.new, ~income + ones, ~sex:marstat, estimator = "Total")
# OK
## CASE (iii): Calibrate on the domain means of several variables, with multiple
# and different domain types.
# Suppose you know with satisfactory accuracy:
# - the average income by sex:
# - the average income by marstat:
# - the average of variable z by age (variable 'age5c', 5 classes):
income.AVG.sex <- c("f" = 1245, "m" = 1250)
income.AVG.marstat <- c("married" = 1260, "unmarried" = 1230, "widowed" = 1290)
z.AVG.age5c <- c("1" = 125, "2" = 130, "3" = 135, "4" = 125, "5" = 140)
# Run the calibration as follows:
exdes.new <- prep.calBeta(exdes, model = list(income ~ sex -1,
income ~ marstat -1,
z ~ age5c -1),
Beta = list(income.AVG.sex,
income.AVG.marstat,
z.AVG.age5c)
)
pop.new <- pop.calBeta(exdes.new)
excal.new <- e.calibrate(exdes.new, pop.new)
# Now, check that calibration estimates match the known domain means derived
# from the external source:
svystatTM(excal.new, ~income, ~sex, estimator = "Mean")
income.AVG.sex
svystatTM(excal.new, ~income, ~marstat, estimator = "Mean")
income.AVG.marstat
svystatTM(excal.new, ~z, ~age5c, estimator = "Mean")
z.AVG.age5c
# OK
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.