prep.calBeta: Calibration on Multiple Regression Coefficients
In DiegoZardetto/ReGenesees: R Evolved Generalized Software for Sampling Estimates and Errors in Surveys

prep.calBeta

R Documentation

Calibration on Multiple Regression Coefficients

Description

Prepare survey data and control totals to run a calibration task on multiple regression coefficients.

Usage

prep.calBeta(design, model, Beta,
             by = NULL, partition = FALSE, drop.z = TRUE)

pop.calBeta(design)

Arguments

`design`	Object of class `analytic` (or inheriting from it) containing survey data and sampling design metadata. For function `pop.calBeta`, a prepared design object as returned by function `prep.calBeta`.
`model`	Formula (or list of formulas) specifying the linear model(s) whose regression coefficients will be used as calibration benchmarks (see ‘Details’).
`Beta`	Numeric vector (or list of numeric vectors) of regression coefficients to be used as calibration benchmarks (see ‘Details’).
`by`	If regression coefficients are known at domain level (i.e. within subpopulations), a formula specifying the domains (see ‘Details’). Specify `NULL` (the default option) if the regression coefficients are known at the overall population level.
`partition`	In case domains have been specified through argument `by`, should a partitioned calibration task be performed? The default is `FALSE`, which selects a global calibration task.
`drop.z`	Can the prepared calibration variables (see ‘Details’) be dropped upon completion of the calibration task? Specify `TRUE` (the default) if you want to save memory space.

Details

Special Purpose Calibration tasks
ReGenesees 2.1 introduced support for ‘special purpose calibration’ tasks, i.e. facilities to calibrate survey weights so as to match complex population parameters, instead of ordinary population totals.

When calibration benchmarks come in the form of population parameters that are complex, non-linear functions of auxiliary variables (like, in the present case, multiple regression coefficients), calibration constraints have to be linearized first. This generates synthetic linearized variables z, which are the actual calibration variables the calibration algorithm will eventually work on. Typically, control totals for these synthetic linearized variables z will all be zero. See, e.g. [Lesage, 2011].

Put briefly:

Function prep.calBeta generates and binds to design the synthetic linearized variables z needed for the calibration task.
Then, given the prepared design object returned by prep.calBeta, function pop.calBeta generates the control totals data frame for those z variables.

Of course, once prepared, survey data and control totals returned by the above functions will be fed in input to function e.calibrate, which will run the calibration task.

Function prep.calBeta()
Function prep.calBeta makes it possible to:

Calibrate on regression coefficients of different linear models, each known at the overall population level.
Calibrate on regression coefficients of a single linear model, known possibly for different subpopulations.

Note that, as detailed below, the key argument to switch from case 1. to case 2. is by. For extensive illustration of both the above cases 1. and 2., please see the ‘Examples’ section.

The mandatory argument model specifies the linear model(s) whose regression coefficients will be used as calibration benchmarks. Use a formula object to specify a single linear model, or a list of formula objects to specify several different linear models. For details on model specification, see e.g. lm.
The design variables referenced by model formula(s) must be numeric or factor and must not contain any missing value (NA).

The mandatory argument Beta specifies the vector(s) of regression coefficients that will be used as calibration benchmarks. Use a numeric vector to specify regression coefficients of a single linear model, or a list of numeric vectors to specify regression coefficients of several different linear models. The Beta vector(s) must not contain any missing value (NA).

If argument model is passed as a list, argument Beta must be passed as a list too, and the length of both must be the same. If this is the case, Beta vectors will be positionally tied to linear model formulas contained in model.

Each vector of regression coefficients specified through Beta must be consistent with the corresponding model, namely match its model matrix columns (as generated using actual design data). Function prep.calBeta will check for this consistency and raise an informative error message in case of failure.
Note that, in order to ensure consistency with model, not only the length, but also the order of Beta elements matters. If in doubt, you can easily learn about the right ordering of Beta coefficients, given model, by calling function svystatB as follows: svystatB(design, model). This will, of course, return the estimated regression coefficients of model before calibration.
Note that each Beta vector can have names or not. If a Beta vector has names that match the expected ones (given the corresponding model formula), but appear in a different order, then function prep.calBeta will suitably re-order them and inform you with a warning message.

As anticipated, argument by can be used to enable calibration on regression coefficients known at domain level (i.e. within subpopulations).

If passed, by must be a formula: for example, the statement by=~B1:B2 defines as domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced. Note that, to prevent obvious collinearity issues, the variables referenced by argument by must not appear in the input model formula: otherwise, the program will stop and print an error message.

If you specify domains through argument by, you will be allowed to specify just a single linear model through argument model. Instead, through argument Beta, you will have to specify domain-level vectors of regression coefficients. Therefore, Beta will be necessarily passed as a list. Note that, in this case, the Beta list must have as many components as the domains defined through argument by, and the two will be matched positionally by function prep.calBeta. Therefore, the order of elements in the Beta list matters. Specifically, Beta vectors must appear in the same order as the domains specified by argument by. You can easily learn about the right ordering of by domains (and hence Beta elements) by calling function svystatB as follows: svystatB(design, model, by). This will, of course, return the estimated regression coefficients of model within by domains before calibration.

Lastly, if you specify domains through argument by, you will be allowed to decide whether design should be prepared for a global or a partitioned calibration task (see e.calibrate). Recall that partitioned calibration tasks are computationally more efficient, but produce exactly the same results. Note that, if you select partition = TRUE, than the calibration domains used to split the global calibration task will be the same domains specified via argument by. More explicitly: function e.calibrate will eventually be called on prep.calBeta's output object silently assuming partition = by.

The synthetic linearized variables z prepared by prep.calBeta will eventually be used by function e.calibrate to solve the calibration task. Argument drop.z allows you to instruct e.calibrate to drop these variables from its output object - or rather keep them within it - upon completion of the calibration task. The default has been set to TRUE to reduce memory usage. In case you want to be able to consistently trim weights after calibration via function trimcal, you must specify drop.z = FALSE.

Function pop.calBeta()
Given the prepared design object returned by prep.calBeta, function pop.calBeta generates the control totals data frame needed to run the calibration task. This dataframe suitably accomodates the control totals of the synthetic linearized variables z prepared by prep.calBeta.

Note that the data frame object returned by pop.calBeta is already filled, with control totals that are all zero, and it is ready to be directly passed to function e.calibrate (see the ‘Examples’ section).

Note, lastly, that printing this control totals data frame might not be very telling: to better understand its structure you should instead leverage function pop.desc, for which a method dedicated to ‘special purpose calibration’ tasks is available (see the ‘Examples’ section).

Value

For function pop.calBeta, an object of the same class as design, storing the freshly created synthetic linearized variables z as columns of its $variables slot (see the ‘Examples’ section).
For function pop.calBeta, a control totals data frame for those z variables, with class spc.pop (see the ‘Examples’ section).

Linear Algebra Remark

Contrary to ordinary calibration tasks, the control totals of a ‘special purpose calibration’ task are typically all zero. Therefore, calibration constraints of such tasks - unlike ordinary ones - define a system of linear equations that is homogeneous. Homogeneous systems always admit the trivial zero solution, which, in calibration terms, would mean output weights that are identically zero (and thus statistically unacceptable). For this reason, the possibility that function e.calibrate ends up with a false convergence of the special purpose calibration task cannot, in general, be ruled out. Note that functions e.calibrate and check.cal are able to detect such false convergence events and warn the user about it.

A satisfactory countermeasure to this issue is to set calibration bounds to any interval that does not include zero (see argument bounds of e.calibrate).
A second, very attractive alternative would be to run the special purpose calibration task jointly with some ordinary calibration task, as the joint calibration constraints would then define a non-homogeneous system. This solution can be obtained straightforwardly using function pop.fuse.

Author(s)

Diego Zardetto

References

Lesage, E. (2011). “The use of estimating equations to perform a calibration on complex parameters”. Survey Methodology. 37.

Examples

# Load sbs data:
data(sbs)
# Create a design object:
sbsdes <- e.svydesign(data = sbs, ids = ~id, strata = ~strata,
                      weights = ~weight, fpc = ~fpc)

#############################################################################
# Calibrate on the regression coefficients of a *single model* known at the #
# *overall population level*                                                #
#############################################################################
# Suppose you know the coefficients of the following linear model, which you
# obtained fitting the model on some external source (e.g. Census or register
# data):
model0 <- y ~ emp.num*emp.cl

# Here, use the sbs sampling frame available in ReGenesees to simulate the
# external source and compute the values of the regression coefficients:
B0 <- coef(lm(model0, data = sbs.frame))
B0

# Now, prepare the survey design for calibration:
sbsdes0 <- prep.calBeta(design = sbsdes, model = model0, Beta = B0)

# Have a look at the freshly created *synthetic* auxiliary variables: 
head(sbsdes0$variables)

# Then, prepare the control totals dataframe for the calibration task:
pop0 <- pop.calBeta(sbsdes0)

# Have a look...
class(pop0)
pop.desc(pop0)
# ...and note that all the control totals are zero!
pop0

# Lastly, calibrate:
sbscal0 <- e.calibrate(sbsdes0, pop0)

# Check that calibration estimates of regression coefficients now match the
# input B0 values derived from the external source:
svystatB(sbscal0, model0)
B0
# OK


#########################################################################
# Calibrate simultaneously on the regression coefficients of *different #
# models* known at the *overall population level*                       #
#########################################################################
# Suppose you know the coefficients of the following linear models, which you
# obtained fitting the models on some external sources
model1 <- va.imp2 ~ emp.num:emp.cl + nace.macro - 1
model2 <- y ~ dom3 - 1

# Here, use the sbs sampling frame available in ReGenesees to simulate the
# external sources and compute the values of the regression coefficients:
B1 <- coef(lm(model1, data = sbs.frame))
B2 <- coef(lm(model2, data = sbs.frame))

  ## First, just for illustration, calibrate only on B1 regression coefficients:
  sbsdes1 <- prep.calBeta(sbsdes, model = model1, Beta = B1)
  pop1 <- pop.calBeta(sbsdes1)
  sbscal1 <- e.calibrate(sbsdes1, pop1)

  # Check that calibration estimates of regression coefficients now match the
  # input B1 values derived from the external source...
  svystatB(sbscal1, model1)
  B1
  # ...but, of course, do *not* match those of B2:
  svystatB(sbscal1, model2)
  B2
  # OK

  ## Second, just for illustration, calibrate only on B2 regression coefficients:
  sbsdes2 <- prep.calBeta(sbsdes, model = model2, Beta = B2)
  pop2 <- pop.calBeta(sbsdes2)
  sbscal2 <- e.calibrate(sbsdes2, pop2)

  # Check that calibration estimates of regression coefficients now match the
  # input B2 values derived from the external source...
  svystatB(sbscal2, model2)
  B2
  # ...but, of course, do *not* match those of B1:
  svystatB(sbscal2, model1)
  B1
  # OK

## Now, calibrate *simultaneously* on *B1 and B2* regression coefficients
# Prepare the survey design for the joint calibration task:
sbsdes1_2 <- prep.calBeta(sbsdes, model = list(model1, model2), Beta = list(B1, B2))

# Prepare the control totals dataframe for the joint calibration task:
pop1_2 <- pop.calBeta(sbsdes1_2)

# Have a look to the control totals (note the presence of two models and
# Beta vectors):
pop.desc(pop1_2)
pop1_2

# Lastly, run the calibration:
sbscal1_2 <- e.calibrate(sbsdes1_2, pop1_2)

# Check that calibration estimates of regression coefficients now match *both*
# the B1 and B2 values derived from the external sources:
svystatB(sbscal1_2, model1)
B1
svystatB(sbscal1_2, model2)
B2
# OK


###############################################################################
# Calibrate simultaneously on the regression coefficients of a *single model* #
# known for *different subpopulations*                                        #
###############################################################################
# NOTE: In this case, both *global* and *partitioned* calibration tasks are
#       possible, and both will be illustrated below.

# Suppose you know the coefficients of the following linear model, which you
# obtained fitting the model on some external source (e.g. Census or register
# data) *within subpopulations* defined by some factor variable(s):
model <- va.imp2 ~ emp.num:emp.cl + nace.macro - 1

# Here, use the sbs sampling frame available in ReGenesees to simulate the
# external source and suppose the subpopulations are defined by variable 'dom3'.
# Thus, compute the values of the regression coefficients as follows:
B <- by(sbs.frame, sbs.frame$dom3, function(df) coef(lm(model, data = df)))
B

## Let's start with the *global* solution
# Prepare the survey design for the calibration task (note that the 'by'
# argument is used):
sbsdes.g <- prep.calBeta(sbsdes, model, Beta = B, by = ~dom3)

# Prepare the control totals for the calibration task:
pop.g <- pop.calBeta(sbsdes.g)

# Have a look to the control totals (note the presence of one Beta vector *for
# each domain*):
pop.desc(pop.g)
pop.g

# Run the calibration:
sbscal.g <- e.calibrate(sbsdes.g, pop.g)

# Check that calibration estimates of regression coefficients now match the
# input B values derived from the external source *for each domain*:
svystatB(sbscal.g, model, ~dom3)
B
# OK

## Let's proceed with the *partitioned* solution
# Prepare the survey design for the calibration task (note that 'by' and
# 'partition' arguments are used):
sbsdes.p <- prep.calBeta(sbsdes, model, Beta = B, by = ~dom3, partition = TRUE)

# Prepare the control totals for the calibration task:
pop.p <- pop.calBeta(sbsdes.p)

# Have a look to the control totals (note the presence of one Beta vector *for
# each domain*):
pop.desc(pop.p)
pop.p

# Run the calibration:
sbscal.p <- e.calibrate(sbsdes.p, pop.p)

# Check that calibration estimates of regression coefficients now match the
# input B values derived from the external source *for each domain*:
svystatB(sbscal.p, model, ~dom3)
B
# OK

## Lastly, check that calibration weights obtained using the *global* and 
## *partitioned* solution are the same:
g.range(sbscal.g)
g.range(sbscal.p)
all.equal(weights(sbscal.g), weights(sbscal.p))
# OK


###########################################################
# BONUS TIP: Calibration on the mean (or on domain means) #
#            of one variable or multiple variables.       #
###########################################################
# Since the domain mean of a numeric variable can be thought as a
# regression coefficient (see the 'Examples' section of ?svystatB),
# you can use the ReGenesees facilities documented above to
# *calibrate on the mean (or on domain means)* of *one variable
# or multiple variables*.
# NOTE: The examples below cover the following cases of calibration on:
#       (i)   The overall mean of a single variable.
#       (ii)  The means of a single variable within domains of just
#             one type.
#       (iii) The domain means of several variables, with multiple
#             and different domain types.

# Load artificial household survey data and define a survey design:
data(data.examples)
exdes <- e.svydesign(data = example, ids = ~towcod + famcod,
                     strata = ~SUPERSTRATUM, weights = ~weight)
exdes <- des.addvars(exdes, ones = 1)

## CASE (i): Calibrate on the overall mean of a single variable
# Suppose you know with satisfactory accuracy the *average income* of your
# target population (but you do *not* have reliable information on the
# *total income*, nor on the *total number of individuals*):
income.AVG <- 1270

# You can calibrate on *average income* as follows:
exdes.new <- prep.calBeta(exdes, income ~ 1, Beta = income.AVG)
pop.new <- pop.calBeta(exdes.new)
excal.new <- e.calibrate(exdes.new, pop.new)

# Now, check that calibration estimate of average income now match the known
# value derived from the external source *without residual uncertainty*:
svystatTM(excal.new, ~income, estimator = "Mean")
income.AVG

# ...while there is *residual uncertainty* in the estimates of the numerator and
# denominator totals:
svystatTM(excal.new, ~income + ones, estimator = "Total")
# OK

## CASE (ii): Calibrate on the means of a single variable within domains
#             of just one kind
# You can calibrate on *domain means* along the lines illustrated above (note,
# however, that argument 'by' would provide an alternative way to achieve
# the same result).
# Suppose you know with satisfactory accuracy the *average income* by the
# crossclassification of sex and marital status:
income.AVG.sex.marstat <- c(f.married   = 1310, m.married   = 1260,
                            f.unmarried = 1150, m.unmarried = 1200,
                            f.widowed   = 1380, m.widowed   = 1300)

# Run the calibration on *average income by sex:marstat* as follows:
exdes.new <- prep.calBeta(exdes, income ~ sex:marstat -1,
                          Beta = income.AVG.sex.marstat)
pop.new <- pop.calBeta(exdes.new)
excal.new <- e.calibrate(exdes.new, pop.new)

# Now, check that calibration estimates of average income by domains now match
# the known values derived from the external source *without residual
# uncertainty*:
svystatTM(excal.new, ~income, ~sex:marstat, estimator = "Mean")
income.AVG.sex.marstat

# ...while there is *residual uncertainty* in the estimates of the numerator and
# denominator totals:
svystatTM(excal.new, ~income + ones, ~sex:marstat, estimator = "Total")
# OK

## CASE (iii): Calibrate on the domain means of several variables, with multiple
#              and different domain types.
# Suppose you know with satisfactory accuracy:
# - the average income by sex:
# - the average income by marstat:
# - the average of variable z by age (variable 'age5c', 5 classes):
income.AVG.sex <- c("f" = 1245, "m" = 1250)
income.AVG.marstat <- c("married" = 1260, "unmarried" = 1230, "widowed" = 1290)
z.AVG.age5c <- c("1" = 125, "2" = 130, "3" = 135, "4" = 125, "5" = 140)

# Run the calibration as follows:
exdes.new <- prep.calBeta(exdes, model = list(income ~ sex -1,
                                              income ~ marstat -1,
                                              z ~ age5c -1),
                                 Beta = list(income.AVG.sex,
                                             income.AVG.marstat,
                                             z.AVG.age5c)
                          )
pop.new <- pop.calBeta(exdes.new)
excal.new <- e.calibrate(exdes.new, pop.new)

# Now, check that calibration estimates match the known domain means derived
# from the external source:
svystatTM(excal.new, ~income, ~sex, estimator = "Mean")
income.AVG.sex
svystatTM(excal.new, ~income, ~marstat, estimator = "Mean")
income.AVG.marstat
svystatTM(excal.new, ~z, ~age5c, estimator = "Mean")
z.AVG.age5c
# OK

DiegoZardetto/ReGenesees documentation built on Dec. 16, 2024, 2:03 p.m.