get.linvar | R Documentation |
Computes the linearized variable(s) of a Complex Estimator in subpopulations (domains). The Complex Estimator can be any analytic function of Horvitz-Thompson or Calibration estimators.
get.linvar(design, expr, by = NULL, stack = FALSE, na.rm = FALSE)
design |
Object of class |
expr |
R expression defining the Complex Estimator (see |
by |
Formula specifying the variables that define the "estimation domains". If |
stack |
If |
na.rm |
Allow missing values in the variables entering the estimator expression? The default is
|
This function has been designed mainly for programmers or advanced users willing to build upon ReGenesees: typical users are not expected to feel much need of it.
The ReGenesees package adopts the Taylor-series linearization technique to estimate the sampling variance of nonlinear (smooth) estimators. Using this technique, it can handle any Complex Estimator that can be expressed as an analytic function of Horvitz-Thompson or Calibration estimators. This includes even user-defined estimators, through function svystatL
[Zardetto, 15].
In the Taylor-series linearization approach, estimating the sampling variance of a Complex Estimator amounts to estimating the sampling variance of the Horvitz-Thompson estimator of the total of the linearized variable of the Complex Estimator under the sampling design at hand. Function get.linvar
computes that linearized variable. If one is interested in estimating the sampling variance of a Complex Estimator by domains (i.e. for subpopulations), then - owing to the Taylor linearization technique - a different linearized variable has to be calculated for each domain. Therefore, in the most general case, the output of function get.linvar
will be a matrix
of linearized variables, with one column per domain.
If the input object design
is calibrated, then the linearized variable produced by function get.linvar
correctly takes into account the theoretical properties of Calibration estimators (see, e.g., [Deville, 1999]). In particular, get.linvar
computes the needed g-weighted residuals (see, e.g., equations (8) and (9) of [Zardetto, 15]) by leveraging function get.residuals
.
The mandatory argument expr
, which identifies the Complex Estimator, must be an object of class expression
. You can specify just a single Complex Estimator at a time, i.e. length(expr)
must be equal to 1
. Any analytic function of estimators of Totals and Means derived from the input design object design
is allowed.
The basic syntax follows the same rules described for function svystatL
. Inside expr
, the estimator of the Total of a variable is simply represented by the name of the variable itself. The reserved name ones
can be used to reference an artificial variable (which will be created on-the-fly, if not already present) whose value is 1
for each sampling unit, so that its Total estimator actually estimates the size of the population in terms of elementary units. Therefore, e.g., expression y/ones
represents the estimator of the Mean of variable y
. Variables referenced inside expr
must be numeric
and belong to design
.
The optional argument by
specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL
(the default option), the estimates produced by svystatL
refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2
selects as estimation domains the subpopulations determined by crossing the modalities of variables B1
and B2
. Notice that a formula like by=~B1+B2
will be automatically translated into the factor-crossing formula by=~B1:B2
: if you need to compute estimates for domains B1
and B2
separately, you have to call svystatL
twice. The design
variables referenced by by
(if any) should be of type factor
, otherwise they will be coerced.
As already noted, in the most general case, the output of function get.linvar
will be a matrix
of linearized variables, with one column per domain. However, the linearized variable of any domain estimator that is a function of Horvitz-Thompson estimators is identically equal to zero outside the domain. Thanks to this property, and only for functions of HT estimators, the columns of the output matrix can be stacked into one single column without information loss. Users may request this convenient output format by specifying stack = TRUE
. Unfortunately, the same result cannot be obtained for functions of Calibration estimators, because the g-weighted residuals are generally non-zero outside the active estimation domain. Therefore, argument stack
does nothing if design
is calibrated.
A matrix of linearized variables.
Diego Zardetto
Deville, J. C. (1999) “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.”. Survey Methodology 25: 193-203.
Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi: https://doi.org/10.1515/jos-2015-0013.
svystatL
to calculate estimates and sampling errors of Complex Estimators, get.residuals
to compute residuals of interest variables w.r.t. the calibration model adopted to build a calibrated object.
# Load sbs data:
data(sbs)
# Create a design object to be calibrated:
sbsdes<-e.svydesign(data= sbs, ids= ~id, strata= ~strata, weights= ~weight, fpc= ~fpc)
##################################################################
# Just some checks on the consistency of the numerical results #
# obtained by ReGenesees with well known theoretical properties. #
##################################################################
## In the Taylor-series linearization approach, estimating the sampling variance of a
## Complex Estimator amounts to estimating the sampling variance of the Horvitz-Thompson
## estimator of the total of the linearized variable of the Complex Estimator under the
## sampling design at hand.
#############
## Check 1 ##
#############
# Global estimate (i.e., no domains) of a nonlinear function of HT estimators:
# - Average value added per employee
# Compute the linearized variable of the estimator:
z <- get.linvar(sbsdes, expression(va.imp2/emp.num))
# Have a look:
head(z)
# Now add that variable to the input design object:
sbsdes <- des.addvars(sbsdes, z = z)
# Now compute the estimated standard error of the HT total of the linearized variable:
svystatTM(sbsdes, ~z)
# And compare it to the estimated standard error of the complex estimator:
svystatL(sbsdes, expression(va.imp2/emp.num))
# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbsdes, num = ~va.imp2, den= ~emp.num)
# Test the equality of the results:
all.equal(svystatTM(sbsdes, ~z)$SE,
svystatL(sbsdes, expression(va.imp2/emp.num))$SE)
# OK
#############
## Check 2 ##
#############
# Domain estimates of a nonlinear function of HT estimators:
# - Average value added per employee by NACE macro-sectors
# Compute the linearized variable(s) of the domain estimator(s):
z.dom <- get.linvar(sbsdes, expression(va.imp2/emp.num), by = ~nace.macro)
# Have a look:
head(z.dom)
tail(z.dom)
# Note the staggered matrix-format of the result. Since you are dealing with HT estimators
# (namely sbsdes is not calibrated), you can opt for the more convenient (but equivalent)
# stacked format:
z.dom <- get.linvar(sbsdes, expression(va.imp2/emp.num), by = ~nace.macro, stack = TRUE)
# Have a look:
head(z.dom)
tail(z.dom)
# Now add that variable to the input design object:
sbsdes <- des.addvars(sbsdes, z.dom = z.dom)
# Now compute the estimated standard error of the HT total of the linearized variable by
# domains:
svystatTM(sbsdes, ~z.dom, by= ~nace.macro)
# And compare it to the estimated standard error of the complex estimator:
svystatL(sbsdes, expression(va.imp2/emp.num), by= ~nace.macro)
# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbsdes, num = ~va.imp2, den= ~emp.num, by= ~nace.macro)
# Test the equality of the results:
all.equal(svystatTM(sbsdes, ~z.dom, by= ~nace.macro)$SE,
svystatL(sbsdes, expression(va.imp2/emp.num), by= ~nace.macro)$SE)
# OK
###################################################################
# Now run the same checks for functions of Calibration Estimators #
###################################################################
# Build a population totals template and fill it with actual known totals computed from
# the sampling frame (sbs.frame):
pop <- pop.template(sbsdes, calmodel= ~ent:emp.cl + y:nace.macro-1, partition= ~region)
pop <- fill.template(sbs.frame, pop)
# Calibrate the weights:
sbscal <- e.calibrate(sbsdes, pop)
g.range(sbscal)
#############
## Check 3 ##
#############
# Global estimate (i.e., no domains) of a nonlinear function of CAL estimators:
# - Average value added per employee
# Compute the linearized variable of the estimator (the name gez should remind you of
# the g-weighted residuals of the HT linearized variable w.r.t. the calibration model):
gez <- get.linvar(sbscal, expression(va.imp2/emp.num))
# Have a look:
head(gez)
# Now add that variable to the initial *uncalibrated* design object:
sbsdes <- des.addvars(sbsdes, gez = gez)
# Now compute the estimated standard error of the *HT total* of the linearized variable:
svystatTM(sbsdes, ~gez)
# And compare it to the estimated standard error of the complex *calibration* estimator:
svystatL(sbscal, expression(va.imp2/emp.num))
# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbscal, num = ~va.imp2, den= ~emp.num)
# Test the equality of the results:
all.equal(svystatTM(sbsdes, ~gez)$SE,
svystatL(sbscal, expression(va.imp2/emp.num))$SE)
# OK
#############
## Check 4 ##
#############
# Domain estimates of a nonlinear function of CAL estimators:
# - Average value added per employee by NACE macro-sectors
# Compute the linearized variable(s) of the domain estimator(s):
gez.dom <- get.linvar(sbscal, expression(va.imp2/emp.num), by = ~nace.macro)
# Have a look:
head(gez.dom)
tail(gez.dom)
# As you see the matrix no longer has the staggered structure of the HT case. In fact, the
# calibration residuals are in general not 0 outside the active domain. Therefore, you
# cannot opt for the more convenient stacked format!
# Let's thus add all the column of the linearized variable(s) matrix to the initial
# *uncalibrated* design object:
sbsdes$variables <- cbind(sbsdes$variables, gez.dom)
# Now compute the estimated standard error of the *HT totals* of the *domain-specific*
# linearized variables:
svystatTM(sbsdes, ~gez_Agriculture + gez_Industry + gez_Commerce + gez_Services)
# And compare it to the estimated standard errors of the complex *calibration* estimator
# by domains:
svystatL(sbscal, expression(va.imp2/emp.num), by= ~nace.macro)
# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbscal, num = ~va.imp2, den= ~emp.num, by= ~nace.macro)
# Test the equality of the results:
all.equal(
svystatTM(sbsdes, ~gez_Agriculture + gez_Industry + gez_Commerce + gez_Services)$SE,
svystatL(sbscal, expression(va.imp2/emp.num), by= ~nace.macro)$SE)
# OK
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.