kottcalibrate: Calibration of replicate weights
In DiegoZardetto/EVER: Estimation of Variance by Efficient Replication

Description Usage Arguments Details Value Calibration process diagnostics Author(s) References See Also Examples

Adds to a kott.design object the calibrated weights columns (one for each replicate weight, plus one for the initial weights).

kottcalibrate(deskott, df.population,
        calmodel = if (inherits(df.population, "pop.totals"))
                       attr(df.population, "calmodel"),
        partition = if (inherits(df.population, "pop.totals"))
                        attr(df.population, "partition") else FALSE,
        calfun = c("linear", "raking", "logit"),
        bounds = c(-Inf, Inf), aggregate.stage = NULL, maxit = 50,
        epsilon = 1e-07, force.rep = FALSE)

`deskott`	Object of class `kott.design` containing the replicated survey data.
`df.population`	Data frame containing the known population totals for the auxiliary variables.
`calmodel`	Formula defining the linear structure of the calibration model.
`partition`	Formula specifying the variables that define the "calibration domains" for the model (see 'Details'); `FALSE` (the default) implies no calibration domains.
`calfun`	`character` specifying the distance function for the calibration process; the default is `"linear"`.
`bounds`	Allowed range for the ratios between calibrated and initial weights; the default is `c(-Inf,Inf)`.
`aggregate.stage`	An integer: if specified, causes the calibrated weights to be constant within sampling units at this stage.
`maxit`	Maximum number of iterations for the Newton-Raphson algorithm; the default is `50`.
`epsilon`	Tolerance for the relative differences between the population totals and the corresponding estimates based on the claibrated weights; the default is `10^-7`.
`force.rep`	If `TRUE`, whenever the calibration algorithm does not converge for a given set of replicate weights, forces the function to return a value (see 'Details'); the default is `FALSE`.

This function creates an object of class kott.cal.design. A kott.cal.design object is made up by the union of the (calibrated) replicated survey data and the metadata describing the sampling design. kott.cal.design objects make it possible to estimate the variance of calibration estimators [Deville, Sarndal 92] using the extended "Delete-A-Group Jackknife" method [Kott 2008].

The mandatory argument calmodel symbolically defines the calibration model you want to use, that is - in the language of the generalised regression estimator - the assisting linear regression model underlying the calibration problem [Wilkinson, Rogers 73]. More specifically, the calmodel formula identifies the auxiliary variables and the constraints for the calibration problem. For example, calmodel=~(X+Z):C+(A+B):D defines the calibration problem in which constraints are imposed: (i) on the auxiliary (quantitative) variables X and Z within the subpopulations identified by the (qualitative) classification variable C and, at the same time, (ii) on the absolute frequency of the (qualitative) variables A and B within the subpopulations identified by the (qualitative) classification variable D.
The deskott variables referenced by calmodel must be numeric or factor and must not contain any missing value (NA).

Problems for which one or more qualitative variables can be "factorised" in the formula that specifies the calibration model, are particularly interesting. These variables split the population into non-overlapping subpopulations known as "calibration domains" for the model. An example is provided by the statement calmodel=~(A+B+X+Z):D in which the variable that identifies the calibration domains is D; similarly, the formula calmodel=~(A+B+X+Z):D1:D2 identifies as calibration domains the subpopulations determined by crossing the modalities of D1 and D2. The interest in models of this kind lies in the fact that the global calibration problem they describe can, actually, be broken down into local subproblems, one per calibration domain, which can be solved separately [Vanderhoeft 01]. Thus, for example, the global problem defined by calmodel=~(A+B+X+Z):D is equivalent to the sequence of problems defined by the "reduced model" calmodel=~A+B+X+Z in each of the domains identified by the modalities of D. The opportunity to separately solve the subproblems related to different calibration domains achieves a significant reduction in computation complexity: the gain increases with increasing survey data size and (most importantly) with increasing auxiliary variables number.

The optional argument partition makes it possible to choose, in cases in which the calibration problem can be factorised, whether to solve the problem globally or iteratively (that is, separately for each calibration domain). The global solution (which is the default option) can be selected invoking the kottcalibrate function with partition=FALSE. To request the iterative solution - a strongly recommended option when dealing with a lot of auxiliary variables and big data sizes - it is necessary to specify via partition the variables defining the calibration domains for the model. If a formula is passed through the partition argument (for example: partition=~D1:D2), the program checks that calmodel actually describes a "reduced model" (for example: calmodel=~X+Z+A+B), that is it does not reference any of the partition variables; if this is not the case, the program stops and prints an error message.
The deskott variables referenced by partition (if any) must be factor and must not contain any missing value (NA).

The mandatory argument df.population is used to specify the known totals of the auxiliary variables referenced by calmodel within the subpopulations (if any) identified by partition. These known totals must be stored in a data frame whose structure (i) depends on the values of calmodel and partition and (ii) must conform to a standard. In order to facilitate understanding of and compliance with this standard, the EVER package provides the user with two functions: pop.template and population.check. The pop.template function is able to guide the user in constructing the known totals data frame for a specific calibration problem, while the population.check function allows to check whether a known totals data frame conforms to the standard required by kottcalibrate. In any case, if the df.population data frame does not comply with the standard, the kottcalibrate function stops and prints an error message: the meaning of the message should help the user diagnose the cause of the problem.

The calfun argument identifies the distance function to be used in the calibration process. Three built-in functions are provided: "linear", "raking", and "logit". The default is "linear", which corresponds to the euclidean metric.

The bounds argument allows to add "range constraints" to the calibration problem. To be precise, the interval defined by bounds will contain the values of the ratios between final (calibrated) and initial (direct) weights. The default value is c(-Inf,Inf), i.e. no range constraints are imposed. These constraints are optional unless the "logit" function is selected: in the latter case the range defined by bounds has to be finite.

The value passed by the aggregate.stage argument must be an integer between 1 and the number of sampling stages of deskott. If specified, causes the calibrated weights to be constant within sampling units selected at the aggregate.stage stage (actually this is only ensured if the initial weights had already this property, as is sometimes the case in multistage cluster sampling). If not specified, the calibrated weights may differ even for sampling units with identical initial weights. The same holds if some final units belonging to the same cluster selected at the stage aggregate.stage fall in distinct calibration domains (i.e. if the domains defined by partition "cut across" the aggregate.stage-stage clusters).

The maxit argument sets the maximum number of iteration for the Newton-Raphson algorithm that is used to solve the calibration problem. The default value is 50.

The epsilon argument determines the convergence criterion for the optimisation algorithm: it fixes the maximum allowed value for the relative differences between the population totals and the corresponding estimates based on the claibrated weights. The default value is 10^-7.

If the number of replicates for deskott (the input object of class kott.design) is nrg, the function kottcalibrate is in charge of solving nrg+1 distinct calibration problems. In fact, the calibrated weights calculated by kottcalibrate must ensure that the known population totals are exactly reproduced not only by the original sample, but also by all its nrg replicates. Should this requirement fail, the DAGJK method would end up with a biased variance estimator [Kott 2008]. It is, however, possible (more likely when range constraints are imposed) that, for some of the nrg+1 distinct calibration problems and for the given values of epsilon and maxit, the solving algorithm does not converge. In this case kottcalibrate by default stops and prints an error message. On the contrary if force.rep = TRUE, provided that the failure to converge pertains only to the replicate weights, the function is forced to return the best approximation achieved for the corresponding calibrated weights. When this occurs, DAGJK standard errors estimates built on the object returned by kottcalibrate will be biased.

An object of class kott.cal.design. The data frame it contains includes (in addition to the data already stored in deskott) the calibrated weights columns (one for each replicate weight, plus one for the initial weights, nrg+1 in all). The names of these columns are obtained by pasting the name of the initial weights column with the string ".cal" and the indices NULL, 1, 2, ..., nrg.
The kott.cal.design class is a specialisation of the kott.design class; this means that an object created by kottcalibrate inherits from the data.frame class and you can use on it every method defined on that class.

If the number of replicates for deskott is nrg, the function kottcalibrate is in charge of solving nrg+1 distinct calibration problems. When, dealing with a factorisable calibration problem, the user selects the iterative solution, each one of the above mentioned problems is split into as many sub-problems as the number of subpopulations defined by partition. A calibration process with such a complex structure needs some ad hoc tool for error diagnostics. For this purpose, every call to kottcalibrate creates, by side effect, a dedicated data structure named kottcal.status into the .GlobalEnv. kottcal.status is a list with two components: the first, "call", identifies the call to kottcalibrate that generated the list, the second, return.code, is a matrix each element of which identifies the return code of a specific calibration sub-problem. The meaning of the return codes is as follows:

-1: not yet tackled sub-problem;
0: solved sub-problem (convergence achieved);
1: unsolved sub-problem (no convergence): output forced.

Recall that the latter return code may only occur if force.rep = TRUE.
In case of error, users can exploit kottcal.status to identify the sub-problem from which the error stemmed, hence taking a step forward to eliminate it.

Diego Zardetto

Deville, J.C., Sarndal, C.E. (1992) "Calibration Estimators in Survey Sampling", Journal of the American Statistical Association, Vol. 87, No. 418, pp.376-382.

Kott, Phillip S. (2008) "Building a Better Delete-a-Group Jackknife for a Calibration Estimator", NASS Research Report, NASS: Washington, DC.

Wilkinson, G.N., Rogers, C.E. (1973) "Symbolic Description of Factorial Models for Analysis of Variance", Journal of the Royal Statistical Society, series C (Applied Statistics), Vol. 22, pp. 181-191.

Vanderhoeft, C. (2001) "Generalized Calibration at Statistic Belgium", Statistics Belgium Working Paper n. 3, http://www.statbel.fgov.be/studies/paper03_en.asp.

Lumley, T. (2006) "survey: analysis of complex survey samples", http://cran.at.r-project.org/web/packages/survey/index.html.

Scannapieco, M., Zardetto, D., Barcaroli, G. (2007) "La Calibrazione dei Dati con R: una Sperimentazione sull'Indagine Forze di Lavoro ed un Confronto con GENESEES/SAS", Contributi Istat n. 4., https://www.istat.it/it/files//2018/07/2007_4.pdf.

desc for a concise description of kott.design objects, kottby, kott.ratio, kott.regcoef, kott.quantile and kottby.user for calculating estimates and standard errors, pop.template for constructing known totals data frames in compliance with the standard required by kottcalibrate, population.check to check that the known totals data frame satisfies that standard, bounds.hint to obtain an hint for range restricted calibration.

# Calibration of a kott.design object according to different calibration
# models (the known totals data frames pop01, \ldots, pop05p and the bounds
# vector are contained in the data.examples file).
# For the examples relating to calibration models that can be factorised 
# both a global and an iterative solution are given.

data(data.examples)

# Creation of the object to be calibrated:
kdes<-kottdesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
      weights=~weight,nrg=15)


# 1) Calibration on the total number of units in the population
#   (totals in pop01):
kdescal01<-kottcalibrate(deskott=kdes,df.population=pop01,calmodel=~1,
           calfun="logit",bounds=bounds,aggregate.stage=2)

# Checking the result (the function 'ones' is contained
# in the data.examples file):
kottby.user(kdescal01,user.estimator=ones)


# 2) Calibration on the marginal distributions of sex and marstat
#    (totals in pop02):
kdescal02<-kottcalibrate(deskott=kdes,df.population=pop02,
           calmodel=~sex+marstat-1,calfun="logit",bounds=bounds,
           aggregate.stage=2)

# Checking the result:
kottby(kdescal02,~sex+marstat)


# 3) Calibration (global solution) on the joint distribution of sex
#    and marstat (totals in pop03):
kdescal03<-kottcalibrate(deskott=kdes,df.population=pop03,
           calmodel=~marstat:sex-1,calfun="logit",bounds=bounds)

# Checking the result:
kottby(kdescal03,~sex,~marstat) # or: kottby(kdescal03,~marstat,~sex)

# which, obviously, is not respected by kdescal02 (notice the size of SE):
kottby(kdescal02,~sex,~marstat)


# 3.1) Again a calibration on the joint distribution of sex and marstat
#      but, this time, with the iterative solution (partition=~sex,
#      totals in pop03p):
kdescal03p<-kottcalibrate(deskott=kdes,df.population=pop03p,
            calmodel=~marstat-1,partition=~sex,calfun="logit",
            bounds=bounds)

# Checking the result:
kottby(kdescal03p,~sex,~marstat)


# 4) Calibration (global solution) on the totals for the quantitative
#    variables x1, x2 and x3 in the subpopulations defined by the
#    regcod variable (totals in pop04):
kdescal04<-kottcalibrate(deskott=kdes,df.population=pop04,
           calmodel=~(x1+x2+x3-1):regcod,calfun="logit",
           bounds=bounds,aggregate.stage=2)

# Checking the result:
kottby(kdescal04,~x1+x2+x3,~regcod)


# 4.1) Same problem with the iterative solution (partition=~regcod,
#      totals in pop04p):
kdescal04p<-kottcalibrate(deskott=kdes,df.population=pop04p,
            calmodel=~x1+x2+x3-1,partition=~regcod,calfun="logit",
            bounds=bounds,aggregate.stage=2)

# Checking the result:
kottby(kdescal04p,~x1+x2+x3,~regcod)


# 5) Calibration (global solution) on the total for the quantitative
#    variable x1 and on the marginal distribution of the qualitative
#    variable age5c, in the subpopulations defined by crossing sex
#    and marstat (totals in pop05):
kdescal05<-kottcalibrate(deskott=kdes,df.population=pop05,
           calmodel=~(age5c+x1-1):sex:marstat,calfun="logit",
           bounds=bounds,force.rep=TRUE)

# Calibration process diagnostics:
kottcal.status

# Checking the result:
kottby(kdescal05,~age5c+x1,~sex:marstat)


# 5.1) Same problem with the iterative solution (partition=~sex:marstat,
#      totals in pop05p):
kdescal05p<-kottcalibrate(deskott=kdes,df.population=pop05p,
            calmodel=~age5c+x1-1,partition=~sex:marstat,
            calfun="logit",bounds=bounds,force.rep=TRUE)

# Calibration process diagnostics:
kottcal.status

# Checking the result:
kottby(kdescal05p,~age5c+x1,~sex:marstat)


# Notice that 3.1 e 5.1) do not impose the aggregate.stage=2
# condition. This condition cannot, in fact, be fulfilled because
# in both cases the domains defined by partition "cut across"
# the kdes second stage clusters (households). To compare the results,
# the same choice was also made for 3) e 5).