# kottcalibrate: Calibration of replicate weights In DiegoZardetto/EVER: Estimation of Variance by Efficient Replication

## Description

Adds to a `kott.design` object the calibrated weights columns (one for each replicate weight, plus one for the initial weights).

## Usage

```
kottcalibrate(deskott, df.population,
              calmodel = if (inherits(df.population, "pop.totals"))
                             attr(df.population, "calmodel"),
              partition = if (inherits(df.population, "pop.totals"))
                              attr(df.population, "partition") else FALSE,
              calfun = c("linear", "raking", "logit"),
              bounds = c(-Inf, Inf), aggregate.stage = NULL,
              maxit = 50, epsilon = 1e-07, force.rep = FALSE)
```

## Arguments

`deskott`

Object of class `kott.design` containing the replicated survey data.

`df.population`

Data frame containing the known population totals for the auxiliary variables.

`calmodel`

Formula defining the linear structure of the calibration model.

`partition`

Formula specifying the variables that define the "calibration domains" for the model (see 'Details'); `FALSE` (the default) implies no calibration domains.

`calfun`

`character` specifying the distance function for the calibration process; the default is `"linear"`.

`bounds`

Allowed range for the ratios between calibrated and initial weights; the default is `c(-Inf,Inf)`.

`aggregate.stage`

An integer: if specified, causes the calibrated weights to be constant within sampling units at this stage.

`maxit`

Maximum number of iterations for the Newton-Raphson algorithm; the default is `50`.

`epsilon`

Tolerance for the relative differences between the population totals and the corresponding estimates based on the calibrated weights; the default is `10^-7`.

`force.rep`

If `TRUE`, whenever the calibration algorithm does not converge for a given set of replicate weights, forces the function to return a value (see 'Details'); the default is `FALSE`.

## Details

This function creates an object of class `kott.cal.design`. A `kott.cal.design` object consists of the (calibrated) replicated survey data together with the metadata describing the sampling design. `kott.cal.design` objects make it possible to estimate the variance of calibration estimators [Deville, Sarndal 92] using the extended "Delete-A-Group Jackknife" (DAGJK) method [Kott 2008].

The mandatory argument `calmodel` symbolically defines the calibration model you want to use, that is - in the language of the generalised regression estimator - the assisting linear regression model underlying the calibration problem [Wilkinson, Rogers 73]. More specifically, the `calmodel` formula identifies the auxiliary variables and the constraints for the calibration problem. For example, `calmodel=~(X+Z):C+(A+B):D` defines the calibration problem in which constraints are imposed: (i) on the auxiliary (quantitative) variables `X` and `Z` within the subpopulations identified by the (qualitative) classification variable `C` and, at the same time, (ii) on the absolute frequency of the (qualitative) variables `A` and `B` within the subpopulations identified by the (qualitative) classification variable `D`.
The `deskott` variables referenced by `calmodel` must be `numeric` or `factor` and must not contain any missing value (`NA`).
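To see how a calibration formula translates into constraints, the following base-R sketch expands a small formula with `model.matrix`. The data frame `toy` and its columns are hypothetical, and whether `kottcalibrate` builds its constraint matrix in exactly this way is an assumption; the point is only that each column of the expanded matrix corresponds to one known total.

```r
# Hypothetical toy data: one quantitative variable x and one classification
# variable d (names are illustrative, not part of EVER).
toy <- data.frame(
  x = c(2, 4, 6, 8),
  d = factor(c("a", "a", "b", "b"))
)

# Analogue of calmodel = ~x:d - 1: one constraint on the total of x
# within each level of d.
mm <- model.matrix(~ x:d - 1, data = toy)
colnames(mm)

# With unit weights, the column sums are the totals of x by domain.
totals_by_domain <- colSums(mm)
totals_by_domain
```

A known totals data frame for this model would need one total per column of `mm`.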

Problems in which one or more qualitative variables can be "factorised" in the formula that specifies the calibration model are particularly interesting. These variables split the population into non-overlapping subpopulations known as "calibration domains" for the model. An example is provided by the statement `calmodel=~(A+B+X+Z):D`, in which the variable that identifies the calibration domains is `D`; similarly, the formula `calmodel=~(A+B+X+Z):D1:D2` identifies as calibration domains the subpopulations determined by crossing the modalities of `D1` and `D2`. The interest in models of this kind lies in the fact that the global calibration problem they describe can be broken down into local subproblems, one per calibration domain, which can be solved separately [Vanderhoeft 01]. Thus, for example, the global problem defined by `calmodel=~(A+B+X+Z):D` is equivalent to the sequence of problems defined by the "reduced model" `calmodel=~A+B+X+Z` in each of the domains identified by the modalities of `D`. Solving the subproblems for the different calibration domains separately significantly reduces computational complexity: the gain grows with the size of the survey data and, most importantly, with the number of auxiliary variables.
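The equivalence between the global problem and the per-domain subproblems can be checked numerically for the linear distance, where calibration has a closed form. The sketch below is a generic textbook computation in base R, not EVER's code; the helper `linear_cal` and all data are hypothetical.

```r
# Closed-form linear calibration: w = d * (1 + X %*% lambda), where lambda
# solves (X' D X) lambda = totals - X' d. Generic form, not EVER internals.
linear_cal <- function(X, d, totals) {
  lambda <- solve(crossprod(X, d * X), totals - drop(crossprod(X, d)))
  d * (1 + drop(X %*% lambda))
}

# Hypothetical toy sample: auxiliary x, domain factor dom, initial weights d.
x   <- c(1, 2, 3, 4, 5, 6)
dom <- factor(c("a", "a", "a", "b", "b", "b"))
d   <- rep(10, 6)

# Known totals per domain: (number of units, total of x).
tot_a <- c(33, 70)
tot_b <- c(28, 140)

# Global solution: every column is non-zero in one domain only, which is
# the block structure produced by a model of the form ~(1 + x):dom.
ia <- as.numeric(dom == "a")
ib <- as.numeric(dom == "b")
Xg <- cbind(ia, ib, x * ia, x * ib)
w_global <- linear_cal(Xg, d, c(tot_a[1], tot_b[1], tot_a[2], tot_b[2]))

# Iterative solution: reduced model ~x solved separately in each domain.
w_split <- numeric(length(d))
for (lev in levels(dom)) {
  idx <- dom == lev
  Xl  <- cbind(1, x[idx])
  w_split[idx] <- linear_cal(Xl, d[idx], if (lev == "a") tot_a else tot_b)
}

all.equal(w_global, w_split)
```

The global solve inverts one 4 x 4 matrix, while the split version inverts two 2 x 2 matrices; with many domains and auxiliary variables this difference is where the savings come from.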

The optional argument `partition` makes it possible to choose, when the calibration problem can be factorised, whether to solve the problem globally or iteratively (that is, separately for each calibration domain). The global solution (which is the default option) can be selected by invoking the `kottcalibrate` function with `partition=FALSE`. To request the iterative solution - a strongly recommended option when dealing with many auxiliary variables and large datasets - it is necessary to specify via `partition` the variables defining the calibration domains for the model. If a formula is passed through the `partition` argument (for example: `partition=~D1:D2`), the program checks that `calmodel` actually describes a "reduced model" (for example: `calmodel=~X+Z+A+B`), that is, that it does not reference any of the partition variables; if this is not the case, the program stops and prints an error message.
The `deskott` variables referenced by `partition` (if any) must be `factor` and must not contain any missing value (`NA`).
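The "reduced model" condition can be tested before calling `kottcalibrate`. The helper below is a hypothetical base-R sketch of such a check, not an EVER function, and it only compares variable names; the package's own validation may be stricter.

```r
# Hypothetical helper (not part of EVER): TRUE when calmodel references
# none of the variables used in partition.
is_reduced_model <- function(calmodel, partition) {
  length(intersect(all.vars(calmodel), all.vars(partition))) == 0
}

is_reduced_model(~ X + Z + A + B, ~ D1:D2)        # TRUE: reduced model
is_reduced_model(~ (X + Z + A + B):D1, ~ D1:D2)   # FALSE: references D1
```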

The mandatory argument `df.population` is used to specify the known totals of the auxiliary variables referenced by `calmodel` within the subpopulations (if any) identified by `partition`. These known totals must be stored in a data frame whose structure (i) depends on the values of `calmodel` and `partition` and (ii) must conform to a standard. To make this standard easier to understand and comply with, the EVER package provides two functions: `pop.template` and `population.check`. The `pop.template` function guides the user in constructing the known totals data frame for a specific calibration problem, while the `population.check` function checks whether a known totals data frame conforms to the standard required by `kottcalibrate`. In any case, if the `df.population` data frame does not comply with the standard, the `kottcalibrate` function stops and prints an error message intended to help the user diagnose the cause of the problem.

The `calfun` argument identifies the distance function to be used in the calibration process. Three built-in functions are provided: `"linear"`, `"raking"`, and `"logit"`. The default is `"linear"`, which corresponds to the Euclidean metric.

The `bounds` argument makes it possible to add "range constraints" to the calibration problem. To be precise, the interval defined by `bounds` will contain the values of the ratios between final (calibrated) and initial (direct) weights. The default value is `c(-Inf,Inf)`, i.e. no range constraints are imposed. These constraints are optional unless the `"logit"` function is selected: in the latter case the range defined by `bounds` has to be finite.
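The role of `calfun` and `bounds` is easier to see through the ratio between calibrated and initial weights that each distance function implies. The sketch below uses the textbook parametrisation from Deville and Sarndal (1992); that EVER uses exactly these forms internally is an assumption, and the function names are hypothetical.

```r
# Weight-ratio functions F(u) = calibrated weight / initial weight.
# Textbook forms; EVER's internal parametrisation may differ (assumption).
F_linear <- function(u) 1 + u
F_raking <- function(u) exp(u)
F_logit  <- function(u, bounds) {
  L <- bounds[1]
  U <- bounds[2]
  # The logit form needs a finite interval containing 1.
  stopifnot(is.finite(L), is.finite(U), L < 1, U > 1)
  A <- (U - L) / ((1 - L) * (U - 1))
  (L * (U - 1) + U * (1 - L) * exp(A * u)) / ((U - 1) + (1 - L) * exp(A * u))
}

u <- seq(-5, 5, by = 0.5)
range(F_linear(u))                      # unbounded, can become negative
range(F_raking(u))                      # always positive, no upper limit
range(F_logit(u, bounds = c(0.5, 2)))   # stays inside the interval (0.5, 2)
```

All three functions equal 1 at `u = 0`, so the calibrated weights coincide with the initial ones when the direct estimates already match the known totals.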

The value passed via the `aggregate.stage` argument must be an integer between 1 and the number of sampling stages of `deskott`. If specified, it causes the calibrated weights to be constant within sampling units selected at the `aggregate.stage` stage (in fact, this is only ensured if the initial weights already had this property, as is sometimes the case in multistage cluster sampling). If not specified, the calibrated weights may differ even for sampling units with identical initial weights. The same holds if some final units belonging to the same cluster selected at the stage `aggregate.stage` fall in distinct calibration domains (i.e. if the domains defined by `partition` "cut across" the `aggregate.stage`-stage clusters).
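Whether a set of weights is constant within clusters can be verified with a few lines of base R. The helper and the toy vectors below are hypothetical and are not tied to the internal layout of a `kott.cal.design` object.

```r
# Hypothetical helper: TRUE if the weights are constant within every cluster.
# Exact comparison is used here; real weights may need a numeric tolerance.
constant_within <- function(w, cluster) {
  all(tapply(w, cluster, function(v) length(unique(v)) == 1))
}

household <- c("h1", "h1", "h2", "h2", "h2")   # illustrative cluster ids

constant_within(c(12, 12, 9.5, 9.5, 9.5), household)   # TRUE
constant_within(c(12, 11, 9.5, 9.5, 9.5), household)   # FALSE
```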

The `maxit` argument sets the maximum number of iterations for the Newton-Raphson algorithm that is used to solve the calibration problem. The default value is `50`.

The `epsilon` argument determines the convergence criterion for the optimisation algorithm: it fixes the maximum allowed value for the relative differences between the population totals and the corresponding estimates based on the calibrated weights. The default value is `10^-7`.
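How `maxit` and `epsilon` interact can be illustrated with a minimal Newton-Raphson loop for the raking distance. This is a generic base-R sketch, not EVER's implementation: the function `calibrate_raking` is hypothetical, and the normalisation of the relative difference by `1 + abs(totals)` is an assumption.

```r
# Minimal Newton-Raphson sketch for raking calibration (not EVER code).
# X: n x p matrix of auxiliary variables, d: initial weights,
# totals: known population totals for the columns of X.
calibrate_raking <- function(X, d, totals, maxit = 50, epsilon = 1e-7) {
  rel_misfit <- function(w) {
    # Assumed normalisation of the relative difference.
    max(abs(totals - drop(crossprod(X, w))) / (1 + abs(totals)))
  }
  lambda <- rep(0, ncol(X))
  for (iter in seq_len(maxit)) {
    w <- d * exp(drop(X %*% lambda))
    if (rel_misfit(w) < epsilon) {
      return(list(weights = w, converged = TRUE, iterations = iter - 1))
    }
    # Newton step: the Jacobian of the estimated totals is X' diag(w) X.
    misfit <- totals - drop(crossprod(X, w))
    lambda <- lambda + solve(crossprod(X, w * X), misfit)
  }
  w <- d * exp(drop(X %*% lambda))
  list(weights = w, converged = rel_misfit(w) < epsilon, iterations = maxit)
}

# Toy problem: calibrate on the population size and on the total of x.
X   <- cbind(1, 1:6)
d   <- rep(10, 6)
res <- calibrate_raking(X, d, totals = c(66, 240))
res$converged
drop(crossprod(X, res$weights))   # reproduces the totals up to epsilon
```

Lowering `maxit` or tightening `epsilon` makes a non-converged result (`converged = FALSE`) more likely, which is the situation that `force.rep` addresses for replicate weights.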

If the number of replicates for `deskott` (the input object of class `kott.design`) is `nrg`, the function `kottcalibrate` has to solve `nrg+1` distinct calibration problems. In fact, the calibrated weights calculated by `kottcalibrate` must ensure that the known population totals are exactly reproduced not only by the original sample, but also by all its `nrg` replicates. Should this requirement fail, the DAGJK method would end up with a biased variance estimator [Kott 2008]. It is, however, possible (more likely when range constraints are imposed) that, for some of the `nrg+1` distinct calibration problems and for the given values of `epsilon` and `maxit`, the solving algorithm does not converge. In this case `kottcalibrate` by default stops and prints an error message. Conversely, if `force.rep = TRUE` and the failure to converge pertains only to the replicate weights, the function is forced to return the best approximation achieved for the corresponding calibrated weights. When this occurs, DAGJK standard error estimates built on the object returned by `kottcalibrate` will be biased.

## Value

An object of class `kott.cal.design`. The data frame it contains includes (in addition to the data already stored in `deskott`) the calibrated weights columns (one for each replicate weight, plus one for the initial weights, `nrg+1` in all). The names of these columns are obtained by pasting the name of the initial weights column with the string `".cal"` and the indices `NULL, 1, 2, ..., nrg`.
The `kott.cal.design` class is a specialisation of the `kott.design` class; this means that an object created by `kottcalibrate` inherits from the `data.frame` class and you can use on it every method defined on that class.
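As a reading of the naming rule above, the calibrated weights column names can be generated as follows. The weights column name `"weight"` and the value of `nrg` are illustrative; the `NULL` index is taken to mean "no suffix".

```r
nrg <- 3                     # illustrative number of replicates
weights_name <- "weight"     # hypothetical name of the initial weights column

# The NULL index contributes no suffix, so the first name is "<weights>.cal".
cal_names <- paste0(weights_name, ".cal", c("", seq_len(nrg)))
cal_names
# "weight.cal"  "weight.cal1" "weight.cal2" "weight.cal3"
```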

## Calibration process diagnostics

If the number of replicates for `deskott` is `nrg`, the function `kottcalibrate` has to solve `nrg+1` distinct calibration problems. When the user, dealing with a factorisable calibration problem, selects the iterative solution, each of these problems is split into as many sub-problems as there are subpopulations defined by `partition`. A calibration process with such a complex structure needs an ad hoc tool for error diagnostics. For this purpose, every call to `kottcalibrate` creates, as a side effect, a dedicated data structure named `kottcal.status` in the `.GlobalEnv`. `kottcal.status` is a list with two components: the first, `call`, identifies the call to `kottcalibrate` that generated the list; the second, `return.code`, is a matrix each element of which identifies the return code of a specific calibration sub-problem. The meaning of the return codes is as follows:

`-1`

not yet tackled sub-problem;

`0`

solved sub-problem (convergence achieved);

`1`

unsolved sub-problem (no convergence): output forced.

Recall that the latter return code may only occur if `force.rep = TRUE`.
In case of error, users can exploit `kottcal.status` to identify the sub-problem from which the error stemmed, hence taking a step forward to eliminate it.
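A quick way to locate problematic sub-problems is to search the `return.code` matrix for non-zero entries. The sketch below works on a mock list that mimics the documented structure; the orientation of the matrix (rows as calibration domains, columns as weight sets) and the dimension names are assumptions made for illustration only.

```r
# Mock of the documented structure. In a real session kottcal.status is
# created in the global environment by kottcalibrate itself.
# Assumption: rows index calibration domains, columns index weight sets.
mock_status <- list(
  call = quote(kottcalibrate(deskott = kdes, df.population = pop05p)),
  return.code = matrix(
    c(0, 0,  0,
      0, 1,  0,
      0, 0, -1),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("domain", 1:3), c("original", "rep1", "rep2"))
  )
)

rc <- mock_status$return.code
forced  <- which(rc == 1,  arr.ind = TRUE)   # no convergence, output forced
pending <- which(rc == -1, arr.ind = TRUE)   # sub-problems never tackled
forced
pending

# Overall tally of return codes.
table(factor(rc, levels = c(-1, 0, 1)))
```

With a real `kottcal.status`, replacing `mock_status` by `kottcal.status` should give the same kind of summary, provided the matrix layout matches the assumption above.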

## Author(s)

Diego Zardetto

## References

Deville, J.C., Sarndal, C.E. (1992) "Calibration Estimators in Survey Sampling", Journal of the American Statistical Association, Vol. 87, No. 418, pp.376-382.

Kott, Phillip S. (2008) "Building a Better Delete-a-Group Jackknife for a Calibration Estimator", NASS Research Report, NASS: Washington, DC.

Wilkinson, G.N., Rogers, C.E. (1973) "Symbolic Description of Factorial Models for Analysis of Variance", Journal of the Royal Statistical Society, series C (Applied Statistics), Vol. 22, pp. 181-191.

Vanderhoeft, C. (2001) "Generalized Calibration at Statistic Belgium", Statistics Belgium Working Paper n. 3, http://www.statbel.fgov.be/studies/paper03_en.asp.

Lumley, T. (2006) "survey: analysis of complex survey samples", http://cran.at.r-project.org/web/packages/survey/index.html.

Scannapieco, M., Zardetto, D., Barcaroli, G. (2007) "La Calibrazione dei Dati con R: una Sperimentazione sull'Indagine Forze di Lavoro ed un Confronto con GENESEES/SAS", Contributi Istat n. 4., https://www.istat.it/it/files//2018/07/2007_4.pdf.

## See Also

`desc` for a concise description of `kott.design` objects, `kottby`, `kott.ratio`, `kott.regcoef`, `kott.quantile` and `kottby.user` for calculating estimates and standard errors, `pop.template` for constructing known totals data frames in compliance with the standard required by `kottcalibrate`, `population.check` to check that the known totals data frame satisfies that standard, `bounds.hint` to obtain a hint for range-restricted calibration.

## Examples

```
# Calibration of a kott.design object according to different calibration
# models (the known totals data frames pop01, ..., pop05p and the bounds
# vector are contained in the data.examples file).
# For the examples relating to calibration models that can be factorised
# both a global and an iterative solution are given.

data(data.examples)

# Creation of the object to be calibrated:
kdes<-kottdesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
                 weights=~weight,nrg=15)


# 1) Calibration on the total number of units in the population
#    (totals in pop01):
kdescal01<-kottcalibrate(deskott=kdes,df.population=pop01,calmodel=~1,
                         calfun="logit",bounds=bounds,aggregate.stage=2)

# Checking the result (the function 'ones' is contained
# in the data.examples file):
kottby.user(kdescal01,user.estimator=ones)


# 2) Calibration on the marginal distributions of sex and marstat
#    (totals in pop02):
kdescal02<-kottcalibrate(deskott=kdes,df.population=pop02,
                         calmodel=~sex+marstat-1,calfun="logit",bounds=bounds,
                         aggregate.stage=2)

# Checking the result:
kottby(kdescal02,~sex+marstat)


# 3) Calibration (global solution) on the joint distribution of sex
#    and marstat (totals in pop03):
kdescal03<-kottcalibrate(deskott=kdes,df.population=pop03,
                         calmodel=~marstat:sex-1,calfun="logit",bounds=bounds)

# Checking the result:
kottby(kdescal03,~sex,~marstat)

# or:
kottby(kdescal03,~marstat,~sex)

# which, obviously, is not respected by kdescal02 (notice the size of SE):
kottby(kdescal02,~sex,~marstat)


# 3.1) Again a calibration on the joint distribution of sex and marstat
#      but, this time, with the iterative solution (partition=~sex,
#      totals in pop03p):
kdescal03p<-kottcalibrate(deskott=kdes,df.population=pop03p,
                          calmodel=~marstat-1,partition=~sex,calfun="logit",
                          bounds=bounds)

# Checking the result:
kottby(kdescal03p,~sex,~marstat)


# 4) Calibration (global solution) on the totals for the quantitative
#    variables x1, x2 and x3 in the subpopulations defined by the
#    regcod variable (totals in pop04):
kdescal04<-kottcalibrate(deskott=kdes,df.population=pop04,
                         calmodel=~(x1+x2+x3-1):regcod,calfun="logit",
                         bounds=bounds,aggregate.stage=2)

# Checking the result:
kottby(kdescal04,~x1+x2+x3,~regcod)


# 4.1) Same problem with the iterative solution (partition=~regcod,
#      totals in pop04p):
kdescal04p<-kottcalibrate(deskott=kdes,df.population=pop04p,
                          calmodel=~x1+x2+x3-1,partition=~regcod,calfun="logit",
                          bounds=bounds,aggregate.stage=2)

# Checking the result:
kottby(kdescal04p,~x1+x2+x3,~regcod)


# 5) Calibration (global solution) on the total for the quantitative
#    variable x1 and on the marginal distribution of the qualitative
#    variable age5c, in the subpopulations defined by crossing sex
#    and marstat (totals in pop05):
kdescal05<-kottcalibrate(deskott=kdes,df.population=pop05,
                         calmodel=~(age5c+x1-1):sex:marstat,calfun="logit",
                         bounds=bounds,force.rep=TRUE)

# Calibration process diagnostics:
kottcal.status

# Checking the result:
kottby(kdescal05,~age5c+x1,~sex:marstat)


# 5.1) Same problem with the iterative solution (partition=~sex:marstat,
#      totals in pop05p):
kdescal05p<-kottcalibrate(deskott=kdes,df.population=pop05p,
                          calmodel=~age5c+x1-1,partition=~sex:marstat,
                          calfun="logit",bounds=bounds,force.rep=TRUE)

# Calibration process diagnostics:
kottcal.status

# Checking the result:
kottby(kdescal05p,~age5c+x1,~sex:marstat)


# Notice that 3.1) and 5.1) do not impose the aggregate.stage=2
# condition. This condition cannot, in fact, be fulfilled because
# in both cases the domains defined by partition "cut across"
# the kdes second stage clusters (households). To compare the results,
# the same choice was also made for 3) and 5).
```

DiegoZardetto/EVER documentation built on April 8, 2021, 1:05 p.m.