This vignette explains the usage of the
ipf() function, which has been used for calibrating the labour force survey of Austria for several years.
It is based on the Iterative Proportional Fitting algorithm and gives some flexibility about the details of the implementation. See [@mekogu2016] or
vignette("methodology") for more details.
We will assume the output of
demo.eusilc() is our population.
From this population, a sample without replacement is drawn.
The sample covers 10 percent of the population.
We assign a weight of one for all observations of the population and a weight of ten for all observations of the sample.
library(surveysd)
population <- demo.eusilc(1, prettyNames = TRUE)
population[, pWeight := 1]
pop_sample <- population[sample(1:.N, floor(.N*0.10)), ]
pop_sample[, pWeight := 10]
We will start with an example where we want to adapt the weights of
pop_sample such that the weighted numbers of males and females match those of
population.
We can see that this is currently not the case.
(gender_distribution <- xtabs(pWeight ~ gender, population))
xtabs(pWeight ~ gender, pop_sample)
Due to random sampling (rather than stratified sampling), there are differences between the gender distributions.
We can pass
gender_distribution as a parameter to
ipf() to obtain modified weights.
pop_sample_c <- ipf(pop_sample, conP = list(gender_distribution), w = "pWeight")
The resulting dataset,
pop_sample_c, is similar to
pop_sample but has an additional column with the adjusted weights.
dim(pop_sample)
dim(pop_sample_c)
setdiff(names(pop_sample_c), names(pop_sample))
We can now calculate the weighted number of males and females according to this new weight. The weighted totals now match the constraints.
xtabs(calibWeight ~ gender, pop_sample_c)
xtabs(pWeight ~ gender, population)
In this simple case,
ipf() just performs a post-stratification step.
This means that all males share one weight and all females share another.
xtabs(~ calibWeight + gender, pop_sample_c)
overrepresented_gender <- pop_sample_c[calibWeight < 10, ][1, gender]
The overrepresented gender (stored in overrepresented_gender) has been weighted down (
calibWeight < 10) to compensate for its overrepresentation in the sample.
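The post-stratification step described above can be reproduced by hand. The following base R sketch does not use ipf(); the toy data, the column names w and w_cal, and the target totals are made up for illustration. Each weight is scaled by the ratio of the target total to the current weighted total of its gender.

```r
# Toy sample (hypothetical): 6 males and 4 females, design weight 10 each.
toy <- data.frame(
  gender = c(rep("male", 6), rep("female", 4)),
  w = 10
)

# Hypothetical population totals that the weights should reproduce.
target <- c(female = 50, male = 50)

# Current weighted totals per gender, as a named numeric vector.
current <- sapply(split(toy$w, toy$gender), sum)

# Scale every weight by target / current for its own gender.
toy$w_cal <- toy$w * target[toy$gender] / current[toy$gender]

# All members of a gender end up with the same calibrated weight.
xtabs(~ w_cal + gender, toy)
```

Males are overrepresented in this toy sample, so their calibrated weight drops below 10, while the female weight rises above 10.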
Let's now assume that we want to put constraints on the number of males and females for each age group.
The numbers from the original population can be obtained with
(con_ga <- xtabs(pWeight ~ gender + age, population))
xtabs(pWeight ~ gender + age, pop_sample)
Again, we can see that those constraints are not met.
Supplying the contingency table
con_ga to
ipf() will again resolve this.
pop_sample_c2 <- ipf(pop_sample, conP = list(con_ga), w = "pWeight")
xtabs(pWeight ~ gender + age, population)
xtabs(calibWeight ~ gender + age, pop_sample_c2)
Now we assume that we know the number of persons living in each NUTS2 region from registry data.
registry_table <- xtabs(pWeight ~ region, population)
However, these registry data do not provide any information about age or
gender.
Therefore, the two contingency tables (
con_ga and
registry_table) have to be specified independently.
This can be done by supplying a list to
conP.
pop_sample_c2 <- ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight")
xtabs(pWeight ~ gender + age, population)
xtabs(calibWeight ~ gender + age, pop_sample_c2)
xtabs(pWeight ~ region, population)
xtabs(calibWeight ~ region, pop_sample_c2)
This time, the constraints are not matched perfectly.
That is because we provided more than one constraint, so the
ipf() algorithm had to work iteratively.
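To see why several constraints force an iterative procedure, here is a minimal base R sketch of raking over two margins. It is not the implementation used by ipf(); the toy data, the helper rake_once(), and the target margins are assumptions made for illustration. Adjusting the weights to one margin disturbs the other, so the two steps are repeated until both margins are close to their targets.

```r
# Toy sample (hypothetical) with two categorical variables.
gender <- c("male", "male", "male", "male",
            "female", "female", "female", "female")
region <- c("A", "A", "A", "B", "A", "B", "B", "B")

# Hypothetical, mutually consistent target margins (both sum to 80).
target_gender <- c(female = 35, male = 45)
target_region <- c(A = 50, B = 30)

# One post-stratification step: scale weights so that the weighted
# totals of `group` match `target` exactly.
rake_once <- function(w, group, target) {
  current <- sapply(split(w, group), sum)
  w * target[group] / current[group]
}

# A single pass: the region margin is exact afterwards, but the
# gender margin has been disturbed by the region step.
w1 <- rake_once(rep(10, 8), gender, target_gender)
w1 <- rake_once(w1, region, target_region)
sapply(split(w1, gender), sum)

# Repeating both steps brings both margins close to their targets.
w <- rep(10, 8)
for (i in 1:50) {
  w <- rake_once(w, gender, target_gender)
  w <- rake_once(w, region, target_region)
}
sapply(split(w, gender), sum)
sapply(split(w, region), sum)
```

The margin adjusted last is matched exactly, while the other one is matched only up to a small remaining deviation that shrinks with every pass.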
If the dataset has a household structure, household constraints can be passed
via the parameter
conH. If this parameter is used, it is also necessary to
supply
hid, which defines the name of the column that contains the household IDs.
(conH1 <- xtabs(pWeight ~ hsize + region, data = population[!duplicated(hid)]))
pop_sample_hh <- ipf(pop_sample, hid = "hid", conH = list(conH1), w = "pWeight", bound = 10)
xtabs(calibWeight ~ hsize + region, data = pop_sample_hh[!duplicated(hid)])
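The household tables above are built from one row per household, which is what the !duplicated(hid) filter achieves. The following base R sketch, using made-up toy data rather than the EU-SILC demo data, shows the difference between counting households and counting persons.

```r
# Toy person-level data (hypothetical): hid is the household id,
# hsize the household size, w a weight shared within each household.
persons <- data.frame(
  hid   = c(1, 1, 2, 3, 3, 3),
  hsize = c(2, 2, 1, 3, 3, 3),
  w     = 10
)

# Keep the first row of each household so every household counts once.
households <- persons[!duplicated(persons$hid), ]

# Weighted number of households by household size.
hh_counts <- xtabs(w ~ hsize, data = households)

# Summing over all person rows instead counts each household once per
# member, which gives person totals rather than household totals.
person_counts <- xtabs(w ~ hsize, data = persons)

hh_counts
person_counts
```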
If conP or
conH contain several contingency tables or if
conP and conH
are used at the same time, the ipf algorithm will operate iteratively. This
means that the calibrated dataset will satisfy the constraints only
approximately. The default tolerances of the approximation can be overwritten
using the parameters
epsP and epsH.
Lowering the tolerances will improve the match between the constraints and the contingency tables computed from the calibrated weights. However, lower tolerances also require more iterations until convergence is reached. If the tolerances are too small, ipf will return with a warning indicating that convergence could not be reached.
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight", verbose = TRUE, epsP = 0.01)
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight", verbose = TRUE, epsP = 0.0001)
We see that changing the tolerances from
0.01 (one percent) to
0.0001 (0.01 percent)
increases the number of required iterations.
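The relationship between tolerance and iteration count can also be illustrated without ipf(). The base R sketch below rakes a toy sample over two margins and counts the passes needed until both margins are within a given relative tolerance. The data, the helpers rel_dev() and count_passes(), and the convergence criterion (largest relative deviation from the target) are assumptions of this sketch and not necessarily what ipf() does internally.

```r
# Toy sample and hypothetical target margins.
gender <- c("male", "male", "male", "male",
            "female", "female", "female", "female")
region <- c("A", "A", "A", "B", "A", "B", "B", "B")
target_gender <- c(female = 35, male = 45)
target_region <- c(A = 50, B = 30)

# Largest relative deviation of the weighted totals from a target margin.
rel_dev <- function(w, group, target) {
  current <- sapply(split(w, group), sum)
  max(abs(current[names(target)] / target - 1))
}

# Rake until both margins are within eps; return the number of passes.
count_passes <- function(eps, max_iter = 200) {
  w <- rep(10, length(gender))
  for (i in seq_len(max_iter)) {
    w <- w * target_gender[gender] / sapply(split(w, gender), sum)[gender]
    w <- w * target_region[region] / sapply(split(w, region), sum)[region]
    worst <- max(rel_dev(w, gender, target_gender),
                 rel_dev(w, region, target_region))
    if (worst < eps) return(i)
  }
  NA_integer_  # no convergence within max_iter passes
}

count_passes(0.01)
count_passes(0.0001)
```

The stricter tolerance needs more passes, and a tolerance that cannot be reached within max_iter passes yields NA here, which loosely mirrors the non-convergence warning described above.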