calibPop: Calibration of 0/1 weights by Simulated Annealing

Description

A simulated annealing algorithm for calibrating synthetic population data available in a simPopObj object. The aim is to find, given a population, a combination of different households that optimally satisfies, within an acceptable error, a given table of specific known marginals. The known marginals must already be available in slot 'table' of the input object 'inp'.

Usage

calibPop(inp, split, temp = 1, eps.factor = 0.05, maxiter = 200,
  temp.cooldown = 0.9, factor.cooldown = 0.85, min.temp = 10^-3,
  nr_cpus = NULL, sizefactor = 2, memory = TRUE, choose.temp = TRUE,
  choose.temp.factor = 0.2, scale.redraw = 0.5, observe.times = 50,
  observe.break = 0.05, verbose = FALSE)

Arguments

inp

an object of class simPopObj whose slot 'table' is non-NULL (see addKnownMargins).

split

strata into which the problem will be split. Has to correspond to a column of the population data (slot 'pop' of input argument 'inp'). For example, with split = c("region") the problem is solved separately for each region. Parallel computing is performed automatically, if possible.

temp

starting temperature for the simulated annealing algorithm.

eps.factor

a factor (between 0 and 1) specifying the acceptance error. For example eps.factor = 0.05 results in an acceptance error for the objective function of 0.05*sum(totals).
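The arithmetic behind the acceptance error can be sketched as follows. The vector totals is a made-up set of target margin counts, used only for illustration; it is not taken from the package or its example data.

```r
# Hypothetical target margins (one count per margin cell); illustration only
totals <- c(1200, 800, 450, 550)

eps.factor <- 0.05

# Acceptance error for the objective function, as described for eps.factor
eps <- eps.factor * sum(totals)
print(eps)
```

With these made-up totals, a solution whose sum of absolute deviations from the margins is at most eps would count as acceptable.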

maxiter

maximum number of iterations during a temperature step.

temp.cooldown

a factor (between 0 and 1) specifying the rate at which temperature will be reduced in each step.

factor.cooldown

a factor (between 0 and 1) specifying the rate at which the number of permutations of households in each iteration will be reduced in each step.

min.temp

minimal temperature at which the algorithm will stop.
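To get a feeling for how temp, temp.cooldown and min.temp interact, the following sketch lists the temperature levels visited under the default values. It assumes a geometric schedule (temp is multiplied by temp.cooldown after each step) and a stopping rule of the form "stop once temp drops below min.temp"; the exact rule used inside calibPop may differ.

```r
temp <- 1
temp.cooldown <- 0.9
min.temp <- 10^-3

# Assumed geometric cooling: collect all temperature levels that are
# still at or above min.temp
temps <- temp
while (temps[length(temps)] * temp.cooldown >= min.temp) {
  temps <- c(temps, temps[length(temps)] * temp.cooldown)
}

n_levels <- length(temps)
print(n_levels)      # number of temperature levels visited
print(range(temps))  # lowest and highest temperature visited
```

Under these assumptions, n_levels multiplied by maxiter gives an upper bound on the total number of iterations per stratum.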

nr_cpus

if specified, an integer number defining the number of cpus that should be used for parallel processing.

sizefactor

the factor for inflating the population before applying 0/1 weights

memory

if TRUE simulated annealing is applied in a less memory-intensive way. This is especially useful if the inflation factor (sizefactor) or the population is large. For this option simulated annealing is not entirely implemented in C++, so it might be slower than with memory=FALSE.

choose.temp

if TRUE temp will be rescaled according to eps and choose.temp.factor. eps is defined as the product of eps.factor and the sum over the target population margins, see addKnownMargins. Only used if memory=TRUE.

choose.temp.factor

number in (0,1) for rescaling temp for simulated annealing. temp is redefined by max(temp,eps*choose.temp.factor). Can be useful if simulated annealing is split into subgroups with considerably different population sizes. Only used if choose.temp=TRUE and memory=TRUE.
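The rescaling rule max(temp,eps*choose.temp.factor) can be tried out on two strata of very different size. The helper rescale_temp and both sets of totals are hypothetical and serve only to illustrate the formula; they are not part of simPop.

```r
# Hypothetical helper implementing the rescaling formula quoted above
rescale_temp <- function(temp, totals, eps.factor, choose.temp.factor) {
  eps <- eps.factor * sum(totals)      # acceptance error for this stratum
  max(temp, eps * choose.temp.factor)  # temp is never lowered, only raised
}

# Small stratum: eps * choose.temp.factor stays below temp, so temp is kept
temp_small <- rescale_temp(1, totals = c(10, 15, 5),
                           eps.factor = 0.05, choose.temp.factor = 0.2)

# Large stratum: the starting temperature is raised
temp_large <- rescale_temp(1, totals = c(40000, 60000),
                           eps.factor = 0.05, choose.temp.factor = 0.2)

print(c(small = temp_small, large = temp_large))
```

The starting temperature therefore scales with the size of the stratum, so large strata do not start out too cold relative to the magnitude of their objective values.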

scale.redraw

Only used if memory=TRUE. Number in (0,1) scaling the number of households that need to be drawn and discarded in each iteration step. The number of individuals currently selected through simulated annealing is subtracted from the sum over the target population margins added to inp via addKnownMargins. This difference is divided by the median household size, resulting in an estimated number of households by which the current synthetic population differs from the population margins (~redraw_gap). The next iteration will then adjust the number of households to be drawn or discarded (redraw) according to max(ceiling(redraw-redraw_gap*scale.redraw),1) or max(ceiling(redraw+redraw_gap*scale.redraw),1), respectively. This keeps the number of individuals in the synthetic population relatively stable with respect to the population margins. Otherwise the synthetic population might become considerably larger or smaller than the population margins through the selection of many large or small households.
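The two update formulas can be written out as a small sketch. The helper adjust_redraw is hypothetical, and redraw_gap is passed in as an already computed number; the example value assumes that a positive redraw_gap means the current selection contains too many households, which is an assumption made here for illustration and not verified against the package source.

```r
# Hypothetical helper applying the two formulas quoted above
adjust_redraw <- function(redraw, redraw_gap, scale.redraw) {
  list(
    # households to be drawn in the next iteration
    draw    = max(ceiling(redraw - redraw_gap * scale.redraw), 1),
    # households to be discarded in the next iteration
    discard = max(ceiling(redraw + redraw_gap * scale.redraw), 1)
  )
}

# Example: 10 households are redrawn per iteration and the selection is
# (by assumption) about 20 households too large
adj <- adjust_redraw(redraw = 10, redraw_gap = 20, scale.redraw = 0.5)
print(unlist(adj))
```

The outer max(..., 1) ensures that at least one household is drawn and one is discarded in every iteration, however large the gap becomes.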

observe.times

Only used if memory=TRUE. Number of times the new value of the objective function is saved. If observe.times=0 values are not saved.

observe.break

Only used if memory=TRUE. Once the objective value has been saved observe.times times, the coefficient of variation is calculated over the saved values; if the coefficient of variation falls below observe.break, simulated annealing terminates. This is repeated for each new set of observe.times values of the objective function. Can help save run time if the objective value does not improve much. Disable this termination by setting either observe.times=0 or observe.break=0.
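A minimal sketch of this early-stopping check, assuming the coefficient of variation is computed as standard deviation divided by mean. The helper should_stop and the two series of objective values are made up for illustration; the estimator used inside calibPop may differ.

```r
# Coefficient of variation (assumed definition: sd / mean)
cv <- function(x) sd(x) / mean(x)

# Hypothetical helper: TRUE if the saved objective values vary so little
# that further iterations are unlikely to help
should_stop <- function(obj_values, observe.break) {
  if (length(obj_values) < 2 || observe.break <= 0) {
    return(FALSE)  # check disabled, or too few values to compute a sd
  }
  cv(obj_values) < observe.break
}

stalled   <- c(510, 505, 507, 503, 506)  # objective barely changes
improving <- c(900, 700, 520, 400, 310)  # objective still dropping

stop_stalled   <- should_stop(stalled, observe.break = 0.05)
stop_improving <- should_stop(improving, observe.break = 0.05)
stop_disabled  <- should_stop(stalled, observe.break = 0)
print(c(stop_stalled, stop_improving, stop_disabled))
```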

verbose

boolean variable; if TRUE some additional verbose output is provided, however only if split is NULL. Otherwise the computation is performed in parallel and no useful output can be provided.

Details

Calibrates data using simulated annealing. The algorithm searches for a (near) optimal combination of different households by swapping households at random in each iteration of each temperature level. During the algorithm, as well as in the output, the optimal (or so far best) combination is indicated by a logical vector containing only 0s (not included) and 1s (included in the optimal selection). The objective function for simulated annealing is defined as the sum of absolute differences between target marginals and synthetic marginals (= marginals of the synthetic dataset). The sum of target marginals can be at most as large as the sum of synthetic marginals. For every factor level in “split”, the data must contain at least as many entries of this kind as the target marginals.
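The idea can be illustrated with a toy version in plain R. Everything in this sketch is an assumption made for illustration: the data are random, the acceptance rule is the standard Metropolis criterion, and the number of selected households is held fixed. It is not the implementation used by calibPop.

```r
set.seed(1)

# Toy population (hypothetical): 80 households with 1 to 4 persons each
n_hh <- 80
hh_size <- sample(1:4, n_hh, replace = TRUE)
persons <- data.frame(
  hhid = rep(seq_len(n_hh), times = hh_size),
  sex  = factor(sample(c("f", "m"), sum(hh_size), replace = TRUE))
)

# Persons per household and category (rows are households 1..n_hh)
counts <- unclass(table(persons$hhid, persons$sex))

# Target margins taken from a hidden selection, so an exact fit exists
target <- colSums(counts[sample(n_hh, 40), , drop = FALSE])

# Objective: sum of absolute differences between target and synthetic margins
objective <- function(w) {
  sum(abs(target - colSums(counts[w == 1L, , drop = FALSE])))
}

# Random starting selection encoded as 0/1 weights
w <- rep(0L, n_hh)
w[sample(n_hh, 40)] <- 1L
start <- objective(w)
cur <- start
best <- start
best_w <- w

temp <- 1
while (temp > 1e-3 && best > 0) {
  for (i in seq_len(200)) {
    # Swap one included household with one excluded household
    cand <- w
    cand[sample(which(w == 1L), 1)] <- 0L
    cand[sample(which(w == 0L), 1)] <- 1L
    new <- objective(cand)
    # Metropolis rule: always accept improvements, sometimes accept worse
    if (new <= cur || runif(1) < exp(-(new - cur) / temp)) {
      w <- cand
      cur <- new
      if (cur < best) {
        best <- cur
        best_w <- w
      }
    }
  }
  temp <- temp * 0.9  # cool down
}

print(c(start = start, best = best))
```

Here best_w plays the role of the 0/1 vector described above: households with weight 1 form the calibrated selection, and best is the remaining absolute deviation from the target margins.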

Possible donors are automatically generated within the procedure.

The number of cpus is selected automatically in the following manner: by default it equals the number of strata. However, if the number of available cpus is less than the number of strata, the number of available cpus minus 1 is used. This should be the best strategy in most cases, but the user can override it via nr_cpus.
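The selection rule can be written down as a small sketch. The helper choose_cpus is hypothetical, and the lower bound of one cpu is an assumption added here so that the result is always usable; the package may handle this case differently.

```r
# Hypothetical helper mirroring the rule described above
choose_cpus <- function(n_strata, n_available) {
  if (n_available < n_strata) {
    max(n_available - 1, 1)  # leave one cpu free, but use at least one
  } else {
    n_strata                 # one cpu per stratum
  }
}

cpus_few  <- choose_cpus(n_strata = 9, n_available = 4)
cpus_many <- choose_cpus(n_strata = 3, n_available = 8)
print(c(cpus_few, cpus_many))
```

In practice n_available could be obtained from parallel::detectCores(), which may return NA on some platforms and would need to be checked first.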

Value

Returns an object of class simPopObj with an updated population listed in slot 'pop'.

Author(s)

Bernhard Meindl, Johannes Gussenbauer and Matthias Templ

References

M. Templ, B. Meindl, A. Kowarik, A. Alfons, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Software, 79 (10), 1–38. doi: 10.18637/jss.v079.i10

Examples

data(eusilcS) # load sample data
data(eusilcP) # population data
## Not run: 
## approx. 20 seconds computation time
inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090")
simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090"))
simPop <- simCategorical(simPop, additional=c("pl030", "pb220a"), method="multinom", nr_cpus=1)

# add margins
margins <- as.data.frame(
  xtabs(rep(1, nrow(eusilcP)) ~ eusilcP$region + eusilcP$gender + eusilcP$citizenship))
colnames(margins) <- c("db040", "rb090", "pb220a", "freq")
simPop <- addKnownMargins(simPop, margins)
# apply simulated annealing (memory-saving variant)
simPop_adj2 <- calibPop(simPop, split="db040", temp=1, eps.factor=0.1, memory=TRUE)

## End(Not run)
# apply simulated annealing without the memory-saving option
## Not run: 
## long computation time; uses the simPop object with known margins created above
simPop_adj <- calibPop(simPop, split="db040", temp=1, eps.factor=0.1, memory=FALSE)

## End(Not run)

statistikat/simPop documentation built on July 24, 2018, 10:55 a.m.