crossValidation: Simulate variables of population data by cross validation

View source: R/crossValidation.R

crossValidation    R Documentation

Simulate variables of population data by cross validation

Description

Simulate variables of population data. The household structure of the population data needs to be simulated beforehand.

Usage

crossValidation(
  simPopObj,
  additionals,
  hyper_param_grid,
  fold = 3,
  method = c("xgboost"),
  type = c("categorical"),
  by = "strata",
  regModel = "available",
  nr_cpus = 1,
  verbose = FALSE
)

Arguments

simPopObj

an object of class simPopObj containing population data and household survey data and, optionally, margins in standardized format.

additionals

a character vector specifying additional categorical variables available in the sample object of simPopObj that should be simulated for the population data.

hyper_param_grid

a grid (for example, a data frame created with expand.grid) containing model-specific parameters that are passed on to the function call for the respective model.
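
As a rough illustration of what such a grid looks like, the sketch below builds the same grid as in the Examples section and converts each row into a named parameter list. The conversion step and the name param_sets are assumptions made for illustration; how crossValidation consumes the grid internally is not documented here.

```r
# Same grid as in the Examples section: one row per parameter combination
grid <- expand.grid(nrounds = c(5, 10),
                    max_depth = 10,
                    eta = c(0.2, 0.3, 0.5),
                    eval_metric = "mlogloss",
                    stringsAsFactors = FALSE)

nrow(grid)  # 2 * 1 * 3 * 1 = 6 combinations

# Hypothetical helper step: turn each row into a named list of parameters
param_sets <- lapply(seq_len(nrow(grid)), function(i) {
  as.list(grid[i, , drop = FALSE])
})

param_sets[[1]]$eta  # 0.2
```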

fold

the number of folds k used in k-fold cross validation.
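
For readers unfamiliar with k-fold cross validation, the following base R sketch shows the general idea of splitting observations into k folds and holding out each fold once. It is a generic illustration with hypothetical names (n, k, fold_id, splits), not the splitting code used by crossValidation.

```r
set.seed(1)   # reproducible fold assignment
n <- 10       # hypothetical number of observations
k <- 3        # corresponds to the default fold = 3

# Randomly assign each observation to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, hold it out for evaluation and train on the remaining folds
splits <- lapply(seq_len(k), function(f) {
  list(train = which(fold_id != f),
       test  = which(fold_id == f))
})

# A model would be fitted on splits[[f]]$train and scored on splits[[f]]$test
lengths(lapply(splits, `[[`, "test"))
```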

method

a character string specifying the method to be used for simulating the additional categorical variables. At the moment the only accepted value is "xgboost", which uses the implementation in package xgboost.

type

currently only "categorical" is implemented

by

the variable used to split the estimation into subsets. Defaults to the strata variable.

regModel

specifies the variables or model used when simulating additional categorical variables. The following choices are available if different from NULL.

  • 'basic': only the basic household variables (generated with simStructure) are used.

  • 'available': all available variables (those common to the sample and the synthetic population, such as previously generated variables), excluding id-variables, strata variables and household sizes, are used for the modelling. This parameter should be used with care because all factors are automatically used as factors internally.

  • formula object: users may also specify a specific formula (class 'formula') that will be used. Checks are performed to ensure that all required variables are available.

If method 'distribution' is used, it is only possible to specify a vector of length one containing one of the choices described above. If parameter 'regModel' is NULL, only basic household variables are used in any case.
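
The kind of availability check mentioned for formula objects can be sketched in base R as follows. The formula, the toy data and the helper name missing_predictors are hypothetical; the variable names are borrowed from the eusilcS example, and the checks actually performed by the package may differ.

```r
# Hypothetical helper: which predictors of a formula are absent from the data?
missing_predictors <- function(f, data) {
  rhs <- f[[length(f)]]            # right-hand side (works for one- and two-sided formulas)
  setdiff(all.vars(rhs), names(data))
}

# Toy data containing two of the basic household variables
dat <- data.frame(age = c(30, 45), rb090 = c("male", "female"))

missing_predictors(pl030 ~ age + rb090, dat)    # character(0): all predictors available
missing_predictors(pl030 ~ age + pb220a, dat)   # "pb220a" is not in the data
```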

nr_cpus

if specified, an integer number defining the number of cpus that should be used for parallel processing.

verbose

set to TRUE if additional print output should be shown.

Details

The number of cpus is selected automatically as follows. By default, the number of cpus equals the number of strata. However, if the number of available cpus is less than the number of strata, the number of cpus minus 1 is used instead. This should be the best strategy, but the user can override this decision.
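
The selection rule described above can be written down as a small function. This is a sketch of the stated rule only, with hypothetical names (choose_cpus, n_strata, n_available); the lower bound of one cpu is an added assumption, and the package's internal logic may differ.

```r
# Sketch of the documented rule for choosing the number of cpus
choose_cpus <- function(n_strata, n_available) {
  if (n_available < n_strata) {
    max(1, n_available - 1)   # assumption: never go below one cpu
  } else {
    n_strata                  # one cpu per stratum
  }
}

choose_cpus(n_strata = 9, n_available = 4)    # 3
choose_cpus(n_strata = 9, n_available = 16)   # 9
```

In practice, n_available could be obtained with parallel::detectCores(), which may return NA on some platforms and would then need separate handling.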

Value

An object of class simPopObj containing survey data as well as the simulated population data, including the categorical variables specified by argument additionals.

Note

The basic household structure needs to be simulated beforehand with the function simStructure.

Author(s)

Bernhard Meindl, Andreas Alfons, Stefan Kraft, Alexander Kowarik, Matthias Templ, Siro Fritzmann

See Also

simStructure, simRelation, simContinuous, simComponents, simCategorical

Examples

data(eusilcS) # load sample data
## Not run: 
## approx. 20 seconds computation time
inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090")
## in the following, nr_cpus are selected automatically
simPop <- simStructure(data=inp, method="direct", basicHHvars=c("age", "rb090"))
grid <- expand.grid(nrounds = c(5, 10),
                    max_depth = 10,
                    eta = c(0.2, 0.3, 0.5),
                    eval_metric = "mlogloss",
                    stringsAsFactors = FALSE)

simPop <- crossValidation(simPop, additionals=c("pl030", "pb220a"),
                          nr_cpus=1, hyper_param_grid = grid)
simPop

## End(Not run)

simPop documentation built on Nov. 10, 2022, 5:43 p.m.