simContinuous: Simulate continuous variables of population data
In statistikat/simPop: Simulation of Complex Synthetic Data Information

simContinuous

R Documentation

Description

Simulate continuous variables of population data using multinomial log-linear models combined with random draws from the resulting categories or (two-step) regression models combined with random error terms. The household structure of the population data and any other categorical predictors need to be simulated beforehand.
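The multinomial approach can be pictured with a small base R sketch. It is only an illustration under simplifying assumptions, not the package's implementation: the data are made up, unweighted quantiles are used as break points, and a conditional frequency table stands in for the multinomial log-linear model.

```r
# Illustrative sketch only; not the simPop implementation.
set.seed(1)

# Hypothetical sample: income depends on one categorical predictor.
n <- 500
grp <- factor(sample(c("a", "b"), n, replace = TRUE))
income <- ifelse(grp == "a", 20000, 30000) * exp(rnorm(n, sd = 0.4))

# Step 1: discretize the response using quantile-based break points.
breaks <- unname(quantile(income, probs = seq(0, 1, by = 0.2)))
income_cat <- cut(income, breaks = breaks, include.lowest = TRUE)

# Step 2: estimate category probabilities given the predictor.
# (A conditional frequency table stands in for the multinomial model.)
probs <- prop.table(table(grp, income_cat), margin = 1)

# Step 3: for each unit of a synthetic population, draw a category,
# then draw a value uniformly within that category's interval.
pop_grp <- sample(levels(grp), 1000, replace = TRUE)
pop_cat <- vapply(pop_grp, function(g) {
  sample(ncol(probs), 1, prob = probs[g, ])
}, integer(1))
pop_income <- runif(length(pop_cat),
                    min = breaks[pop_cat],
                    max = breaks[pop_cat + 1])
```

In the function itself, the upper tail can instead be drawn from a generalized Pareto distribution (see `gpd` and `threshold`); the sketch uses uniform draws throughout.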

Usage

simContinuous(
  simPopObj,
  additional = "netIncome",
  method = c("multinom", "lm", "poisson", "xgboost"),
  zeros = TRUE,
  breaks = NULL,
  lower = NULL,
  upper = NULL,
  equidist = TRUE,
  probs = NULL,
  gpd = TRUE,
  threshold = NULL,
  est = "moments",
  limit = NULL,
  censor = NULL,
  log = TRUE,
  const = NULL,
  alpha = 0.01,
  residuals = TRUE,
  keep = TRUE,
  maxit = 500,
  MaxNWts = 1500,
  tol = .Machine$double.eps^0.5,
  nr_cpus = NULL,
  eps = NULL,
  regModel = "basic",
  byHousehold = NULL,
  imputeMissings = FALSE,
  seed,
  verbose = FALSE,
  by = "strata",
  model_params = NULL
)

Arguments

`simPopObj`	a `simPopObj` holding household survey data, population data and optionally some margins.
`additional`	a character string specifying the additional continuous variable of `dataS` that should be simulated for the population data. Currently, only one additional variable can be simulated at a time.
`method`	a character string specifying the method to be used for simulating the continuous variable. Accepted values are `"multinom"`, for using multinomial log-linear models combined with random draws from the resulting categories, `"lm"`, for using (two-step) regression models combined with random error terms, `"poisson"` for using Poisson regression for count variables, and `"xgboost"` for using XGBoost.
`zeros`	a logical indicating whether the variable specified by `additional` is semi-continuous, i.e., contains a considerable amount of zeros. If `TRUE` and `method` is `"multinom"`, a separate factor level for zeros in the response is used. If `TRUE` and `method` is `"lm"`, a two-step model is applied. The first step thereby uses a log-linear or multinomial log-linear model (see “Details”).
`breaks`	an optional numeric vector; if multinomial models are computed, this can be used to supply two or more break points for categorizing the variable specified by `additional`. If `NULL`, break points are computed using weighted quantiles.
`lower`, `upper`	optional numeric values; if multinomial models are computed and `breaks` is `NULL`, these can be used to specify lower and upper bounds other than minimum and maximum, respectively. Note that if `method` is `"multinom"` and `gpd` is `TRUE` (see below), `upper` defaults to `Inf`.
`equidist`	logical; if `method` is `"multinom"` and `breaks` is `NULL`, this indicates whether the (positive) default break points should be equidistant or whether there should be refinements in the lower and upper tail (see `getBreaks`).
`probs`	numeric vector with values in `[0, 1]`; if `method` is `"multinom"` and `breaks` is `NULL`, this gives probabilities for quantiles to be used as (positive) break points. If supplied, this is preferred over `equidist`.
`gpd`	logical; if `method` is `"multinom"`, this indicates whether the upper tail of the variable specified by `additional` should be simulated by random draws from a (truncated) generalized Pareto distribution rather than a uniform distribution.
`threshold`	a numeric value; if `method` is `"multinom"`, values for categories above `threshold` are drawn from a (truncated) generalized Pareto distribution.
`est`	a character string; if `method` is `"multinom"`, the estimator to be used to fit the generalized Pareto distribution.
`limit`	an optional named list of lists; if multinomial models are computed, this can be used to account for structural zeros. The names of the list components specify the predictor variables for which to limit the possible outcomes of the response. For each predictor, a list containing the possible outcomes of the response for each category of the predictor can be supplied. The probabilities of other outcomes conditional on combinations that contain the specified categories of the supplied predictors are set to 0. Currently, this is only implemented for more than two categories in the response.
`censor`	an optional named list of lists or `data.frame`s; if multinomial models are computed, this can be used to account for structural zeros. The names of the list components specify the categories that should be censored. For each of these categories, a list or `data.frame` containing levels of the predictor variables can be supplied. The probability of the specified categories is set to 0 for the respective predictor levels. Currently, this is only implemented for more than two categories in the response.
`log`	logical; if `method` is `"lm"`, this indicates whether the linear model should be fitted to the logarithms of the variable specified by `additional`. The predicted values are then back-transformed with the exponential function. See “Details” for more information.
`const`	numeric; if `method` is `"lm"` and `log` is `TRUE`, this gives a constant to be added before log transformation.
`alpha`	numeric; if `method` is `"lm"`, this gives trimming parameters for the sample data. Trimming is thereby done with respect to the variable specified by `additional`. If a numeric vector of length two is supplied, the first element gives the trimming proportion for the lower part and the second element the trimming proportion for the upper part. If a single numeric is supplied, it is used for both. With `NULL`, trimming is suppressed.
`residuals`	logical; if `method` is `"lm"`, this indicates whether the random error terms should be obtained by draws from the residuals. If `FALSE`, they are drawn from a normal distribution (median and MAD of the residuals are used as parameters).
`keep`	logical; if multinomial models are computed, this indicates whether the simulated categories should be stored as a variable in the resulting population data. If `TRUE`, the corresponding column name is given by `additional` with postfix `"Cat"`.
`maxit`, `MaxNWts`	control parameters to be passed to `multinom` and `nnet`. See the help file for `nnet`.
`tol`	if `method` is `"lm"` and `zeros` is `TRUE`, a small positive numeric value or `NULL`. When fitting a log-linear model within a stratum, factor levels may not exist in the sample but are likely to exist in the population. However, the coefficient for such factor levels will be 0. Therefore, coefficients smaller than `tol` in absolute value are replaced by coefficients from an auxiliary model that is fit to the whole sample. If `NULL`, no auxiliary log-linear model is computed and no coefficients are replaced.
`nr_cpus`	if specified, an integer giving the number of CPUs to be used for parallel processing.
`eps`	a small positive numeric value, or `NULL` (the default). In the former case and if (multinomial) log-linear models are computed, estimated probabilities smaller than this are assumed to result from structural zeros and are set to exactly 0.
`regModel`	specifies the model to be used for simulating the additional continuous variable. The following choices are possible: `"basic"`: only the basic household variables (generated with `simStructure`) are used. `"available"`: all available variables that are common to the sample and the synthetic population (e.g., previously generated variables) are used for modeling. Use with care, because all variables are automatically used as factors. Formula object: users may also specify a formula (class `formula`) to be used; checks are performed that all required variables are available.
`byHousehold`	if `NULL`, simulated values are used as is. If `'sum'`, `'mean'` or `'random'` is specified, the values are aggregated within each household and every member of the household is assigned the same value (the sum, the mean or a random value).
`imputeMissings`	if `TRUE`, missing values in variables that are used for the underlying model are imputed using hot-deck imputation.
`seed`	optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored.
`verbose`	logical; if `TRUE`, additional output is written to the console.
`by`	the variable used to split the data for estimation. Defaults to the strata variable.
`model_params`	optional additional parameters for the model; currently only implemented for XGBoost hyperparameters.
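The household aggregation described for `byHousehold` can be sketched in base R as follows. The data frame and its column names are hypothetical, and how the function picks the value for `'random'` is an assumption made here for illustration (one member's value is chosen at random).

```r
# Illustrative sketch only; data and column names are hypothetical.
set.seed(42)
pop <- data.frame(
  hid   = c(1, 1, 1, 2, 2, 3),
  value = c(100, 250, 0, 400, 600, 50)
)

# "sum" and "mean": every household member receives the household aggregate.
pop$value_sum  <- ave(pop$value, pop$hid, FUN = sum)
pop$value_mean <- ave(pop$value, pop$hid, FUN = mean)

# "random" (assumed behavior): every member receives the value of one
# randomly chosen member of the same household.
pick_one <- function(x) rep(x[sample.int(length(x), 1)], length(x))
pop$value_random <- ave(pop$value, pop$hid, FUN = pick_one)
```

Using `sample.int()` on the index rather than `sample()` on the values avoids the special case in which `sample()` treats a single number as a range.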

Details

If method is "lm", two-step models behave as follows.

If zeros is TRUE and either log is not TRUE or the variable specified by additional does not contain negative values, a log-linear model is used to predict whether an observation is zero or not. A linear model is then used to predict the non-zero values.
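A minimal base R sketch of such a two-step model is given below. It is not the package's implementation: the sample and population are simulated, a logistic regression via `glm` stands in for the first-step model, and the stratification, weighting and trimming done by the function are omitted.

```r
# Illustrative sketch only; not the simPop implementation.
set.seed(123)

# Hypothetical sample with a semi-continuous response.
n <- 1000
x <- factor(sample(c("low", "high"), n, replace = TRUE))
is_pos <- rbinom(n, 1, ifelse(x == "high", 0.8, 0.5)) == 1
y <- ifelse(is_pos, exp(9 + 0.5 * (x == "high") + rnorm(n, sd = 0.3)), 0)
samp <- data.frame(x = x, y = y)

# Step 1: binary model for zero versus non-zero.
fit_zero <- glm(I(y > 0) ~ x, data = samp, family = binomial)

# Step 2: linear model on the log scale, fitted to non-zero values only.
fit_pos <- lm(log(y) ~ x, data = samp[samp$y > 0, ])

# Hypothetical population sharing the same predictor.
pop <- data.frame(
  x = factor(sample(levels(x), 5000, replace = TRUE), levels = levels(x))
)
p_pos <- predict(fit_zero, newdata = pop, type = "response")
pop_pos <- runif(nrow(pop)) < p_pos

# Error terms are drawn from the residuals; predictions are then
# back-transformed with the exponential function.
err <- sample(residuals(fit_pos), nrow(pop), replace = TRUE)
pop$y <- ifelse(pop_pos, exp(predict(fit_pos, newdata = pop) + err), 0)
```

Drawing the error terms from the residuals corresponds to `residuals = TRUE`; with `residuals = FALSE` they would instead come from a normal distribution parameterized by the median and MAD of the residuals.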

If zeros is TRUE, log is TRUE and const is specified, again a log-linear model is used to predict whether an observation is zero or not. In the linear model to predict the non-zero values, const is added to the variable specified by additional before the logarithms are taken.
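The role of const can be shown with a short worked example. The values are made up, and subtracting const again after the exponential back-transformation is an assumption made here so that the round trip is exact; the text above only states that const is added before taking logarithms.

```r
# Illustrative sketch only; the back-transformation is an assumption.
y <- c(-50, 10, 200, 1500, 3000)  # hypothetical non-zero values, one negative
const <- 100                      # chosen so that y + const > 0

z <- log(y + const)               # transformation applied before fitting
y_back <- exp(z) - const          # assumed inverse applied to predictions
```

Without the constant, `log(y)` would be undefined for the negative value, which is why a constant is needed when the linear model is fitted on the log scale.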

If zeros is TRUE, log is TRUE, const is NULL and there are negative values, a multinomial log-linear model is used to predict negative, zero and positive observations. Categories for the negative values are thereby defined by breaks. In the second step, a linear model is used to predict the positive values and negative values are drawn from uniform distributions in the respective classes.

If zeros is FALSE, log is TRUE and const is NULL, a two-step model is used if there are non-positive values in the variable specified by additional. Whether a log-linear or a multinomial log-linear model is used depends on the number of categories to be used for the non-positive values, as defined by breaks. Again, positive values are then predicted with a linear model and non-positive values are drawn from uniform distributions.

The number of CPUs is selected automatically as follows. By default, the number of CPUs used equals the number of strata. However, if fewer CPUs are available than there are strata, the number of available CPUs minus one is used. This should be the best strategy in most cases, but the user can override it via nr_cpus.
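The selection rule can be written down as a small helper. `choose_cpus` is a hypothetical function used only for illustration, not part of simPop, and the fallback to a single CPU when the core count is unknown or would drop to zero is an assumption.

```r
# Illustrative sketch only; `choose_cpus` is a hypothetical helper.
choose_cpus <- function(n_strata, n_available = parallel::detectCores()) {
  # detectCores() may return NA; fall back to one CPU (assumption).
  if (is.na(n_available)) return(1L)
  if (n_available < n_strata) {
    # Fewer CPUs than strata: leave one CPU free, but use at least one.
    as.integer(max(n_available - 1, 1))
  } else {
    # Enough CPUs: one per stratum.
    as.integer(n_strata)
  }
}

choose_cpus(n_strata = 9, n_available = 4)  # 3
choose_cpus(n_strata = 3, n_available = 8)  # 3
```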

Value

An object of class simPopObj containing survey data as well as the simulated population data, including the continuous variable specified by additional and possibly simulated categories for the desired continuous variable.

Note

The basic household structure and any other categorical predictors need to be simulated beforehand with the functions simStructure and simCategorical, respectively.

Author(s)

Bernhard Meindl, Andreas Alfons, Alexander Kowarik (based on code by Stefan Kraft), Siro Fritzmann

References

B. Meindl, M. Templ, A. Kowarik, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Software, 79 (10), 1–38. doi:10.18637/jss.v079.i10

A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1080/02664763.2013.859237

Examples


library(simPop)
data(eusilcS)
## Not run: 
## approx. 20 seconds computation time
inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090")
simPop <- simStructure(data=inp, method="direct",
  basicHHvars=c("age", "rb090", "hsize", "pl030", "pb220a"))

regModel <- ~ rb090 + hsize + pl030 + pb220a

# multinomial model with random draws
eusilcM <- simContinuous(simPop, additional="netIncome",
              regModel = regModel,
              upper=200000, equidist=FALSE, nr_cpus=1)
class(eusilcM)

# two-step regression
eusilcT <- simContinuous(simPop, additional="netIncome",
              regModel = "basic",
              method = "lm", nr_cpus=1)
class(eusilcT)

## End(Not run)

statistikat/simPop documentation built on April 13, 2025, 12:59 a.m.