train {fusionModel}    R Documentation
Description

Train a fusion model on "donor" data using sequential LightGBM models to estimate the conditional distributions. The resulting fusion model (.fsn file) can be used with fuse() to simulate outcomes for a "recipient" dataset.
Usage

train(
  data,
  y,
  x,
  fsn = "fusion_model.fsn",
  weight = NULL,
  nfolds = 5,
  nquantiles = 2,
  nclusters = 2000,
  krange = c(10, 500),
  hyper = NULL,
  fork = FALSE,
  cores = 1
)
Arguments

data: Data frame. Donor dataset. Categorical variables must be factors and ordered whenever possible.

y: Character or list. Variables in data to fuse to the recipient. If a list, each slot indicates either a single variable or multiple variables to fuse as a block (see Details and Examples).

x: Character or list. Predictor variables in data common to the donor and the eventual recipient. If a list, each slot specifies the predictor variables to use for the corresponding y variable (see Examples).

fsn: Character. File path where the fusion model will be saved. Must use the .fsn file extension.

weight: Character. Name of the observation weights column in data.

nfolds: Numeric. Number of cross-validation folds used for LightGBM model training. Or, if ...

nquantiles: Numeric. Number of quantile models to train for continuous y variables.

nclusters: Numeric. Maximum number of k-means clusters to use. Higher is better but at greater computational cost.

krange: Numeric. Minimum and maximum number of nearest neighbors to use for construction of continuous conditional distributions. Higher ...

hyper: List. LightGBM hyperparameters to be used during model training. If NULL (the default), a single set of default values is used (see Details).

fork: Logical. Should parallel processing via forking be used, if possible? See Details.

cores: Integer. Number of physical CPU cores used for parallel computation. When ...
Details

When y is a list, each slot indicates either a single variable or, alternatively, multiple variables to fuse as a block. Variables within a block are sampled jointly from the original donor data during fusion. See Examples.

y variables that exhibit no variance, or continuous y variables with fewer than 10 * nfolds non-zero observations (the minimum required for cross-validation), are automatically removed with a warning.
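As an informal illustration of the automatic-removal behavior (a sketch, not part of the package documentation): it uses the recs example data from the Examples below, and constant_var is a made-up column added only to trigger the warning.

# Sketch: a zero-variance fusion variable should be dropped automatically with a warning.
# 'constant_var' is a hypothetical column created only for this illustration.
recs2 <- recs
recs2$constant_var <- 1   # no variance across observations
train(data = recs2,
      y = c("electricity", "constant_var"),
      x = names(recs)[2:12])
# Expect a warning that 'constant_var' was removed prior to model training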
The fusion model written to fsn is a zipped archive created by zip() containing the models and data required by fuse().
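For illustration, the saved .fsn file can be inspected like any zip archive, and its path is the input to fuse(). This is a sketch: 'recipient' stands in for a hypothetical recipient dataset sharing the predictor variables, and the fuse() argument names shown here are assumptions; consult ?fuse for the actual interface.

# Sketch: the .fsn file is an ordinary zip archive and the input to fuse().
fsn.path <- train(data = recs, y = c("electricity", "natural_gas", "aircon"),
                  x = names(recs)[2:12])
unzip(fsn.path, list = TRUE)   # base R; lists the files bundled in the archive
# 'recipient' is a hypothetical dataset with the same predictor variables;
# the argument names for fuse() are assumed -- see ?fuse
sim <- fuse(data = recipient, fsn = fsn.path)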
The hyper argument can be used to specify the LightGBM hyperparameter values over which to perform a "grid search" during model training. See the LightGBM parameters documentation for the full list of available parameters. For each combination of hyperparameters, nfolds-fold cross-validation is performed using lgb.cv with an early stopping condition. The parameter combination with the lowest loss function value is used to fit the final model via lgb.train. The more candidate parameter values specified in hyper, the longer the processing time. If hyper = NULL, a single set of parameters is used with the following default values:
boosting = "gbdt"
data_sample_strategy = "goss"
num_leaves = 31
feature_fraction = 0.8
max_depth = 5
min_data_in_leaf = max(10, round(0.001 * nrow(data)))
num_iterations = 2500
learning_rate = 0.1
max_bin = 255
min_data_in_bin = 3
max_cat_threshold = 32
Typical users will only have reason to modify the hyperparameters listed above. Note that num_iterations only imposes a ceiling, since early stopping will typically result in models with fewer iterations. See Examples.
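As a starting point for tuning, the defaults listed above can be passed explicitly through hyper and then modified one value at a time. This is a sketch only, assuming the recs, fusion.vars, and predictor.vars objects defined in the Examples below.

# Sketch: the documented defaults, written out so individual values can be tweaked
default.hyper <- list(
  boosting = "gbdt",
  data_sample_strategy = "goss",
  num_leaves = 31,
  feature_fraction = 0.8,
  max_depth = 5,
  min_data_in_leaf = max(10, round(0.001 * nrow(recs))),
  num_iterations = 2500,
  learning_rate = 0.1,
  max_bin = 255,
  min_data_in_bin = 3,
  max_cat_threshold = 32
)
train(data = recs, y = fusion.vars, x = predictor.vars, hyper = default.hyper)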
Testing with small-to-medium size datasets suggests that forking is typically faster than OpenMP multithreading (the default). However, forking will sometimes "hang" (continue to run with no CPU usage or error message) if an OpenMP process has been used previously in the same session. The issue appears to be related to Intel's OpenMP implementation. It can be triggered when other operations that use data.table or fst in multithreaded mode are called before train(). If you experience hung forking, try calling data.table::setDTthreads(1) and fst::threads_fst(1) immediately after library(fusionModel) in a new session.
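In practice, that session setup looks like the following sketch:

# Sketch: start a fresh session, then limit data.table and fst to one thread
library(fusionModel)
data.table::setDTthreads(1)   # disable data.table multithreading
fst::threads_fst(1)           # disable fst multithreading
# ...then call train(..., fork = TRUE) as usual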
Value

A fusion model object (.fsn file) is saved to fsn.
Examples

# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to the working directory
library(fusionModel)
?recs
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)
# When 'y' is a list, it can specify variables to fuse as a block
fusion.vars <- list("electricity", "natural_gas", c("heating_share", "cooling_share", "other_share"))
fusion.vars
train(data = recs, y = fusion.vars, x = predictor.vars)
# When 'x' is a list, it specifies which predictor variables to use for each 'y'
xlist <- list(predictor.vars[1:4], predictor.vars[2:8], predictor.vars)
xlist
train(data = recs, y = fusion.vars, x = xlist)
# Specify a single set of LightGBM hyperparameters
# Here we use Random Forests instead of the default Gradient Boosting Decision Trees
train(data = recs, y = fusion.vars, x = predictor.vars,
hyper = list(boosting = "rf",
feature_fraction = 0.6,
max_depth = 10
))
# Specify a range of LightGBM hyperparameters to search over
# This takes longer, because there are more models to test
train(data = recs, y = fusion.vars, x = predictor.vars,
hyper = list(max_depth = c(5, 10),
feature_fraction = c(0.7, 0.9)
))