train {fusionModel}    R Documentation
Description

Train a fusion model on "donor" data using sequential LightGBM models to estimate the conditional distributions. The resulting fusion model (.fsn file) can be used with fuse() to simulate outcomes for a "recipient" dataset.
Usage

train(
  data,
  y,
  x,
  fsn = "fusion_model.fsn",
  weight = NULL,
  nfolds = 5,
  nquantiles = 2,
  nclusters = 2000,
  krange = c(10, 500),
  hyper = NULL,
  fork = FALSE,
  cores = 1
)
Arguments

data: Data frame. Donor dataset. Categorical variables must be factors and ordered whenever possible.

y: Character or list. Variables in data to fuse to the recipient. If a list, each slot indicates either a single variable or multiple variables to fuse as a block (see Details and Examples).

x: Character or list. Predictor variables in data common to the donor and the eventual recipient. If a list, each slot specifies the predictor variables to use for the corresponding y variable (see Examples).

fsn: Character. File path where the fusion model will be saved. Must use the .fsn file extension.

weight: Character. Name of the observation weights column in data.

nfolds: Numeric. Number of cross-validation folds used for LightGBM model training. Or, if ...

nquantiles: Numeric. Number of quantile models to train for continuous y variables.

nclusters: Numeric. Maximum number of k-means clusters to use. Higher is better but at greater computational cost.

krange: Numeric. Minimum and maximum number of nearest neighbors to use for construction of continuous conditional distributions. Higher ...

hyper: List. LightGBM hyperparameters to be used during model training. If NULL (the default), a single set of default values is used (see Details).

fork: Logical. Should parallel processing via forking be used, if possible? See Details.

cores: Integer. Number of physical CPU cores used for parallel computation. When ...
Details

When y is a list, each slot indicates either a single variable or, alternatively, multiple variables to fuse as a block. Variables within a block are sampled jointly from the original donor data during fusion. See Examples.

y variables that exhibit no variance, or continuous y variables with fewer than 10 * nfolds non-zero observations (the minimum required for cross-validation), are automatically removed with a warning.
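As an informal illustration of the automatic-removal behavior (a sketch, not part of the package documentation): it uses the recs example data from the Examples below, and constant_var is a made-up column added only to trigger the warning.

# Sketch: a zero-variance fusion variable should be dropped automatically with a warning.
# 'constant_var' is a hypothetical column created only for this illustration.
recs2 <- recs
recs2$constant_var <- 1   # no variance across observations
train(data = recs2,
      y = c("electricity", "constant_var"),
      x = names(recs)[2:12])
# Expect a warning that 'constant_var' was removed prior to model training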
The fusion model written to fsn is a zipped archive created by zip() containing the models and data required by fuse().
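For illustration, the saved .fsn file can be inspected like any zip archive, and its path is the input to fuse(). This is a sketch: 'recipient' stands in for a hypothetical recipient dataset sharing the predictor variables, and the fuse() argument names shown here are assumptions; consult ?fuse for the actual interface.

# Sketch: the .fsn file is an ordinary zip archive and the input to fuse().
fsn.path <- train(data = recs, y = c("electricity", "natural_gas", "aircon"),
                  x = names(recs)[2:12])
unzip(fsn.path, list = TRUE)   # base R; lists the files bundled in the archive
# 'recipient' is a hypothetical dataset with the same predictor variables;
# the argument names for fuse() are assumed -- see ?fuse
sim <- fuse(data = recipient, fsn = fsn.path)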
The hyper argument can be used to specify the LightGBM hyperparameter values over which to perform a "grid search" during model training. See the LightGBM parameters documentation for the full list of available parameters. For each combination of hyperparameters, nfolds-fold cross-validation is performed using lgb.cv with an early stopping condition. The parameter combination with the lowest loss function value is used to fit the final model via lgb.train. The more candidate parameter values specified in hyper, the longer the processing time. If hyper = NULL, a single set of parameters is used with the following default values:
boosting = "gbdt"
data_sample_strategy = "goss"
num_leaves = 31
feature_fraction = 0.8
max_depth = 5
min_data_in_leaf = max(10, round(0.001 * nrow(data)))
num_iterations = 2500
learning_rate = 0.1
max_bin = 255
min_data_in_bin = 3
max_cat_threshold = 32
Typical users will only have reason to modify the hyperparameters listed above. Note that num_iterations only imposes a ceiling, since early stopping will typically result in models with fewer iterations. See Examples.
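As a starting point for tuning, the defaults listed above can be passed explicitly through hyper and then modified one value at a time. This is a sketch only, assuming the recs, fusion.vars, and predictor.vars objects defined in the Examples below.

# Sketch: the documented defaults, written out so individual values can be tweaked
default.hyper <- list(
  boosting = "gbdt",
  data_sample_strategy = "goss",
  num_leaves = 31,
  feature_fraction = 0.8,
  max_depth = 5,
  min_data_in_leaf = max(10, round(0.001 * nrow(recs))),
  num_iterations = 2500,
  learning_rate = 0.1,
  max_bin = 255,
  min_data_in_bin = 3,
  max_cat_threshold = 32
)
train(data = recs, y = fusion.vars, x = predictor.vars, hyper = default.hyper)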
Testing with small-to-medium size datasets suggests that forking is typically faster than OpenMP multithreading (the default). However, forking will sometimes "hang" (continue to run with no CPU usage or error message) if an OpenMP process has been used previously in the same session. The issue appears to be related to Intel's OpenMP implementation. It can be triggered when other operations that use data.table or fst in multithreaded mode are called before train(). If you experience hung forking, try calling data.table::setDTthreads(1) and fst::threads_fst(1) immediately after library(fusionModel) in a new session.
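In practice, that session setup looks like the following sketch:

# Sketch: start a fresh session, then limit data.table and fst to one thread
library(fusionModel)
data.table::setDTthreads(1)   # disable data.table multithreading
fst::threads_fst(1)           # disable fst multithreading
# ...then call train(..., fork = TRUE) as usual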
Value

A fusion model object (.fsn file) is saved to fsn.
Examples

# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to the working directory
library(fusionModel)
?recs
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)
# When 'y' is a list, it can specify variables to fuse as a block
fusion.vars <- list("electricity", "natural_gas", c("heating_share", "cooling_share", "other_share"))
fusion.vars
train(data = recs, y = fusion.vars, x = predictor.vars)
# When 'x' is a list, it specifies which predictor variables to use for each 'y'
xlist <- list(predictor.vars[1:4], predictor.vars[2:8], predictor.vars)
xlist
train(data = recs, y = fusion.vars, x = xlist)
# Specify a single set of LightGBM hyperparameters
# Here we use Random Forests instead of the default Gradient Boosting Decision Trees
train(data = recs, y = fusion.vars, x = predictor.vars,
hyper = list(boosting = "rf",
feature_fraction = 0.6,
max_depth = 10
))
# Specify a range of LightGBM hyperparameters to search over
# This takes longer, because there are more models to test
train(data = recs, y = fusion.vars, x = predictor.vars,
hyper = list(max_depth = c(5, 10),
feature_fraction = c(0.7, 0.9)
))