impute.learn.rfsrc: Learn a predictive imputer for test-time imputation

impute.learn.rfsrc    R Documentation

Learn a predictive imputer for test-time imputation

Description

Learns a predictive imputer from training data for later use on new data.

If the training data contain missing values, the function first imputes them using impute. It then fits one saved full-sweep learner per selected target on the completed training data and reuses those learners later to update missing values in new data without refitting on the test set.

If the training data are complete and target.mode = "all", the initial training-data imputation step is skipped and the full-sweep learners are fit directly from the complete training data.

Usage

impute.learn.rfsrc(formula, data,
  ntree = 100, nodesize = 1, nsplit = 10,
  nimpute = 2, fast = FALSE, blocks,
  mf.q, max.iter = 10, eps = 0.01,
  ytry = NULL, always.use = NULL, verbose = TRUE,
  ...,
  full.sweep.options = list(ntree = 100, nsplit = 10),
  target.mode = c("missing.only", "all"),
  deployment.xvars = NULL,
  anonymous = TRUE,
  learner.prefix = "impute.learner.",
  learner.root = "learners",
  out.dir = NULL,
  wipe = TRUE,
  keep.models = is.null(out.dir),
  keep.ximp = FALSE,
  save.on.fit = !is.null(out.dir))

save.impute.learn.rfsrc(object, path, wipe = TRUE, verbose = TRUE)

load.impute.learn.rfsrc(path, targets = NULL, lazy = TRUE, verbose = TRUE)

## S3 method for class 'impute.learn.rfsrc'
predict(object, newdata,
  max.predict.iter = 3L,
  eps = 1e-3,
  targets = NULL,
  restore.integer = TRUE,
  cache.learners = c("session", "none", "all"),
  verbose = TRUE,
  ...)

Arguments

formula

A symbolic model description. Can be omitted. The same interpretation as in impute is used for the initial training-data imputation stage. The saved full-sweep learner bank is controlled by deployment.xvars, not by formula.

data

Training data. Variables that are not real-valued are coerced to factors before fitting when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are dropped before the training schema is recorded.

ntree, nodesize, nsplit, nimpute, fast, blocks, max.iter, ytry, always.use, verbose

Arguments passed to impute for the initial training-data imputation. The argument full.sweep is controlled internally and should not be supplied.

mf.q

Controls the imputation engine used by impute. If mf.q = 1, training uses standard missForest. If mf.q > 1, training uses the multivariate missForest generalization. If mf.q is omitted, the training imputation follows the default behavior of impute.

eps

Convergence threshold. In impute.learn this controls the initial training-data imputation. In predict.impute.learn it controls early stopping for the prediction-time sweep.

...

For impute.learn, additional arguments passed to impute. For predict.impute.learn, additional arguments are currently ignored.

full.sweep.options

A list of options used when fitting the full sweep after the training data have been imputed. Recognized entries include ntree, nodesize, nsplit, mtry, splitrule, bootstrap, sampsize, samptype, perf.type, rfq, save.memory, importance, and proximity. Unknown entries are ignored with a warning.

target.mode

Determines which variables receive a saved full-sweep learner. The default "missing.only" saves learners only for variables that were missing in the training data. The option "all" saves a learner for every variable. If the training data are complete, target.mode = "all" must be used.

deployment.xvars

Controls which predictors are assumed to be available later when the saved imputer is used on new data. If NULL, all columns except the target are used. If a character vector, the same predictor set is used for all targets. If a named list, each target can have its own predictor set. By default, all non-target columns are eligible predictors, so users should exclude outcomes, future information, identifiers, or any variables that will not be available at deployment time.

anonymous

If TRUE, uses rfsrc.anonymous when fitting the full sweep. This usually reduces the size of the saved object.

learner.prefix, learner.root

Names used when writing saved full-sweep learners to disk.

out.dir

Optional output directory. If supplied and save.on.fit = TRUE, the manifest and the saved full-sweep learners are written to this directory during fitting. This requires the fst package because learners are serialized with fast.save.

wipe

If TRUE, removes an existing output directory before writing a new one.

keep.models

If TRUE, keeps the fitted full-sweep learners in memory in the returned object. At least one storage mode must be enabled: either keep.models = TRUE or out.dir with save.on.fit = TRUE.

keep.ximp

If TRUE, keeps the completed training data in the returned object. This is not required for later prediction.

save.on.fit

If TRUE and out.dir is supplied, writes the imputer to disk during fitting.

object

An object returned by impute.learn or load.impute.learn.

path

Directory containing a saved imputer. Save and load operations require the fst package because learners are read and written with fast.save and fast.load.

targets

Optional subset of target variables to load or to update during prediction. Unknown names are ignored with a warning.

lazy

If TRUE, saved learners are loaded only when they are needed. If FALSE, all saved learners are loaded at once.

newdata

New data to be imputed. Missing columns are added and extra columns are dropped to match the training schema. Unseen factor levels are converted to NA and then treated as missing values during initialization and imputation.

max.predict.iter

Maximum number of full-sweep passes applied to newdata.

restore.integer

If TRUE, integer columns in the returned data are rounded and restored as integers. Factor columns are always conformed back to the training schema. The package operates on real-valued and factor variables; inputs that are not real-valued are coerced to factors during preprocessing when possible, otherwise an error is raised.

cache.learners

How saved learners are reused during prediction. The default "session" loads each needed learner once per call to predict. The option "none" reloads a learner every time it is needed. The option "all" loads all saved learners before prediction starts.
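
The rounding step described for restore.integer can be illustrated in base R. This is a minimal sketch only: restore_integer_cols is a hypothetical helper, not a package function, and it assumes the names of the integer training columns are already known.

```r
## Hypothetical helper (not part of randomForestSRC): round imputed values
## and restore integer storage for columns that were integer in training.
restore_integer_cols <- function(imputed, integer.cols) {
  for (nm in intersect(integer.cols, names(imputed))) {
    imputed[[nm]] <- as.integer(round(imputed[[nm]]))
  }
  imputed
}

## "count" was integer in training but came back as a double after imputation
imputed <- data.frame(count = c(3, 4.6, 7), score = c(0.25, 1.5, 2.75))
out <- restore_integer_cols(imputed, integer.cols = "count")
str(out)
```

Columns not listed in integer.cols, such as score here, are left as doubles.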

Details

This function fits a predictive imputer in two stages.

The training data are first normalized to a data frame. Variables that are not real-valued are coerced to factors when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are removed before the training schema is stored.

If the resulting training data contain missing values, the first stage uses impute to complete the training data. The imputation engine is chosen in exactly the same way as for impute itself. In particular, mf.q = 1 gives standard missForest, mf.q > 1 gives the multivariate missForest generalization, and if mf.q is omitted the default impute behavior is used. If the training data are already complete and target.mode = "all", this initial imputation step is skipped.

In the second stage, a full sweep is fit on the completed training data. For each target selected by target.mode, rows where that target was observed are used to fit a forest with that target on the left-hand side and the selected deployment predictors on the right-hand side. The saved learner bank therefore depends on deployment.xvars. The formula argument affects the initial training-data imputation step, but it does not define the saved predictor bank for the later test-time sweep.
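
The row and predictor selection described above can be sketched in base R. This is an illustration under stated assumptions: learner_specs is a hypothetical helper, not part of the package, and it only builds the formula and row index for each target without fitting any forest.

```r
## Hypothetical helper (not part of randomForestSRC): for each target,
## record the rows where the target was observed in the original data
## and the formula "target ~ deployment predictors".
learner_specs <- function(original, targets, xvars = NULL) {
  specs <- lapply(targets, function(target) {
    ## xvars = NULL mirrors deployment.xvars = NULL: all non-target columns
    preds <- if (is.null(xvars)) names(original) else xvars
    preds <- setdiff(preds, target)
    list(rows = which(!is.na(original[[target]])),
         formula = reformulate(preds, response = target))
  })
  names(specs) <- targets
  specs
}

aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
specs <- learner_specs(aq, targets = c("Ozone", "Solar.R"))
specs$Ozone$formula          # Ozone ~ Solar.R + Wind + Temp
length(specs$Ozone$rows)     # 116 rows with Ozone observed
```

In the actual second stage, the forest for each target would be fit on the completed training data restricted to those rows.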

By default, deployment.xvars = NULL allows every non-target column to be used as a predictor. This is convenient, but it can also introduce leakage if the training data include outcomes, future-only variables, identifiers, or any fields that will not be available when the learned imputer is applied to new data. Restrict deployment.xvars when that is a concern.
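
As a rough sketch of how a restricted predictor set might be assembled, the base R code below builds both the character-vector form and the named-list form. The column names are invented for illustration, and dropping the target from its own predictor set in the list form is an assumption made here for clarity.

```r
## Hypothetical training columns; "id" and "outcome" stand in for fields
## that should not be used as predictors at deployment time.
train.cols <- c("id", "age", "bmi", "glucose", "outcome")
exclude <- c("id", "outcome")

## Form 1: one shared predictor set (character vector)
shared.xvars <- setdiff(train.cols, exclude)

## Form 2: per-target predictor sets (named list, one entry per target)
targets <- c("bmi", "glucose")
per.target.xvars <- lapply(targets, function(target) setdiff(shared.xvars, target))
names(per.target.xvars) <- targets
per.target.xvars

## Either object could then be supplied as deployment.xvars, for example:
## fit <- impute.learn(data = train, deployment.xvars = shared.xvars)
```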

When the imputer is saved to disk, each full-sweep learner is written separately using fast.save. Loading uses fast.load. In practice this gives a small manifest plus a directory of saved learners. The fst package is therefore required for save and load operations. The explicit save method can write learners either from memory or by reloading them from an attached saved path.
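
How the storage defaults in the Usage section interact can be checked with a small base R sketch. The function resolve_storage is hypothetical and only mirrors the documented defaults (keep.models = is.null(out.dir), save.on.fit = !is.null(out.dir)) and the rule that at least one storage mode must be enabled; the package's own validation may differ.

```r
## Hypothetical helper mirroring the documented defaults; it performs no
## fitting and writes nothing to disk.
resolve_storage <- function(out.dir = NULL,
                            keep.models = is.null(out.dir),
                            save.on.fit = !is.null(out.dir)) {
  on.disk <- !is.null(out.dir) && save.on.fit
  if (!keep.models && !on.disk) {
    stop("enable keep.models = TRUE, or supply out.dir with save.on.fit = TRUE")
  }
  list(in.memory = keep.models, on.disk = on.disk)
}

## No out.dir: learners are kept in memory only
mem <- resolve_storage()

## out.dir supplied: learners are written during fitting, not kept in memory
disk <- resolve_storage(out.dir = file.path(tempdir(), "aq.imputer"))

## out.dir supplied but save.on.fit = FALSE and no keep.models: invalid
bad <- tryCatch(resolve_storage(out.dir = "x", save.on.fit = FALSE),
                error = function(e) e)
```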

Prediction starts by matching newdata to the training schema, filling missing values with training means or modes, and then applying one or more full-sweep passes. Only the targets selected by target.mode are updated by saved learners.
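
The schema matching and mean/mode initialization described above can be sketched in base R. Both helpers below are hypothetical and only illustrate the documented behavior (missing columns added, extra columns dropped, unseen factor levels set to NA, then NA filled from training); the internal implementation may differ.

```r
## Hypothetical helper: per training column, record the factor levels and
## the fill value (mean for numeric columns, mode for factors).
learn_schema <- function(train) {
  lapply(train, function(x) {
    if (is.factor(x)) {
      list(levels = levels(x), fill = names(which.max(table(x))))
    } else {
      list(levels = NULL, fill = mean(x, na.rm = TRUE))
    }
  })
}

## Hypothetical helper: match newdata to the schema, then fill every NA.
conform_and_init <- function(newdata, schema) {
  out <- lapply(names(schema), function(nm) {
    s <- schema[[nm]]
    ## a column absent from newdata is added as all-missing
    x <- if (nm %in% names(newdata)) newdata[[nm]] else rep(NA, nrow(newdata))
    if (!is.null(s$levels)) {
      ## values outside the training levels become NA
      x <- factor(as.character(x), levels = s$levels)
    } else {
      x <- as.numeric(x)
    }
    x[is.na(x)] <- s$fill
    x
  })
  names(out) <- names(schema)
  as.data.frame(out)
}

train <- data.frame(x = c(1, 2, 3, NA), g = factor(c("a", "a", "b", NA)))
new <- data.frame(g = c("b", "zzz", NA), extra = 1:3, stringsAsFactors = FALSE)
init <- conform_and_init(new, learn_schema(train))
init
```

Here x is absent from new and is filled with the training mean, the unseen level "zzz" is treated as missing and filled with the training mode, and extra is dropped. The full-sweep passes would then start from this initialized data frame.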

If target.mode = "missing.only", a variable that was complete in training but missing in new data is initialized from the training fit but does not receive a model-based update. Use target.mode = "all" if missing values may appear later in any variable. Complete training data also require target.mode = "all", because otherwise there are no missing variables from which to determine the saved targets.
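
The effect of target.mode on which variables receive a learner can be previewed before fitting. The sketch below is base R only; select_targets is a hypothetical helper that applies the rule described above and is not a package function.

```r
## Hypothetical helper: which variables would receive a saved learner?
select_targets <- function(train, target.mode = c("missing.only", "all")) {
  target.mode <- match.arg(target.mode)
  if (target.mode == "all") return(names(train))
  targets <- names(train)[colSums(is.na(train)) > 0]
  if (length(targets) == 0) {
    stop("training data are complete: use target.mode = \"all\"")
  }
  targets
}

aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
select_targets(aq)           # "Ozone" "Solar.R"
select_targets(aq, "all")    # all five column names
```

With "missing.only", a missing Wind or Temp value in new data would only be initialized, since neither variable has missing values in this training set.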

Value

impute.learn returns an object of class c("impute.learn.rfsrc", "impute.learn"). The object contains a manifest, optionally the fitted full-sweep learners, optionally the completed training data, and optionally a path to the saved imputer on disk.

load.impute.learn returns an object of the same class.

predict.impute.learn returns a data frame with imputed values overlaid. An attribute named "impute.learn.info" contains prediction-time diagnostics such as the number of sweep passes, pass-difference history, caching mode, disk-load counts, schema harmonization details, and any targets skipped because a learner was unavailable or a prediction failed.
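
The diagnostics attribute can be read with base R's attr. The accessor below is a hypothetical convenience wrapper, and the commented lines assume test.imp is a data frame returned by predict as in the Examples; the entries of the attribute are not listed here, so str is used to inspect them.

```r
## Hypothetical accessor: return the diagnostics attribute, or NULL if absent.
impute_info <- function(x) attr(x, "impute.learn.info", exact = TRUE)

## A plain data frame carries no diagnostics
no.info <- impute_info(data.frame(a = 1:3))
is.null(no.info)

## With a result from predict(), inspect the top-level entries:
## info <- impute_info(test.imp)
## str(info, max.level = 1)
```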

Author(s)

Hemant Ishwaran and Udaya B. Kogalur

References

Stekhoven D.J. and Bühlmann P. (2012). MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118.

Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363–377.

See Also

impute.rfsrc, rfsrc, and predict.rfsrc.

Examples


## ------------------------------------------------------------
## small data example: uses missForest for impute engine
## ------------------------------------------------------------

set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
aq$Month <- factor(aq$Month)

id <- sample(1:nrow(aq), 100)
train <- aq[id, ]
test <- aq[-id, ]

fit <- impute.learn(
  data = train,
  ntree = 25,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 25, nsplit = 5)
)

test.imp <- predict(fit, test, max.predict.iter = 2, verbose = FALSE)
head(test.imp)


## Not run: 
## ------------------------------------------------------------
## Save the learned imputer to disk and load it later.
## This explicit save example writes learners kept in memory.
## Uses missForest for the impute engine.
## ------------------------------------------------------------

bundle.dir <- file.path(tempdir(), "aq.imputer")

fit <- impute.learn(
  data = train,
  ntree = 25,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 25, nsplit = 5),
  keep.models = TRUE,
  verbose = FALSE
)

save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)

unlink(bundle.dir, recursive = TRUE)



## ------------------------------------------------------------
## Challenging example with factors, uses save/reload
## ------------------------------------------------------------

## load pbc, convert everything to factors
data(pbc, package = "randomForestSRC")
dta <- data.frame(lapply(pbc, factor))
dta$days <- pbc$days
dta$status <- pbc$status

## split the data into unbalanced train/test data (25/75)
## the train/test data have the same levels, but different labels
idx <- sample(1:nrow(dta), round(nrow(dta) * .25))
train <- dta[idx,]
test <- dta[-idx,]

## even harder ... factor level not previously encountered in training
levels(test$stage) <- c(levels(test$stage), "fake")
test$stage[sample(seq_len(nrow(test)), 10)] <- "fake"

## learn the imputer from the training data
fit <- suppressWarnings(impute.learn(Surv(days, status) ~ ., train, keep.models = TRUE))

## save/reload
bundle.dir <- file.path(tempdir(), "pbc.imputer")
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
print(summary(test.imp))
unlink(bundle.dir, recursive = TRUE)

## End(Not run)

randomForestSRC documentation built on March 25, 2026, 5:08 p.m.