| impute.learn.rfsrc | R Documentation |
Learns a predictive imputer from training data for later use on new data.
If the training data contain missing values, the function first
imputes them using impute. It then fits one saved full-sweep
learner per selected target on the completed training data and reuses
those learners later to update missing values in new data without
refitting on the test set.
If the training data are complete and target.mode = "all",
the initial training-data imputation step is skipped and the
full-sweep learners are fit directly from the complete training data.
impute.learn.rfsrc(formula, data,
  ntree = 100, nodesize = 1, nsplit = 10,
  nimpute = 2, fast = FALSE, blocks,
  mf.q, max.iter = 10, eps = 0.01,
  ytry = NULL, always.use = NULL, verbose = TRUE,
  ...,
  full.sweep.options = list(ntree = 100, nsplit = 10),
  target.mode = c("missing.only", "all"),
  deployment.xvars = NULL,
  anonymous = TRUE,
  learner.prefix = "impute.learner.",
  learner.root = "learners",
  out.dir = NULL,
  wipe = TRUE,
  keep.models = is.null(out.dir),
  keep.ximp = FALSE,
  save.on.fit = !is.null(out.dir))
save.impute.learn.rfsrc(object, path, wipe = TRUE, verbose = TRUE)
load.impute.learn.rfsrc(path, targets = NULL, lazy = TRUE, verbose = TRUE)
## S3 method for class 'impute.learn.rfsrc'
predict(object, newdata,
  max.predict.iter = 3L,
  eps = 1e-3,
  targets = NULL,
  restore.integer = TRUE,
  cache.learners = c("session", "none", "all"),
  verbose = TRUE,
  ...)
formula |
A symbolic model description. Can be omitted. The same
interpretation as in |
data |
Training data. Variables that are not real-valued are coerced to factors before fitting when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are dropped before the training schema is recorded. |
ntree, nodesize, nsplit, nimpute, fast, blocks, max.iter, ytry, always.use, verbose |
Arguments passed to
|
mf.q |
Controls the imputation engine used by |
eps |
Convergence threshold. In |
... |
For |
full.sweep.options |
A |
target.mode |
Determines which variables receive a saved
full-sweep learner. The default |
deployment.xvars |
Controls which predictors are assumed to be
available later when the saved imputer is used on new data. If
|
anonymous |
If |
learner.prefix, learner.root |
Names used when writing saved full-sweep learners to disk. |
out.dir |
Optional output directory. If supplied and
|
wipe |
If |
keep.models |
If |
keep.ximp |
If |
save.on.fit |
If |
object |
An object returned by |
path |
Directory containing a saved imputer. Save and load
operations require the fst package because learners are read
and written with |
targets |
Optional subset of target variables to load or to update during prediction. Unknown names are ignored with a warning. |
lazy |
If |
newdata |
New data to be imputed. Missing columns are added and
extra columns are dropped to match the training schema. Unseen
factor levels are converted to |
max.predict.iter |
Maximum number of full-sweep passes applied to
|
restore.integer |
If |
cache.learners |
How saved learners are reused during
prediction. The default |
This function fits a predictive imputer in two stages.
The training data are first normalized to a data frame. Variables that are not real-valued are coerced to factors when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are removed before the training schema is stored.
If the resulting training data contain missing values, the first
stage uses impute to complete the training data. The
imputation engine is chosen in exactly the same way as for
impute itself. In particular, mf.q = 1 gives standard
missForest, mf.q > 1 gives the multivariate
missForest generalization, and if mf.q is omitted the
default impute behavior is used. If the training data are
already complete and target.mode = "all", this initial
imputation step is skipped.
In the second stage, a full sweep is fit on the completed training
data. For each target selected by target.mode, rows where that
target was observed are used to fit a forest with that target on the
left-hand side and the selected deployment predictors on the
right-hand side. The saved learner bank therefore depends on
deployment.xvars. The formula argument affects the
initial training-data imputation step, but it does not define the
saved predictor bank for the later test-time sweep.
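The second stage can be pictured with a small base-R sketch. This is not the package's implementation: lm() stands in for the forest learner, a crude mean fill stands in for the first-stage imputation, and fit.sweep.bank() is a hypothetical helper introduced only for illustration. It shows one learner per target, fit on the rows where that target was originally observed, using the completed data for the predictors.

```r
## Illustrative sketch only: lm() replaces the forest, and
## fit.sweep.bank() is a hypothetical helper, not a package function.
fit.sweep.bank <- function(completed, was.observed, targets, xvars = NULL) {
  bank <- lapply(targets, function(tg) {
    ## predictors: every non-target column unless xvars restricts them
    preds <- setdiff(if (is.null(xvars)) names(completed) else xvars, tg)
    ## only rows where the target was originally observed are used
    rows <- was.observed[, tg]
    lm(reformulate(preds, response = tg),
       data = completed[rows, , drop = FALSE])
  })
  names(bank) <- targets
  bank
}

## toy training data: numeric columns of airquality
raw <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
was.observed <- !is.na(raw)

## crude mean fill standing in for the first-stage imputation
completed <- raw
for (v in names(completed)) {
  completed[[v]][is.na(completed[[v]])] <- mean(raw[[v]], na.rm = TRUE)
}

## "missing.only"-style targets: columns with missing training values
targets <- names(raw)[colSums(!was.observed) > 0]
bank <- fit.sweep.bank(completed, was.observed, targets)
names(bank)
```

Passing a restricted xvars vector to the helper mimics the role that deployment.xvars plays for the saved learner bank.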
By default, deployment.xvars = NULL allows every non-target
column to be used as a predictor. This is convenient, but it can also
introduce leakage if the training data include outcomes, future-only
variables, identifiers, or any fields that will not be available when
the learned imputer is applied to new data. Restrict
deployment.xvars when that is a concern.
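One way to build a restricted predictor set is to intersect the training columns with the columns expected at deployment time and then remove identifiers and outcomes. The sketch below uses base R only. The helper choose.deployment.xvars() and all column names are hypothetical, and it is an assumption here that deployment.xvars accepts a character vector of column names.

```r
## Hypothetical helper and column names, for illustration only.
choose.deployment.xvars <- function(train, deploy.cols, exclude = character(0)) {
  keep <- intersect(names(train), deploy.cols)  # available at deployment time
  setdiff(keep, exclude)                        # minus identifiers, outcomes, etc.
}

train <- data.frame(id      = 1:4,
                    age     = c(50, NA, 61, 47),
                    bmi     = c(22, 30, NA, 27),
                    outcome = c(0, 1, 1, 0))

## the outcome is not available for new data; id is available but uninformative
xv <- choose.deployment.xvars(train,
                              deploy.cols = c("id", "age", "bmi"),
                              exclude = "id")
xv

## xv could then be supplied as deployment.xvars, assuming a character
## vector of column names is accepted
```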
When the imputer is saved to disk, each full-sweep learner is written
separately using fast.save. Loading uses fast.load. In
practice this gives a small manifest plus a directory of saved
learners. The fst package is therefore required for save and
load operations. The explicit save method can write learners either
from memory or by reloading them from an attached saved path.
Prediction starts by matching newdata to the training schema and
initializing missing values with training means or modes. One or more
full-sweep passes are then applied, up to max.predict.iter. Only the
targets selected by target.mode are updated by saved learners.
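The schema-matching and initialization step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: harmonize.and.init() is a hypothetical helper, the means and modes are recomputed from a toy training frame rather than read from a saved manifest, and in this sketch unseen factor levels simply become NA before the fill. The full-sweep passes that follow are omitted.

```r
## Illustrative sketch only; harmonize.and.init() is a hypothetical helper.
harmonize.and.init <- function(newdata, train) {
  out <- newdata
  ## add columns that the training schema has but newdata lacks
  for (v in setdiff(names(train), names(out))) out[[v]] <- NA
  ## drop extra columns and restore the training column order
  out <- out[, names(train), drop = FALSE]
  for (v in names(train)) {
    if (is.factor(train[[v]])) {
      ## re-encode against training levels; unseen levels become NA here
      out[[v]] <- factor(as.character(out[[v]]), levels = levels(train[[v]]))
      fill <- names(which.max(table(train[[v]])))  # training mode
    } else {
      out[[v]] <- as.numeric(out[[v]])
      fill <- mean(train[[v]], na.rm = TRUE)       # training mean
    }
    out[[v]][is.na(out[[v]])] <- fill
  }
  out
}

train <- data.frame(x = c(1, 2, 3, NA),
                    g = factor(c("a", "b", "b", NA)))
new   <- data.frame(g = c("a", "z", NA), extra = 1:3)

init <- harmonize.and.init(new, train)
init
```

Here the missing column x is added and filled with the training mean, the extra column is dropped, and the unseen level "z" is replaced by the training mode.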
If target.mode = "missing.only", a variable that was complete
in training but missing in new data is initialized from the training
fit but does not receive a model-based update. Use
target.mode = "all" if missing values may appear later in any
variable. Complete training data also require
target.mode = "all", because otherwise no variable has missing
values and there is nothing from which to determine the saved targets.
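The selection rule described above can be written down as a short base-R sketch. select.targets() is a hypothetical helper, not a package function, and the error message is invented for illustration. It only mirrors the rule stated in the text: "missing.only" keeps the variables with missing training values, and "all" keeps every variable.

```r
## Illustrative sketch only; select.targets() is a hypothetical helper.
select.targets <- function(train, target.mode = c("missing.only", "all")) {
  target.mode <- match.arg(target.mode)
  has.na <- vapply(train, anyNA, logical(1))
  targets <- if (target.mode == "all") names(train) else names(train)[has.na]
  if (length(targets) == 0) {
    stop("no targets: complete training data need target.mode = \"all\"")
  }
  targets
}

## airquality has missing values in Ozone and Solar.R only
select.targets(airquality)
select.targets(airquality, "all")

## mtcars is complete, so "missing.only" leaves nothing to learn
res <- try(select.targets(mtcars), silent = TRUE)
inherits(res, "try-error")
```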
impute.learn returns an object of class
c("impute.learn.rfsrc", "impute.learn"). The object
contains a manifest, optionally the fitted full-sweep learners,
optionally the completed training data, and optionally a path to the
saved imputer on disk.
load.impute.learn returns an object of the same class.
predict.impute.learn returns a data frame with imputed values
overlaid. An attribute named "impute.learn.info" contains
prediction-time diagnostics such as the number of sweep passes,
pass-difference history, caching mode, disk-load counts, schema
harmonization details, and any targets skipped because a learner was
unavailable or a prediction failed.
Hemant Ishwaran and Udaya B. Kogalur
Stekhoven D.J. and Buhlmann P. (2012). MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118.
Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363–377.
impute.rfsrc, rfsrc, and predict.rfsrc.
## ------------------------------------------------------------
## small data example: uses missForest for impute engine
## ------------------------------------------------------------
set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
aq$Month <- factor(aq$Month)
id <- sample(1:nrow(aq), 100)
train <- aq[id, ]
test <- aq[-id, ]
fit <- impute.learn(
  data = train,
  ntree = 25,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 25, nsplit = 5)
)
test.imp <- predict(fit, test, max.predict.iter = 2, verbose = FALSE)
head(test.imp)
## Not run:
## ------------------------------------------------------------
## Save the learned imputer to disk and load it later.
## This explicit save example writes learners kept in memory.
## Uses missForest for the impute engine.
## ------------------------------------------------------------
bundle.dir <- file.path(tempdir(), "aq.imputer")
fit <- impute.learn(
  data = train,
  ntree = 25,
  mf.q = 1,
  max.iter = 5,
  full.sweep.options = list(ntree = 25, nsplit = 5),
  keep.models = TRUE,
  verbose = FALSE
)
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
unlink(bundle.dir, recursive = TRUE)
## ------------------------------------------------------------
## Challenging example with factors, uses save/reload
## ------------------------------------------------------------
## load pbc, convert everything to factors
data(pbc, package = "randomForestSRC")
dta <- data.frame(lapply(pbc, factor))
dta$days <- pbc$days
dta$status <- pbc$status
## split the data into unbalanced train/test data (25/75)
## the train/test data have the same levels, but different labels
idx <- sample(1:nrow(dta), round(nrow(dta) * .25))
train <- dta[idx,]
test <- dta[-idx,]
## even harder ... factor level not previously encountered in training
levels(test$stage) <- c(levels(test$stage), "fake")
test$stage[sample(seq_len(nrow(test)), 10)] <- "fake"
## train forest
fit <- suppressWarnings(impute.learn(Surv(days, status) ~ ., train, keep.models = TRUE))
## save/reload
bundle.dir <- file.path(tempdir(), "pbc.imputer")
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
print(summary(test.imp))
unlink(bundle.dir, recursive = TRUE)
## End(Not run)