View source: R/impute.learn.rfsrc.R
impute.learn.rfsrc (R Documentation)
Learns a predictive imputer from training data for later use on new data.
If the training data contain missing values, the function first
imputes them using impute. It then fits one saved full-sweep
learner per selected target on the completed training data and reuses
those learners later to update missing values in new data without
refitting on the test set.
The same saved learner bank can also be used to score new data for out-of-distribution (OOD) behavior. OOD scores remain available even when new data contain missing values. Each selected target is reconstructed from its saved conditional learner and compared with the observed value, and the target-wise discrepancies are calibrated against a training reference built from out-of-bag predictions computed during training.
If the training data are complete and target.mode = "all",
the initial training-data imputation step is skipped and the
full-sweep learners are fit directly from the complete training data.
impute.learn.rfsrc(formula, data,
ntree = 100, nodesize = 1, nsplit = 10,
nimpute = 2, fast = FALSE, blocks,
mf.q, max.iter = 10, eps = 0.01,
ytry = NULL, always.use = NULL, verbose = TRUE,
...,
full.sweep.options = list(ntree = 100, nsplit = 10),
target.mode = c("missing.only", "all"),
deployment.xvars = NULL,
anonymous = TRUE,
learner.prefix = "impute.learner.",
learner.root = "learners",
out.dir = NULL,
wipe = TRUE,
keep.models = is.null(out.dir),
keep.ximp = FALSE,
save.on.fit = !is.null(out.dir),
save.ood = TRUE,
weight = NULL)
save.impute.learn.rfsrc(object, path, wipe = TRUE, verbose = TRUE)
load.impute.learn.rfsrc(path, targets = NULL, lazy = TRUE, verbose = TRUE)
## S3 method for class 'impute.learn.rfsrc'
predict(object, newdata,
max.predict.iter = 3L,
eps = 1e-3,
targets = NULL,
restore.integer = TRUE,
cache.learners = c("session", "none", "all"),
verbose = TRUE,
...)
impute.ood.rfsrc(object, newdata,
targets = NULL,
max.predict.iter = 3L,
eps = 1e-3,
cache.learners = c("all", "session", "none"),
weight = NULL,
aggregate = c("bounded.product", "weighted.mean",
"weighted.lp", "weighted.lp.log", "top.k"),
aggregate.args = list(),
return.details = FALSE,
verbose = TRUE,
...)
formula
A symbolic model description. Can be omitted. Carries the same interpretation as in impute.
data
Training data. Variables that are not real-valued are coerced to factors before fitting when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are dropped before training begins.
ntree, nodesize, nsplit, nimpute, fast, blocks, max.iter, ytry, always.use, verbose
Arguments passed to impute for the initial training-data imputation.
mf.q
Controls the imputation engine used by impute: mf.q = 1 gives standard missForest, mf.q > 1 gives the multivariate missForest generalization, and if mf.q is omitted the default impute behavior is used.
eps
Convergence threshold. Iteration of a sweep stops when the change between successive passes falls below eps.
...
Further arguments passed to or from other methods.
full.sweep.options
A list of forest-fitting options (for example, ntree and nsplit) used when fitting the saved full-sweep learners.
target.mode
Determines which variables receive a saved full-sweep learner. The default "missing.only" fits learners only for variables with missing training values; "all" fits a learner for every variable.
deployment.xvars
Controls which predictors are assumed to be available later when the saved imputer is used on new data. If NULL (the default), every non-target column may be used as a predictor.
anonymous
If TRUE, saved learners are fit as anonymous forests that do not retain a copy of the training data.
learner.prefix, learner.root
Names used when writing saved full-sweep learners to disk.
out.dir
Optional output directory. If supplied and save.on.fit = TRUE, learners are written here as they are fit.
wipe
If TRUE, an existing output directory is removed before saving.
keep.models
If TRUE, the fitted full-sweep learners are retained in memory in the returned object; by default they are kept only when out.dir is not supplied.
keep.ximp
If TRUE, the completed (imputed) training data are retained in the returned object.
save.on.fit
If TRUE, each learner is written to out.dir as it is fit; by default this happens whenever out.dir is supplied.
save.ood
If TRUE, an OOD reference built from out-of-bag training predictions is stored in the manifest for later OOD scoring.
object
An object returned by impute.learn or load.impute.learn.
path
Directory containing a saved imputer. Save and load operations require the fst package because learners are read and written with fast.save and fast.load.
targets
Optional subset of target variables to load, update, or score. Unknown names are ignored with a warning. The default NULL uses all saved targets.
lazy
If TRUE, learners are not read into memory at load time but are loaded from disk on demand.
newdata
New data to be imputed or scored. Missing columns are added and extra columns are dropped to match the training schema. Unseen factor levels are converted to NA and tracked row-wise.
max.predict.iter
Maximum number of full-sweep passes applied to newdata.
restore.integer
If TRUE, variables that were integer-valued in training are rounded back to integers after imputation.
cache.learners
How saved learners are reused during prediction or OOD scoring: "session" caches loaded learners for reuse within the session, "none" reloads them from disk on each use, and "all" loads every requested learner up front.
weight
Optional nonnegative target weights used for row-level OOD aggregation. In impute.learn the weights are stored in the manifest as the default OOD weighting scheme; in impute.ood they override the saved weights at scoring time.
aggregate
Row-level aggregation metric used by impute.ood to combine calibrated target-wise scores; see Details.
aggregate.args
Optional list of tuning arguments for the chosen aggregate (for example, p for the weighted L_p rules).
return.details
If TRUE, matrices of target-wise calibrated scores and raw reconstruction discrepancies are also returned.
A predictive imputer is calculated in two stages.
The training data are first normalized to a data frame. Variables that are not real-valued are coerced to factors when possible; otherwise fitting stops with an error. Rows and columns that are entirely missing are removed before the training schema is stored.
If the resulting training data contain missing values, the first
stage uses impute to complete the training data. The
imputation engine is chosen in exactly the same way as for
impute itself. In particular, mf.q = 1 gives standard
missForest, mf.q > 1 gives the multivariate
missForest generalization, and if mf.q is omitted the
default impute behavior is used. If the training data are
already complete and target.mode = "all", this initial
imputation step is skipped.
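As a sketch (assuming a training set like the airquality-based train data used in the Examples), the stage-one engine can be selected directly through mf.q:

```r
## engine choice for the stage-one training imputation:
## mf.q = 1 selects standard missForest, while an integer
## mf.q > 1 selects the multivariate missForest generalization
fit.mv <- impute.learn(
  data = train,
  ntree = 25,
  mf.q = 2,        ## multivariate missForest generalization
  max.iter = 5,
  verbose = FALSE
)
```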
In the second stage, a full sweep is fit on the completed training
data. For each target selected by target.mode, rows where that
target was observed are used to fit a forest with that target on the
left-hand side and the selected deployment predictors on the
right-hand side. The saved learner bank therefore depends on
deployment.xvars. The formula argument affects the
initial training-data imputation step, but it does not define the
saved predictor bank for the later test-time sweep.
By default, deployment.xvars = NULL allows every non-target
column to be used as a predictor. This is convenient, but it can also
introduce leakage if the training data include outcomes, future-only
variables, identifiers, or any fields that will not be available when
the learned imputer is applied to new data. Restrict
deployment.xvars when that is a concern.
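For instance, with the airquality-based training data from the Examples, the deployment predictor bank can be pinned down explicitly (a sketch; any outcome-like or future-only columns would simply be excluded from this vector):

```r
## restrict the saved predictor bank to fields known to be
## available at deployment time; columns left out of
## deployment.xvars are never used by the saved learners
fit.safe <- impute.learn(
  data = train,
  deployment.xvars = c("Solar.R", "Wind", "Temp", "Month"),
  target.mode = "all",
  verbose = FALSE
)
```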
When the imputer is saved to disk, each full-sweep learner is written
separately using fast.save. Loading uses fast.load. In
practice this gives a small manifest plus a directory of saved
learners. The fst package is therefore required for save and
load operations. The explicit save method can write learners either
from memory or by reloading them from an attached saved path.
Prediction starts by matching newdata to the training schema,
filling missing values with training means or modes, and then
applying one or more full-sweep passes. Only the targets selected by
target.mode are updated by saved learners.
If target.mode = "missing.only", a variable that was complete
in training but missing in new data is initialized from the training
fit but does not receive a model-based update. Use
target.mode = "all" if missing values may appear later in any
variable. Complete training data also require
target.mode = "all", because otherwise there are no missing
variables from which to determine the saved targets.
If save.ood = TRUE, the fit also stores an OOD reference in the
manifest. For each saved target, the out-of-bag prediction from the
fitted learner is compared with the observed training value to form a
target-wise reconstruction discrepancy. Continuous and integer targets
use absolute reconstruction error. Factor targets prefer the negative
log predictive probability assigned to the observed class. When class
probabilities are unavailable, unordered factors fall back to a 0/1
mismatch score and ordered factors fall back to a scaled rank
distance.
The row-level OOD calibration stored at fit time is built by
aggregating the target-wise training scores with a weighted mean using
weight. If weight is omitted at fit time, all saved OOD
targets receive weight 1. If a named vector is supplied, entries are
matched by target name, omitted saved OOD targets receive weight 0,
and the resulting weighting scheme is carried forward in the manifest
for later deployment-time scoring.
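For example (a sketch continuing the airquality-based Examples; Solar.R is deliberately omitted from the named vector and therefore receives weight 0):

```r
## fix the OOD weighting scheme once at fit time; the named
## weights are matched by target, omitted saved OOD targets get
## weight 0, and the scheme is reused later by impute.ood()
ood.fit <- impute.learn(
  data = train,
  target.mode = "all",
  save.ood = TRUE,
  weight = c(Ozone = 2, Wind = 1, Temp = 1),
  verbose = FALSE
)
```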
impute.ood first completes the predictor side of newdata
using the same harmonization, initialization, and iterative sweep
logic used by predict.impute.learn. It then reconstructs each
requested target directly from its saved learner and compares the
reconstruction with the observed value. Raw target discrepancies are
converted to target-wise OOD scores using the saved target-specific
training references, which places continuous and factor targets on a
common scale.
The row-level OOD score combines those calibrated target-wise scores
over the targets that are both observed and scoreable for that row. By
default, impute.ood uses a bounded product rule, but the row
aggregate can be changed to weighted mean, a weighted L_p rule,
a log-tail weighted L_p rule, or a top-k rule. This makes
it possible to explore row scores that are more sensitive to sparse
but severe coordinate shifts. By default, impute.ood reuses
the same OOD weights saved during impute.learn, so a pipeline
can fix its weighting scheme once upstream and carry it forward
automatically.
A second component, score.percentile, is obtained by rebuilding
the row-level training reference from the saved target-wise training
OOD scores using the requested target subset, the active weight
vector, and the active row aggregate. This means percentile
calibration remains available when the user leaves the saved weights
in place, overrides them at test time, scores only a subset of the
saved OOD targets, or experiments with alternate row aggregates.
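For instance (a sketch continuing the Examples; the tuning-argument name k for the top-k rule is an assumption, not confirmed by this page):

```r
## rescore only a subset of the saved OOD targets under a top-k
## row aggregate; score.percentile is recalibrated from the
## saved target-wise training scores for this subset/aggregate
ood.sub <- impute.ood(ood.fit, test,
  targets = c("Ozone", "Temp"),
  aggregate = "top.k",
  aggregate.args = list(k = 2),  ## assumed argument name
  verbose = FALSE)
```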
Unseen factor levels are tracked row-wise during harmonization.
Because such values are immediate anomalies relative to the training
schema, impute.ood flags those rows and assigns them the
maximum row-level score. If the unseen level occurs in a scored target
itself, the corresponding target-level discrepancy is also treated as
maximal.
impute.learn returns an object of class
c("impute.learn.rfsrc", "impute.learn"). The object
contains a manifest, optionally the fitted full-sweep learners,
optionally the completed training data, and optionally a path to the
saved imputer on disk. If save.ood = TRUE, the manifest also
contains an ood component storing compact target-wise OOD
references, the saved row-by-target training OOD score matrix used for
later percentile recalibration, and the default OOD aggregation
weights.
load.impute.learn returns an object of the same class.
predict.impute.learn returns a data frame with imputed values
overlaid. An attribute named "impute.learn.info" contains
prediction-time diagnostics such as the number of sweep passes,
pass-difference history, caching mode, disk-load counts, schema
harmonization details, row-wise unseen-factor flags, and any targets
skipped because a learner was unavailable or a prediction failed.
impute.ood returns an object of class
c("impute.ood.rfsrc", "impute.ood"). It is a list with the
following components:
score: the row-level aggregate of calibrated
target-wise OOD scores under the requested aggregate and
weight. Larger values indicate greater
out-of-distribution behavior. For aggregate = "weighted.lp.log",
this raw score is on a positive unbounded scale.
score.percentile: the percentile of score
relative to a row-level training reference rebuilt from the saved
target-wise training OOD scores for the requested targets,
weights, and row aggregate. For legacy fitted objects that do not
contain those saved
training scores, the original saved row-level reference is used
when possible; otherwise NA.
targets.used: the number of weighted targets that
contributed to each row-level score.
target.score: optional matrix of target-wise calibrated
OOD scores, returned when return.details = TRUE.
target.delta: optional matrix of raw target-wise
reconstruction discrepancies, returned when
return.details = TRUE.
info: a list of diagnostics including harmonization
details, row-wise unseen-factor flags, learner-loading
information, the active row aggregate and its arguments, whether
the saved row-level calibration was used, and any target-specific
issues.
Hemant Ishwaran and Udaya B. Kogalur
Stekhoven D.J. and Bühlmann P. (2012). MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118.
Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363–377.
impute.rfsrc, rfsrc, and
predict.rfsrc.
## ------------------------------------------------------------
## small data example: uses missForest for impute engine
## ------------------------------------------------------------
set.seed(101)
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")]
aq$Month <- factor(aq$Month)
id <- sample(1:nrow(aq), 100)
train <- aq[id, ]
test <- aq[-id, ]
fit <- impute.learn(
data = train,
ntree = 25,
mf.q = 1,
max.iter = 5,
full.sweep.options = list(ntree = 25, nsplit = 5)
)
test.imp <- predict(fit, test, max.predict.iter = 2, verbose = FALSE)
head(test.imp)
## OOD scoring is most informative when every deployment-time
## variable can be reconstructed, so target.mode = "all" is recommended.
## Optional named OOD weights can also be supplied here. Any omitted
## targets receive weight 0, and the saved weights are reused
## automatically later by impute.ood().
ood.fit <- impute.learn(
data = train,
ntree = 25,
mf.q = 1,
max.iter = 5,
target.mode = "all",
save.ood = TRUE,
full.sweep.options = list(ntree = 25, nsplit = 5),
verbose = FALSE
)
ood <- impute.ood(ood.fit, test, return.details = TRUE, verbose = FALSE)
head(ood$score)
head(ood$score.percentile)
## try a more spike-sensitive row aggregate
ood.lp <- impute.ood(ood.fit, test,
aggregate = "weighted.lp",
aggregate.args = list(p = 4),
verbose = FALSE)
head(ood.lp$score.percentile)
## Not run:
## ------------------------------------------------------------
## Save the learned imputer to disk and load it later.
## This explicit save example writes learners kept in memory.
## Uses missForest for the impute engine.
## ------------------------------------------------------------
bundle.dir <- file.path(tempdir(), "aq.imputer")
fit <- impute.learn(
data = train,
ntree = 25,
mf.q = 1,
max.iter = 5,
full.sweep.options = list(ntree = 25, nsplit = 5),
keep.models = TRUE,
verbose = FALSE
)
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
unlink(bundle.dir, recursive = TRUE)
## ------------------------------------------------------------
## Challenging example with factors, uses save/reload
## ------------------------------------------------------------
## load pbc, convert everything to factors
data(pbc, package = "randomForestSRC")
dta <- data.frame(lapply(pbc, factor))
dta$days <- pbc$days
dta$status <- pbc$status
## split the data into unbalanced train/test data (25/75)
## the train/test data have the same levels, but different labels
idx <- sample(1:nrow(dta), round(nrow(dta) * .25))
train <- dta[idx,]
test <- dta[-idx,]
## even harder ... factor level not previously encountered in training
levels(test$stage) <- c(levels(test$stage), "fake")
test$stage[sample(seq_len(nrow(test)), 10)] <- "fake"
## train forest
fit <- suppressWarnings(
impute.learn(Surv(days, status) ~ ., train,
target.mode = "all",
save.ood = TRUE,
keep.models = TRUE)
)
## save/reload
bundle.dir <- file.path(tempdir(), "pbc.imputer")
save.impute.learn(fit, bundle.dir, verbose = FALSE)
imp <- load.impute.learn(bundle.dir, lazy = TRUE, verbose = FALSE)
test.imp <- predict(imp, test, max.predict.iter = 2, verbose = FALSE)
ood <- impute.ood(imp, test, return.details = TRUE, verbose = FALSE)
which(ood$info$unseen.rows)
print(summary(test.imp))
unlink(bundle.dir, recursive = TRUE)
## End(Not run)