| gg_isopro | R Documentation |
Pulls per-observation anomaly scores out of an isopro
fit so you can plot them, sort them, or write them to disk without having
to know the internal shape of the fit.
gg_isopro(object, ..., newdata = NULL)
object |
An |
... |
Currently unused. Present before |
newdata |
Optional |
A data.frame of class c("gg_isopro", "data.frame"),
one row per observation. Columns:
Integer; observation index 1..n, in the same
order as the rows of the data passed to
isopro.
case.depth: Numeric; mean isolation depth across the forest. Lower means the observation was isolated quickly, so more anomalous.
howbad: Numeric in [0, 1]; the case.depth
values pushed through their own empirical CDF and flipped so
higher means more anomalous. This is the score the plot method
draws by default.
A provenance attribute records
source = "varPro::isopro", the observation count n, and
the number of trees ntree.
An isolation forest (Liu, Ting and Zhou 2008) is a random forest grown on very small subsamples of the data and asked to split until each observation lands in its own terminal node. The intuition is geometric: a typical observation sits in the dense middle of the feature cloud and takes many splits to isolate, while an unusual observation sits out near an edge and gets cut off after only a few. So the depth at which an observation is isolated is a proxy for how typical it is: shallow depth means anomalous, deep depth means ordinary. Average a single observation's depth across many trees and the noise washes out, leaving a stable per-observation rank.
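The depth intuition above can be sketched in a few lines of base R. This is a simplified illustration, not varPro's implementation: the helper names isolation_depth and mean_isolation_depth are hypothetical, the data are simulated, and each "tree" only follows the branch that contains the observation of interest rather than growing a full tree.

```r
# Depth at which row `target` of the numeric matrix `x` is isolated by one
# random tree: pick a random splittable column, pick a uniform random cut
# point in its observed range, keep the side that contains the target.
isolation_depth <- function(x, target) {
  rows <- seq_len(nrow(x))
  depth <- 0
  while (length(rows) > 1L) {
    spans <- apply(x[rows, , drop = FALSE], 2L, function(v) diff(range(v)))
    splittable <- which(spans > 0)
    if (length(splittable) == 0L) break  # remaining rows are exact duplicates
    j <- splittable[sample.int(length(splittable), 1L)]
    vals <- x[rows, j]
    cut <- runif(1L, min(vals), max(vals))
    keep <- if (x[target, j] < cut) vals < cut else vals >= cut
    rows <- rows[keep]
    depth <- depth + 1
  }
  depth
}

# Average the depth over many small random subsamples (assumption: the
# target is always included in the subsample, which a real forest does not
# require because it scores observations by dropping them down fitted trees).
mean_isolation_depth <- function(x, target, ntree = 200L, sampsize = 32L) {
  others <- setdiff(seq_len(nrow(x)), target)
  depths <- vapply(seq_len(ntree), function(b) {
    idx <- c(target, sample(others, min(sampsize - 1L, length(others))))
    isolation_depth(x[idx, , drop = FALSE], target = 1L)
  }, numeric(1))
  mean(depths)
}

set.seed(1)
# 100 ordinary points around the origin plus one far-away point at (6, 6).
x <- rbind(matrix(rnorm(200), ncol = 2), c(6, 6))
outlier <- nrow(x)
central <- which.min(rowSums(x[-outlier, ]^2))

d_out <- mean_isolation_depth(x, outlier)
d_mid <- mean_isolation_depth(x, central)
c(outlier = d_out, central = d_mid)
```

The far-away point should come out with a clearly smaller mean depth than the central one, which is the ordering the anomaly score relies on.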
isopro supports three flavours of isolation
forest, which differ in how the splits are chosen:
"rnd": The original Liu/Ting/Zhou method: each tree node picks a variable at random and a split point uniformly at random in the variable's range. Fast, no model, surprisingly effective.
"unsupv": Unsupervised splitting from
randomForestSRC: splits are chosen to separate the data
along the directions of highest variance. More structured than
"rnd"; sometimes more accurate, especially when the
anomalies follow a coherent direction.
"auto": An auto-encoder formulation that grows a multivariate forest predicting each feature from the others. Most expressive, slowest, best suited to low-dimensional data.
No method is universally best. The varPro authors recommend trying at least two and comparing the score distributions; the plot method here colours per-method curves automatically when you stack the results.
The fit gives back two parallel per-observation vectors:
case.depth is the raw mean isolation depth (units of "splits",
lower = more anomalous) and howbad is the same information
transformed onto a [0, 1] scale via the empirical CDF of
case.depth (higher = more anomalous). Both columns are kept so
you can plot in either space and have the raw depth on hand for
diagnostics; howbad is the canonical score and is what the plot
method uses by default.
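As a rough illustration of how the two columns relate, the sketch below pushes a vector of depths through its own empirical CDF and flips the result. The depths are made-up numbers, and the use of stats::ecdf() is an assumption: the wrapper's exact handling of ties and endpoints is not documented here and may differ.

```r
# Hypothetical mean isolation depths for six observations (made-up numbers).
case_depth <- c(7.9, 8.4, 3.1, 8.1, 7.6, 5.2)

# Empirical CDF of the depths, evaluated at the depths themselves:
# the fraction of observations whose depth is <= this one.
depth_cdf <- ecdf(case_depth)(case_depth)

# Flip so that shallow (quickly isolated) observations score high.
howbad <- 1 - depth_cdf

# Most anomalous first.
data.frame(case_depth, howbad)[order(-howbad), ]
```

The transformation only reorders the scale: the ranking of observations is the same in both columns, just reversed.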
This is screening, not inference. Reach for it when you want to:
flag observations that may be data-entry errors, out-of-range measurements, or distinct subpopulations before fitting a primary model;
check whether a held-out cohort sits inside the training distribution before scoring with a model trained elsewhere;
give the analyst a ranked list of "look at these first" cases for a manual review;
score a held-out cohort or a fresh batch of incoming data against a fitted model and compare the test scores to the training distribution.
The score is a rank, not a probability of being an outlier: two
observations with howbad = 0.92 are both unusual, not "92%
likely to be anomalous". Pick a cutoff by looking at where the elbow
rises; plot.gg_isopro can annotate either a score
(threshold) or a top-percent (top_n_pct) for you.
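The two cutoff styles can be sketched outside the plot method as follows. The variable names threshold and top_n_pct simply mirror the argument names mentioned above; the scores are simulated, and whether plot.gg_isopro treats the cutoff as inclusive or how it breaks ties is an assumption here.

```r
set.seed(42)
# Stand-in scores: mostly ordinary observations plus a few high scorers.
scores <- data.frame(howbad = c(runif(95, 0, 0.8), runif(5, 0.9, 1)))

# Option 1: flag everything at or above a fixed score.
threshold <- 0.9
flag_by_score <- scores$howbad >= threshold

# Option 2: flag the top 5 percent of observations by score.
top_n_pct <- 5
n_flag <- ceiling(nrow(scores) * top_n_pct / 100)
flag_by_pct <- rank(-scores$howbad, ties.method = "first") <= n_flag

# Sorted scores: the elbow is where this curve turns sharply upward.
sorted_scores <- sort(scores$howbad)

table(flag_by_score, flag_by_pct)
```

A score cutoff flags a variable number of rows from batch to batch, while a top-percent cutoff always flags the same share, so the choice depends on whether the review budget or the score level is the fixed quantity.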
Pass a data.frame as newdata and the extractor calls
predict.isopro twice: once with
quantiles = FALSE to get the raw mean case depth per row, and once
with quantiles = TRUE to get the per-row quantile of that depth
against the training-data depth distribution.
varPro's predict.isopro returns quantiles where smaller is
more anomalous, which is the opposite polarity of the wrapper's
howbad (where higher is more anomalous). The wrapper
exposes both conventions so nothing is hidden:
case.depth carries varPro's native polarity, lower
= more anomalous. This is the unmodified output of
predict(object, newdata, quantiles = FALSE). Use it to
cross-reference against raw varPro output.
howbad is the flipped, wrapper-convention version. The
relationship is howbad = 1 - predict(object, newdata, quantiles = TRUE).
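The polarity flip for new data can be illustrated with base R alone. In this sketch ecdf(train_depth) stands in for the quantile that predict(object, newdata, quantiles = TRUE) is described as returning; the depths are made-up numbers and varPro's exact quantile computation may differ.

```r
# Hypothetical mean isolation depths (made-up numbers).
train_depth <- c(8.2, 7.9, 8.5, 6.1, 7.4, 8.0, 3.3, 7.7)
test_depth  <- c(8.1, 2.9, 6.5)

# Quantile of each test depth within the training depths:
# small quantile = shallower than most training rows = more anomalous.
test_quantile <- ecdf(train_depth)(test_depth)

# Flip to the wrapper's polarity: higher = more anomalous.
test_howbad <- 1 - test_quantile

data.frame(case.depth = test_depth, quantile = test_quantile,
           howbad = test_howbad)
```

The second test row is shallower than every training depth, so it gets quantile 0 and the maximum howbad of 1.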
To overlay training and test scores in one plot, bind the two extractor
calls with a method label column (the same column
plot.gg_isopro uses to colour rnd / unsupv / auto
comparisons):
gg_train <- gg_isopro(fit)
gg_test <- gg_isopro(fit, newdata = test_df)
gg_both <- rbind(cbind(gg_train, method = "train"),
                 cbind(gg_test, method = "test"))
class(gg_both) <- c("gg_isopro", "data.frame")
plot(gg_both)
To compare methods ("rnd", "unsupv", "auto"), call
gg_isopro on each fit and dplyr::bind_rows() the
results with a method label column. The plot method auto-detects
method and colours the curves.
Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008). Isolation Forest. Eighth IEEE International Conference on Data Mining, 413-422.
Ishwaran, H., Mantero, A., and Lu, M. (2025). varPro: Model-Independent Variable Selection via the Rule-Based Variable Priority Framework. R package version 3.x.
plot.gg_isopro, isopro
if (requireNamespace("varPro", quietly = TRUE)) {
  set.seed(1)
  fit <- varPro::isopro(data = iris[, 1:4], method = "rnd",
                        sampsize = 32, ntree = 50)
  gg <- gg_isopro(fit)
  plot(gg)
}