####**********************************************************************
#### gg_isopro: tidy extractor for varPro::isopro anomaly scores.
####
#### varPro::isopro returns a list with $howbad (per-observation anomaly
#### score in [0,1]) and $case.depth (average isolation depth, lower =
#### more anomalous). gg_isopro() reshapes these into a tidy data.frame
#### the plot/print/summary methods can consume.
####**********************************************************************
#' Tidy data from a varPro isolation-forest fit
#'
#' Pulls per-observation anomaly scores out of a \code{\link[varPro]{isopro}}
#' fit so you can plot them, sort them, or write them to disk without having
#' to know the internal shape of the fit.
#'
#' @section What isopro is doing:
#' An isolation forest (Liu, Ting and Zhou 2008) is a random forest grown
#' on very small subsamples of the data and asked to split until each
#' observation lands in its own terminal node. The intuition is geometric:
#' a typical observation sits in the dense middle of the feature cloud and
#' takes many splits to isolate, while an unusual observation sits out
#' near an edge and gets cut off after only a few. So \strong{the depth at
#' which an observation is isolated is a proxy for how typical it is}:
#' shallow depth means anomalous, deep depth means ordinary. Average a
#' single observation's depth across many trees and the noise washes out,
#' leaving a stable per-observation rank.
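#'
#' As a rough illustration only (a base-R toy, not varPro's implementation;
#' \code{iso_depth} and \code{mean_depth} are hypothetical helper names), the
#' sketch below isolates one point in one dimension with uniformly random
#' cuts and averages the depth over many repetitions. Under these
#' assumptions the planted outlier should come out with a clearly smaller
#' mean depth than a point from the dense middle:
#'
#' \preformatted{
#' # Depth at which `target` (an element of `x`) ends up alone.
#' iso_depth <- function(x, target, depth = 0L) {
#'   if (length(x) <= 1L || diff(range(x)) == 0) return(depth)
#'   split_at <- stats::runif(1, min(x), max(x))  # random cut in the range
#'   side <- if (target < split_at) x[x < split_at] else x[x >= split_at]
#'   iso_depth(side, target, depth + 1L)          # follow the target's side
#' }
#'
#' set.seed(1)
#' x <- c(stats::rnorm(63), 8)    # 63 typical values plus one outlier at 8
#' typical <- sort(x)[32]         # an actual data point near the centre
#' mean_depth <- function(target) mean(replicate(200, iso_depth(x, target)))
#' mean_depth(8)                  # outlier: cut off after few splits
#' mean_depth(typical)            # central point: needs many more splits
#' }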
#'
#' \code{\link[varPro]{isopro}} supports three flavours of isolation
#' forest, which differ in how the splits are chosen:
#' \describe{
#' \item{\code{"rnd"}}{The original Liu/Ting/Zhou method: each tree node
#' picks a variable at random and a split point uniformly at random
#' in the variable's range. Fast, no model, surprisingly effective.}
#' \item{\code{"unsupv"}}{Unsupervised splitting from
#' \code{randomForestSRC}: splits are chosen to separate the data
#' along the directions of highest variance. More structured than
#' \code{"rnd"}; sometimes more accurate, especially when the
#' anomalies follow a coherent direction.}
#' \item{\code{"auto"}}{An auto-encoder formulation that grows a
#' multivariate forest predicting each feature from the others. Most
#' expressive, slowest, best suited to low-dimensional data.}
#' }
#' No method is universally best. The varPro authors recommend trying at
#' least two and comparing the score distributions; the plot method here
#' colours per-method curves automatically when you stack the results.
#'
#' @section What's in the output:
#' The fit carries two parallel per-observation vectors, and the extractor
#' returns both. \code{case.depth} is the raw mean isolation depth (units
#' of "splits", lower = more anomalous). \code{howbad} is the same
#' information mapped onto a \code{[0, 1]} scale via the empirical CDF of
#' \code{case.depth}; the extractor flips it (\code{1 - quantile}) so that
#' higher = more anomalous. Both columns are kept so you can plot in either
#' space and have the raw depth on hand for diagnostics; \code{howbad} is
#' the canonical score and is what the plot method uses by default.
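#'
#' The depth-to-score mapping can be sketched in base R. This assumes the
#' score is one minus the empirical CDF of the depths, as the wrapper's code
#' comments describe; the depth values are invented for illustration and
#' varPro's internals may differ in detail:
#'
#' \preformatted{
#' depth  <- c(3.1, 7.4, 6.8, 7.9, 2.2)   # made-up mean isolation depths
#' q      <- stats::ecdf(depth)(depth)    # quantile of each depth; shallow = low
#' howbad <- 1 - q                        # flipped: shallow depth = high score
#' order(howbad, decreasing = TRUE)       # row indices, most anomalous first
#' }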
#'
#' @section What you use this for:
#' This is screening, not inference. Reach for it when you want to:
#' \itemize{
#' \item flag observations that may be data-entry errors, out-of-range
#' measurements, or distinct subpopulations before fitting a primary
#' model;
#' \item check whether a held-out cohort sits inside the training
#' distribution before scoring with a model trained elsewhere;
#' \item give the analyst a ranked list of "look at these first" cases
#' for a manual review;
#' \item score a fresh batch of incoming data against a fitted forest
#' and compare its scores to the training distribution.
#' }
#' The score is a \emph{rank}, not a probability of being an outlier: an
#' observation with \code{howbad = 0.92} was isolated at a shallower depth
#' than roughly 92\% of the training observations; it is not "92\% likely
#' to be anomalous". Pick a cutoff by looking for the elbow, the point
#' where the score curve starts to climb steeply;
#' \code{\link{plot.gg_isopro}} can annotate either a score
#' (\code{threshold}) or a top-percent (\code{top_n_pct}) for you.
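#'
#' A minimal sketch of the top-percent idea, using a hand-made stand-in for
#' the extractor output (the scores are invented for illustration, and
#' \code{plot.gg_isopro} may compute its annotation differently):
#'
#' \preformatted{
#' gg <- data.frame(obs = 1:10, howbad = seq(0.05, 0.95, by = 0.1))
#' top_pct <- 20                                 # review the top 20 percent
#' n_flag  <- ceiling(nrow(gg) * top_pct / 100)  # number of rows to keep
#' flagged <- head(gg[order(gg$howbad, decreasing = TRUE), ], n_flag)
#' flagged$obs                                   # most unusual rows first
#' }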
#'
#' @section Scoring new data:
#' Pass a \code{data.frame} as \code{newdata} and the extractor calls
#' \code{\link[varPro]{predict.isopro}} twice: once with
#' \code{quantiles = FALSE} to get the raw mean case depth per row, and once
#' with \code{quantiles = TRUE} to get the per-row quantile of that depth
#' against the training-data depth distribution.
#'
#' varPro's \code{predict.isopro} returns quantiles where \emph{smaller is
#' more anomalous}, which is the opposite polarity of the wrapper's
#' \code{howbad} (where \emph{higher} is more anomalous). The wrapper
#' exposes both conventions so nothing is hidden:
#' \itemize{
#' \item \code{case.depth} carries varPro's native polarity, \emph{lower
#' = more anomalous}. This is the unmodified output of
#' \code{predict(object, newdata, quantiles = FALSE)}. Use it to
#' cross-reference against raw varPro output.
#' \item \code{howbad} is the flipped, wrapper-convention version. The
#' relationship is \code{howbad = 1 - predict(object, newdata, quantiles = TRUE)}.
#' }
#'
#' To overlay training and test scores in one plot, bind the two extractor
#' calls with a \code{method} label column (the same column
#' \code{\link{plot.gg_isopro}} uses to colour rnd / unsupv / auto
#' comparisons):
#'
#' \preformatted{
#' gg_train <- gg_isopro(fit)
#' gg_test <- gg_isopro(fit, newdata = test_df)
#' gg_both <- rbind(cbind(gg_train, method = "train"),
#'                  cbind(gg_test, method = "test"))
#' class(gg_both) <- c("gg_isopro", "data.frame")
#' plot(gg_both)
#' }
#'
#' @param object An \code{isopro} fit returned by
#' \code{\link[varPro]{isopro}}.
#' @param ... Currently unused. Present before \code{newdata} so that
#' \code{newdata} is only matched by name, preserving backward
#' compatibility with callers of the PR #94 signature
#' \code{gg_isopro(object, ...)}.
#' @param newdata Optional \code{data.frame} of new observations to score
#' against the fit. Must be passed by name. When \code{NULL} (default)
#' the extractor returns the in-sample tidy frame from the fit's stored
#' \code{$case.depth} and \code{$howbad}. When supplied, each row is
#' scored via \code{\link[varPro]{predict.isopro}} and the same tidy
#' shape is returned for the test data.
#'
#' @return A \code{data.frame} of class \code{c("gg_isopro", "data.frame")},
#' one row per observation. Columns:
#' \describe{
#' \item{obs}{Integer; observation index \code{1..n}, in the row order
#' of the data passed to \code{\link[varPro]{isopro}}, or of
#' \code{newdata} when it is supplied.}
#' \item{case.depth}{Numeric; mean isolation depth across the forest.
#' Lower means the observation was isolated quickly, so more
#' anomalous.}
#' \item{howbad}{Numeric in \code{[0, 1]}; the quantile of
#' \code{case.depth} within the training-depth distribution, flipped
#' (\code{1 - quantile}) so higher means more anomalous. This is the
#' score the plot method draws by default.}
#' }
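#' A \code{provenance} attribute records
#' \code{source = "varPro::isopro"}, the observation count \code{n}, the
#' number of trees \code{ntree} (\code{NA} if it cannot be read from the
#' fit), and a logical \code{prediction} flag that is \code{TRUE} when the
#' rows were scored from \code{newdata}.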
#'
#' @section Comparing methods:
#' To compare methods (\code{"rnd"}, \code{"unsupv"}, \code{"auto"}), call
#' \code{\link{gg_isopro}} on each fit and \code{dplyr::bind_rows()} the
#' results with a \code{method} label column. The plot method auto-detects
#' \code{method} and colours the curves.
#'
#' @references
#' Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008). Isolation Forest.
#' \emph{Eighth IEEE International Conference on Data Mining}, 413-422.
#'
#' Ishwaran, H., Mantero, A., and Lu, M. (2025). varPro: Model-Independent
#' Variable Selection via the Rule-Based Variable Priority Framework.
#' \emph{R package version 3.x}.
#'
#' @seealso \code{\link{plot.gg_isopro}}, \code{\link[varPro]{isopro}}
#'
#' @examples
#' \donttest{
#' if (requireNamespace("varPro", quietly = TRUE)) {
#'   set.seed(1)
#'   fit <- varPro::isopro(data = iris[, 1:4], method = "rnd",
#'                         sampsize = 32, ntree = 50)
#'   gg <- gg_isopro(fit)
#'   plot(gg)
#' }
#' }
#'
#' @export
gg_isopro <- function(object, ..., newdata = NULL) {
  UseMethod("gg_isopro", object)
}
#' @export
gg_isopro.isopro <- function(object, ..., newdata = NULL) {
  if (!inherits(object, "isopro")) {
    stop("gg_isopro expects an 'isopro' object from varPro::isopro().",
         call. = FALSE)
  }
  # Number of trees is optional metadata; fall back to NA if it is missing.
  ntree <- tryCatch(
    as.integer(object$isoforest$ntree),
    error = function(e) NA_integer_
  )
  ntree <- if (length(ntree) == 1L && !is.na(ntree)) ntree else NA_integer_
  ## ---- Training path (newdata = NULL) ------------------------------------
  if (is.null(newdata)) {
    # varPro's $howbad uses "lower = more anomalous" polarity (it is the
    # quantile of case.depth, low depth = anomalous). The wrapper convention
    # is "higher = more anomalous", so flip the polarity here the same way
    # the prediction path does (howbad = 1 - quantile).
    howbad <- 1 - as.numeric(object$howbad)
    depth <- as.numeric(object$case.depth)
    n <- length(howbad)
    gg_dta <- data.frame(
      obs = seq_len(n),
      case.depth = depth,
      howbad = howbad
    )
    class(gg_dta) <- c("gg_isopro", class(gg_dta))
    attr(gg_dta, "provenance") <- list(
      source = "varPro::isopro",
      n = n,
      ntree = ntree,
      prediction = FALSE
    )
    return(invisible(gg_dta))
  }
  ## ---- Prediction path (newdata supplied) -------------------------------
  if (!is.data.frame(newdata)) {
    stop("newdata must be a data.frame.", call. = FALSE)
  }
  # Two calls to predict.isopro: raw depth and quantile-against-training.
  # The wrapper polarity is "higher = more anomalous", so we flip the quantile:
  #   howbad = 1 - predict(object, newdata, quantiles = TRUE)
  # case.depth keeps varPro's native scale (lower = more anomalous), giving
  # the user a varPro-polarity number for cross-reference.
  depth <- as.numeric(stats::predict(object, newdata = newdata,
                                     quantiles = FALSE))
  q <- as.numeric(stats::predict(object, newdata = newdata,
                                 quantiles = TRUE))
  howbad <- 1 - q
  n <- nrow(newdata)
  gg_dta <- data.frame(
    obs = seq_len(n),
    case.depth = depth,
    howbad = howbad
  )
  class(gg_dta) <- c("gg_isopro", class(gg_dta))
  attr(gg_dta, "provenance") <- list(
    source = "varPro::isopro",
    n = n,
    ntree = ntree,
    prediction = TRUE
  )
  invisible(gg_dta)
}