## R/gg_isopro.R
## Package: ggRandomForests (Visually Exploring Random Forests)
## Documents: gg_isopro

####**********************************************************************
####  gg_isopro: tidy extractor for varPro::isopro anomaly scores.
####
####  varPro::isopro returns a list with $howbad (per-observation score
####  in [0,1]) and $case.depth (average isolation depth, lower = more
####  anomalous). gg_isopro() reshapes these into a tidy data.frame,
####  flipping howbad so that higher = more anomalous, which the
####  plot/print/summary methods can consume.
####**********************************************************************

#' Tidy data from a varPro isolation-forest fit
#'
#' Pulls per-observation anomaly scores out of a \code{\link[varPro]{isopro}}
#' fit so you can plot them, sort them, or write them to disk without having
#' to know the internal shape of the fit.
#'
#' @section What isopro is doing:
#' An isolation forest (Liu, Ting and Zhou 2008) is a random forest grown
#' on very small subsamples of the data and asked to split until each
#' observation lands in its own terminal node. The intuition is geometric:
#' a typical observation sits in the dense middle of the feature cloud and
#' takes many splits to isolate, while an unusual observation sits out
#' near an edge and gets cut off after only a few. So \strong{the depth at
#' which an observation is isolated is a proxy for how typical it is}:
#' shallow depth means anomalous, deep depth means ordinary. Average a
#' single observation's depth across many trees and the noise washes out,
#' leaving a stable per-observation rank.
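#'
#' A toy base-R sketch of that intuition, assuming a single feature with
#' distinct values and no subsampling (so it is not how \code{varPro}
#' grows its forests). The helper \code{iso_depth()} is hypothetical and
#' exists only for this illustration: it counts how many uniformly random
#' cuts it takes to leave one point alone.
#'
#' \preformatted{
#' set.seed(1)
#' x <- c(rnorm(50), 8)  # 50 typical points plus one far-out point (index 51)
#'
#' # Number of random cuts needed before point i is alone in its segment.
#' iso_depth <- function(x, i) {
#'   depth <- 0
#'   while (length(x) > 1) {
#'     cut   <- runif(1, min(x), max(x))   # random split point
#'     keep  <- if (x[i] <= cut) x <= cut else x > cut
#'     i     <- sum(keep[seq_len(i)])      # index of point i after subsetting
#'     x     <- x[keep]
#'     depth <- depth + 1
#'   }
#'   depth
#' }
#'
#' # Average over many random "trees" to wash out the noise.
#' mean_depth <- sapply(seq_along(x), function(i) {
#'   mean(replicate(200, iso_depth(x, i)))
#' })
#' which.min(mean_depth)  # the far-out point should have the shallowest depth
#' }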
#'
#' \code{\link[varPro]{isopro}} supports three flavours of isolation
#' forest, which differ in how the splits are chosen:
#' \describe{
#'   \item{\code{"rnd"}}{The original Liu/Ting/Zhou method: each tree node
#'     picks a variable at random and a split point uniformly at random
#'     in the variable's range. Fast, no model, surprisingly effective.}
#'   \item{\code{"unsupv"}}{Unsupervised splitting from
#'     \code{randomForestSRC}: splits are chosen to separate the data
#'     along the directions of highest variance. More structured than
#'     \code{"rnd"}; sometimes more accurate, especially when the
#'     anomalies follow a coherent direction.}
#'   \item{\code{"auto"}}{An auto-encoder formulation that grows a
#'     multivariate forest predicting each feature from the others. Most
#'     expressive, slowest, best suited to low-dimensional data.}
#' }
#' No method is universally best. The varPro authors recommend trying at
#' least two and comparing the score distributions; the plot method here
#' colours per-method curves automatically when you stack the results.
#'
#' @section What's in the output:
#' The returned frame carries two parallel per-observation columns.
#' \code{case.depth} is the raw mean isolation depth (units of "splits",
#' lower = more anomalous), copied from the fit unchanged. \code{howbad}
#' is the same information mapped onto a \code{[0, 1]} scale via the
#' empirical CDF of \code{case.depth}, then flipped by the extractor
#' (\code{1 - quantile}) so that higher = more anomalous. Both columns are
#' kept so you can plot in either space and have the raw depth on hand for
#' diagnostics; \code{howbad} is the canonical score and is what the plot
#' method uses by default.
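#'
#' A minimal sketch of that transform on made-up depths, assuming a plain
#' empirical CDF; the exact quantile convention inside \code{varPro} may
#' differ in how it treats ties or endpoints.
#'
#' \preformatted{
#' depth  <- c(2.1, 5.4, 6.0, 6.3, 7.2)  # made-up mean isolation depths
#' q      <- stats::ecdf(depth)(depth)   # share of cases at or below each depth
#' howbad <- 1 - q                       # flip so higher = more anomalous
#' data.frame(obs = seq_along(depth), case.depth = depth, howbad = howbad)
#' }
#'
#' The shallowest case (\code{obs = 1}) ends up with the largest
#' \code{howbad}; the two columns rank the cases in opposite order.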
#'
#' @section What you use this for:
#' This is screening, not inference. Reach for it when you want to:
#' \itemize{
#'   \item flag observations that may be data-entry errors, out-of-range
#'     measurements, or distinct subpopulations before fitting a primary
#'     model;
#'   \item check whether a held-out cohort sits inside the training
#'     distribution before scoring with a model trained elsewhere;
#'   \item give the analyst a ranked list of "look at these first" cases
#'     for a manual review;
#'   \item score a held-out cohort or a fresh batch of incoming data
#'     against a fitted model and compare the test scores to the training
#'     distribution.
#' }
#' The score is a \emph{rank}, not a probability of being an outlier:
#' \code{howbad = 0.92} means the observation was isolated faster than
#' roughly 92\% of the reference observations, not that it is "92\%
#' likely to be anomalous". Pick a cutoff by looking for the elbow, the
#' point where the score curve bends sharply upward;
#' \code{\link{plot.gg_isopro}} can annotate either a score
#' (\code{threshold}) or a top-percent (\code{top_n_pct}) for you.
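#'
#' A short sketch of turning a top-percent cutoff into a review list. The
#' data frame below is a made-up stand-in for a \code{gg_isopro()} result,
#' and the 5 percent figure is only an example, not a recommendation.
#'
#' \preformatted{
#' set.seed(1)
#' gg <- data.frame(obs = 1:200, howbad = runif(200))  # stand-in scores
#' top_pct <- 5
#' cutoff  <- stats::quantile(gg$howbad, probs = 1 - top_pct / 100)
#' flagged <- gg[gg$howbad >= cutoff, ]
#' flagged[order(-flagged$howbad), ]  # most anomalous first
#' }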
#'
#' @section Scoring new data:
#' Pass a \code{data.frame} as \code{newdata} and the extractor calls
#' \code{\link[varPro]{predict.isopro}} twice: once with
#' \code{quantiles = FALSE} to get the raw mean case depth per row, and once
#' with \code{quantiles = TRUE} to get the per-row quantile of that depth
#' against the training-data depth distribution.
#'
#' varPro's \code{predict.isopro} returns quantiles where \emph{smaller is
#' more anomalous}, which is the opposite polarity of the wrapper's
#' \code{howbad} (where \emph{higher} is more anomalous). The wrapper
#' exposes both conventions so nothing is hidden:
#' \itemize{
#'   \item \code{case.depth} carries varPro's native polarity, \emph{lower
#'     = more anomalous}. This is the unmodified output of
#'     \code{predict(object, newdata, quantiles = FALSE)}. Use it to
#'     cross-reference against raw varPro output.
#'   \item \code{howbad} is the flipped, wrapper-convention version. The
#'     relationship is \code{howbad = 1 - predict(object, newdata, quantiles = TRUE)}.
#' }
#'
#' To overlay training and test scores in one plot, bind the two extractor
#' calls with a \code{method} label column (the same column
#' \code{\link{plot.gg_isopro}} uses to colour rnd / unsupv / auto
#' comparisons):
#'
#' \preformatted{
#' gg_train <- gg_isopro(fit)
#' gg_test  <- gg_isopro(fit, newdata = test_df)
#' gg_both  <- rbind(cbind(gg_train, method = "train"),
#'                   cbind(gg_test,  method = "test"))
#' class(gg_both) <- c("gg_isopro", "data.frame")
#' plot(gg_both)
#' }
#'
#' @param object An \code{isopro} fit returned by
#'   \code{\link[varPro]{isopro}}.
#' @param ... Currently unused. Present before \code{newdata} so that
#'   \code{newdata} is only matched by name, preserving backward
#'   compatibility with callers of the PR #94 signature
#'   \code{gg_isopro(object, ...)}.
#' @param newdata Optional \code{data.frame} of new observations to score
#'   against the fit. Must be passed by name. When \code{NULL} (default)
#'   the extractor returns the in-sample tidy frame from the fit's stored
#'   \code{$case.depth} and \code{$howbad}. When supplied, each row is
#'   scored via \code{\link[varPro]{predict.isopro}} and the same tidy
#'   shape is returned for the test data.
#'
#' @return A \code{data.frame} of class \code{c("gg_isopro", "data.frame")},
#'   returned invisibly, with one row per observation. Columns:
#'   \describe{
#'     \item{obs}{Integer; observation index \code{1..n}, in the same
#'       order as the rows of the data passed to
#'       \code{\link[varPro]{isopro}}, or as the rows of \code{newdata}
#'       when it is supplied.}
#'     \item{case.depth}{Numeric; mean isolation depth across the forest.
#'       Lower means the observation was isolated quickly, so more
#'       anomalous.}
#'     \item{howbad}{Numeric in \code{[0, 1]}; the \code{case.depth}
#'       values pushed through their own empirical CDF and flipped so
#'       higher means more anomalous. This is the score the plot method
#'       draws by default.}
#'   }
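#'   A \code{provenance} attribute records
#'   \code{source = "varPro::isopro"}, the observation count \code{n},
#'   the number of trees \code{ntree} (\code{NA} if it cannot be read from
#'   the fit), and a logical \code{prediction} flag that is \code{TRUE}
#'   when \code{newdata} was scored.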
#'
#' @section Comparing methods:
#' To compare methods (\code{"rnd"}, \code{"unsupv"}, \code{"auto"}), call
#' \code{\link{gg_isopro}} on each fit and \code{dplyr::bind_rows()} the
#' results with a \code{method} label column. The plot method auto-detects
#' \code{method} and colours the curves.
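#'
#' A hedged sketch of that workflow, reusing the \code{isopro} arguments
#' from the Examples section and base \code{rbind()} in place of
#' \code{dplyr::bind_rows()} to avoid an extra dependency. The
#' \code{sampsize} and \code{ntree} values are arbitrary, and the class is
#' reset by hand on the assumption that \code{rbind()} drops it.
#'
#' \preformatted{
#' if (requireNamespace("varPro", quietly = TRUE)) {
#'   set.seed(1)
#'   methods <- c("rnd", "unsupv")
#'   gg_list <- lapply(methods, function(m) {
#'     fit <- varPro::isopro(data = iris[, 1:4], method = m,
#'                           sampsize = 32, ntree = 50)
#'     cbind(gg_isopro(fit), method = m)  # label each result
#'   })
#'   gg_cmp <- do.call(rbind, gg_list)
#'   class(gg_cmp) <- c("gg_isopro", "data.frame")  # restore the class
#'   plot(gg_cmp)
#' }
#' }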
#'
#' @references
#' Liu, F. T., Ting, K. M., and Zhou, Z. H. (2008). Isolation Forest.
#' \emph{Eighth IEEE International Conference on Data Mining}, 413-422.
#'
#' Ishwaran, H., Mantero, A., and Lu, M. (2025). varPro: Model-Independent
#' Variable Selection via the Rule-Based Variable Priority Framework.
#' \emph{R package version 3.x}.
#'
#' @seealso \code{\link{plot.gg_isopro}}, \code{\link[varPro]{isopro}}
#'
#' @examples
#' \donttest{
#' if (requireNamespace("varPro", quietly = TRUE)) {
#'   set.seed(1)
#'   fit <- varPro::isopro(data = iris[, 1:4], method = "rnd",
#'                         sampsize = 32, ntree = 50)
#'   gg <- gg_isopro(fit)
#'   plot(gg)
#' }
#' }
#'
#' @export
gg_isopro <- function(object, ..., newdata = NULL) {
  UseMethod("gg_isopro", object)
}

#' @export
gg_isopro.isopro <- function(object, ..., newdata = NULL) {
  if (!inherits(object, "isopro")) {
    stop("gg_isopro expects an 'isopro' object from varPro::isopro().",
         call. = FALSE)
  }

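  # Tree count for the provenance attribute. Fall back to NA when the fit
  # does not expose a single usable value at $isoforest$ntree.
  ntree <- tryCatch(
    as.integer(object$isoforest$ntree),
    error = function(e) NA_integer_
  )
  ntree <- if (length(ntree) == 1L && !is.na(ntree)) ntree else NA_integer_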

  ## ---- Training path (newdata = NULL) ------------------------------------
  if (is.null(newdata)) {
    # varPro's $howbad uses "lower = more anomalous" polarity (it is the
    # quantile of case.depth, low depth = anomalous). The wrapper convention
    # is "higher = more anomalous", so flip the polarity here the same way
    # the prediction path does (howbad = 1 - quantile).
    howbad <- 1 - as.numeric(object$howbad)
    depth  <- as.numeric(object$case.depth)
    n      <- length(howbad)

    gg_dta <- data.frame(
      obs        = seq_len(n),
      case.depth = depth,
      howbad     = howbad
    )
    class(gg_dta) <- c("gg_isopro", class(gg_dta))
    attr(gg_dta, "provenance") <- list(
      source     = "varPro::isopro",
      n          = n,
      ntree      = ntree,
      prediction = FALSE
    )
    return(invisible(gg_dta))
  }

  ## ---- Prediction path (newdata supplied) -------------------------------
  if (!is.data.frame(newdata)) {
    stop("newdata must be a data.frame.", call. = FALSE)
  }

  # Two calls to predict.isopro: raw depth and quantile-against-training.
  # The wrapper polarity is "higher = more anomalous", so we flip the quantile:
  #   howbad = 1 - predict(object, newdata, quantiles = TRUE)
  # case.depth keeps varPro's native scale (lower = more anomalous), giving
  # the user a varPro-polarity number for cross-reference.
  depth <- as.numeric(stats::predict(object, newdata = newdata,
                                     quantiles = FALSE))
  q     <- as.numeric(stats::predict(object, newdata = newdata,
                                     quantiles = TRUE))
  howbad <- 1 - q
  n      <- nrow(newdata)

  gg_dta <- data.frame(
    obs        = seq_len(n),
    case.depth = depth,
    howbad     = howbad
  )
  class(gg_dta) <- c("gg_isopro", class(gg_dta))
  attr(gg_dta, "provenance") <- list(
    source     = "varPro::isopro",
    n          = n,
    ntree      = ntree,
    prediction = TRUE
  )
  invisible(gg_dta)
}