pif: Proximity Isolation Forest

pif    R Documentation

Proximity Isolation Forest

Description

The function builds a proximity isolation forest that uses fuzzy logic to determine if a record is anomalous or not. The function takes a list object as input and returns it with two vectors appended as attributes. The first vector contains the anomaly scores as numbers between zero and one, and the second vector provides a set of logical values indicating whether the records are outliers (TRUE) or not (FALSE).

Usage

pif(dta, nt = 100L, nss = NULL, max_depth = 12L, threshold = 0.95,
    proximity_type = c("single", "paired", "pivotal"), dist_fun = NULL)

Arguments

dta

A list object whose entries are the individual records.

nt

Number of proximity isolation trees to build to form the forest. By default, it is set to 100.

nss

Number of records subsampled to build each proximity isolation tree in the forest. If set (by default) to NULL, the program will randomly select 25% of the records provided to the dta argument.

max_depth

An integer giving the maximum depth that a proximity isolation tree in the forest can reach. By default, this argument is set to 12.

threshold

A number between zero and one used as a threshold when identifying outliers from the anomaly scores. By default, this argument is set to 0.95, so that 5% of the records are classified as anomalous.

proximity_type

A character string denoting the number of proximity prototypes used by the algorithm (see Details for more information). By default, a "single" prototype is randomly chosen to split a branch in the isolation tree.

dist_fun

A function computing the distance between any pair of components in dta. If set (by default) to NULL, the program will use a Euclidean distance between two numerical arrays.
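As an illustration of what a function passed to dist_fun might look like, the sketch below defines two distances for records stored as plain numeric vectors. The names euclid_dst and manhattan_dst are hypothetical, and the record layout is an assumption; the default used internally by the package may be implemented differently.

```r
# Hypothetical distance functions for records stored as numeric vectors.
# Each takes two records and returns a single non-negative number.
euclid_dst <- function(x, y) sqrt(sum((x - y)^2))
manhattan_dst <- function(x, y) sum(abs(x - y))

a <- c(0, 0)
b <- c(3, 4)
euclid_dst(a, b)     # 5
manhattan_dst(a, b)  # 7
```

A function of this shape would be supplied through the dist_fun argument, for example dist_fun = manhattan_dst.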

Details

The argument dta is provided as an object of class list. This object is considered as a list of arbitrary R objects that will be analyzed by one of the three algorithms provided with the pif function.
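One possible way to prepare such a list from tabular data is sketched below. It assumes that each record is a row of numeric measurements and keeps each row as a numeric vector; other layouts (for instance, one list per row) are equally valid as long as the distance function knows how to read them.

```r
# Assumption: each record is one row of a numeric matrix.
X <- as.matrix(iris[, 1:4])

# One list entry per row; each entry is a named numeric vector of length 4.
dta <- lapply(seq_len(nrow(X)), function(i) X[i, ])

length(dta)           # 150
is.numeric(dta[[1]])  # TRUE
```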

Three algorithms are implemented. The user chooses among them through proximity_type, which sets the number of prototypes used to build the isolation trees in the forest. With "single", each branching node of the tree uses the distance from an input data point to one randomly selected prototype. With "paired", two prototypes are randomly chosen, and each is treated as the gravitational point of its own basin of attraction when partitioning the data. With "pivotal", an additional pivotal point is randomly selected to enhance the two-prototype algorithm: the two distances between a data point and the two prototypes are normalized through the Steinhaus transformation with respect to the pivotal prototype.
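The three splitting rules can be sketched at a single branching node as follows. This is only an illustration under stated assumptions, not the package's implementation: the toy records, the helper names dst and steinhaus, and the random cut point used for the "single" rule are all hypothetical. The Steinhaus transform of a distance d with respect to a pivot p is taken here as 2 d(x, y) / (d(x, p) + d(y, p) + d(x, y)), which lies between 0 and 1.

```r
set.seed(1)
# Toy records: 20 two-dimensional numeric vectors (assumed layout).
dta <- lapply(1:20, function(i) rnorm(2))
dst <- function(x, y) sqrt(sum((x - y)^2))

# Steinhaus transform of the distance d with respect to the pivot p.
# Defined as 0 when the denominator vanishes.
steinhaus <- function(x, y, p, d = dst) {
  dxy <- d(x, y)
  den <- d(x, p) + d(y, p) + dxy
  if (den == 0) 0 else 2 * dxy / den
}

# Draw two prototypes and one pivot at random from the records.
idx <- sample(length(dta), 3)
p1 <- dta[[idx[1]]]
p2 <- dta[[idx[2]]]
pv <- dta[[idx[3]]]

# "single": distances to one prototype, split at a random cut point
# (the cut-point rule is an assumption of this sketch).
d1 <- vapply(dta, function(x) dst(x, p1), numeric(1))
left_single <- d1 <= runif(1, min(d1), max(d1))

# "paired": each record joins the closer of the two prototypes.
left_paired <- vapply(dta, function(x) dst(x, p1) <= dst(x, p2), logical(1))

# "pivotal": same comparison on Steinhaus-normalized distances.
left_pivotal <- vapply(dta, function(x) {
  steinhaus(x, p1, pv) <= steinhaus(x, p2, pv)
}, logical(1))
```

Each logical vector marks the records sent to one branch of the node; repeating the split recursively on both branches, up to max_depth, would grow one tree.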

Value

The original input list dta with the following attributes appended:

scores

A numeric vector of anomaly scores, ranging from 0 to 1, where higher values indicate a higher likelihood of being an outlier.

flag

A logical vector indicating whether each element in the input list is flagged as an outlier (TRUE) or not (FALSE) based on the specified threshold.
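The sketch below shows how the two attributes can be read back and how a threshold of 0.95 relates to the share of flagged records. The object is a simulated stand-in for the value returned by pif, and the quantile rule used to derive the flags is an assumption consistent with the description of threshold, not a statement about the package internals.

```r
set.seed(42)
# Stand-in for the object returned by pif(): a list carrying "scores"
# and "flag" attributes (values are simulated, not package output).
res <- as.list(1:200)
scores <- runif(length(res))
threshold <- 0.95

# Assumed rule: flag records whose score exceeds the threshold quantile.
attr(res, "scores") <- scores
attr(res, "flag") <- scores > quantile(scores, threshold)

# Reading the results back.
outliers <- which(attr(res, "flag"))
length(outliers)  # 10, i.e. 5% of the 200 records

# Records ranked from most to least anomalous.
ranking <- order(attr(res, "scores"), decreasing = TRUE)
```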

Author(s)

Luca Sartore <drwolf85@gmail.com>

Examples

## Not run: 
# Load the package
library(HRTnomaly)
set.seed(2025L)
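# Personalized distance
# Each record is a list holding one character vector (a row of iris);
# its first four entries are the numeric measurements.
my_dst <- function(x, y) {
  xn <- as.numeric(x[[1]][1:4])
  yn <- as.numeric(y[[1]][1:4])
  # Mean squared difference, damped by the median squared difference
  num <- mean((xn - yn)^2)
  den <- median((xn - yn)^2)
  return(num / (1 + den))
}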
# Converting the dataset iris to a list
ir <- apply(iris, 1, list)
# Detect outliers in the `iris` dataset
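res_sng <- pif(ir, nt = 5L, nss = 18L, max_depth = 5L, threshold = .85,
               proximity_type = "single", dist_fun = my_dst)
res_prd <- pif(ir, nt = 5L, nss = 18L, max_depth = 5L, threshold = .85,
               proximity_type = "paired", dist_fun = my_dst)
res_prx <- pif(ir, nt = 5L, nss = 18L, max_depth = 5L, threshold = .85,
               proximity_type = "pivotal", dist_fun = my_dst)
# Count the anomalies identified with the "paired" prototypes
print(sum(attr(res_prd, "flag")))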

## End(Not run)

HRTnomaly documentation built on Nov. 25, 2025, 5:09 p.m.