| pif | R Documentation |
The function builds a proximity isolation forest that uses fuzzy logic to determine if a record is anomalous or not.
The function takes a list object as input and returns it with two vectors appended as attributes.
The first vector contains the anomaly scores as numbers between zero and one, and the second vector provides
a set of logical values indicating whether the records are outliers (TRUE) or not (FALSE).
pif(dta, nt = 100L, nss = NULL, max_depth = 12L, threshold = 0.95,
proximity_type = c("single", "paired", "pivotal"), dist_fun = NULL)
dta |
A list of R objects to be analyzed. |
nt |
Number of deep isolation trees to build to form the forest. By default, it is set to 100L. |
nss |
Number of subsamples used to build a proximity isolation tree in the forest.
By default, it is set to NULL. |
max_depth |
An integer number corresponding to the maximum depth achieved by a proximity isolation tree in the forest.
By default, this argument is set to 12L. |
threshold |
A number between zero and one used as a threshold when identifying outliers from the anomaly scores.
By default, this argument is set to 0.95. |
proximity_type |
A character string denoting the number of proximity prototypes used by the algorithm (see Details for more information).
By default, a "single" prototype is used. |
dist_fun |
A function computing the distance between any pair of components in dta. By default, it is set to NULL. |
The argument dta must be an object of class list.
Its components are treated as arbitrary R objects and are analyzed by one of the three algorithms provided with the pif function.
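As an illustration of the list input described above, the following sketch turns a data frame into a list with one component per record and defines a distance between two such components. It uses base R only. The names rows and row_dist are hypothetical, and the Euclidean distance is just one possible choice; whether pif places further requirements on the list or on the distance function is not stated here.

```r
# Build a plain (unnamed) list with one one-row data frame per record.
rows <- lapply(seq_len(nrow(iris)), function(i) iris[i, ])

# Hypothetical distance between two records:
# Euclidean distance on the four numeric columns of iris.
row_dist <- function(x, y) {
  xn <- unlist(x[1:4])
  yn <- unlist(y[1:4])
  sqrt(sum((xn - yn)^2))
}

length(rows)                    # one component per row of iris
row_dist(rows[[1]], rows[[2]])  # distance between the first two records
```

A function with this two-argument shape is the kind of object that could be passed as dist_fun, as the example at the end of this page does with my_dst.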
Three algorithms are implemented. The user selects one through proximity_type, which sets the number of prototypes used to build the isolation trees in the forest. With a "single" prototype, each branching node of a tree uses the distance from an input data point to one randomly selected prototype. With "paired" prototypes, two prototypes are chosen at random, and each is treated as the gravitational center of its own basin of attraction for partitioning the data. The "pivotal" variant enhances the two-prototype algorithm with an additional, randomly selected pivotal point: the two distances between a data point and the two prototypes are normalized through the Steinhaus transformation, taken with respect to the pivotal prototype.
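The two-prototype partition and its Steinhaus-normalized variant can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: the helper names (euclid, steinhaus, split_paired, split_pivotal) are hypothetical, the Euclidean metric is assumed, and the rule "assign each point to the nearer prototype" is an assumed reading of the basins of attraction. The Steinhaus transformation of a distance d with pivot a is delta(x, y) = 2 d(x, y) / (d(x, a) + d(y, a) + d(x, y)), which lies between 0 and 1 whenever d satisfies the triangle inequality.

```r
# Illustrative sketch only; not the HRTnomaly implementation.

# Assumed base metric: Euclidean distance between numeric vectors.
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Steinhaus transformation of the distance `d` with respect to pivot `a`.
# Returns 0 when the two points coincide, which also avoids 0 / 0.
steinhaus <- function(x, y, a, d = euclid) {
  dxy <- d(x, y)
  if (dxy == 0) return(0)
  2 * dxy / (d(x, a) + d(y, a) + dxy)
}

# "paired" (assumed rule): TRUE if a point is at least as close to p1 as to p2.
split_paired <- function(pts, p1, p2, d = euclid) {
  vapply(pts, function(x) d(x, p1) <= d(x, p2), logical(1))
}

# "pivotal" (assumed rule): same comparison on Steinhaus-normalized distances.
split_pivotal <- function(pts, p1, p2, a, d = euclid) {
  vapply(pts,
         function(x) steinhaus(x, p1, a, d) <= steinhaus(x, p2, a, d),
         logical(1))
}

set.seed(1L)
pts <- lapply(seq_len(20L), function(i) rnorm(2L))  # toy data: 20 points in 2-D
idx <- sample(length(pts), 3L)                      # two prototypes and one pivot

left_paired  <- split_paired(pts, pts[[idx[1]]], pts[[idx[2]]])
left_pivotal <- split_pivotal(pts, pts[[idx[1]]], pts[[idx[2]]], pts[[idx[3]]])

# Cross-tabulate the two partitions to see where they disagree.
table(paired = left_paired, pivotal = left_pivotal)
```

In this sketch each prototype always falls into its own basin, since its distance to itself is zero under both the raw and the normalized distance.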
The original input list dta with the following attributes appended:
A numeric vector of anomaly scores, ranging from 0 to 1, where higher values indicate a higher likelihood of being an outlier.
A logical vector indicating whether each element in the input list is flagged as an outlier (TRUE) or not (FALSE) based on the specified threshold.
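Because the results are attached as attributes, they are read back with attr() or attributes(). The sketch below uses a small stand-in object rather than an actual call to pif, so that it runs without the package. The stand-in carries only a "flag" attribute, the name used in the example on this page; its values are made up for illustration.

```r
# Stand-in for an object returned by pif(): a list with a logical "flag" attribute.
# The three components and the flag values are invented for this illustration.
res <- structure(list("a", "b", "c"), flag = c(FALSE, TRUE, FALSE))

# Positions of the components flagged as outliers.
flagged <- which(attr(res, "flag"))

# Extract the flagged components from the list.
res[flagged]

# List the names of all attributes attached to the object.
names(attributes(res))
```

On a real result, names(attributes(res)) is a quick way to check which attributes have been appended.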
Luca Sartore drwolf85@gmail.com
## Not run:
# Load the package
library(HRTnomaly)
set.seed(2025L)
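# Custom distance between two list components:
# mean squared difference of the four measurements,
# scaled by one plus the median squared difference
my_dst <- function(x, y) {
  xn <- as.numeric(x[[1]][1:4])
  yn <- as.numeric(y[[1]][1:4])
  num <- mean((xn - yn)^2)
  den <- median((xn - yn)^2)
  return(num / (1 + den))
}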
# Converting the dataset iris to a list
ir <- apply(iris, 1, list)
# Detect outliers in the `iris` dataset
res_sng <- pif(ir, 5L, 18L, 5L, .85, "single", my_dst)
res_prd <- pif(ir, 5L, 18L, 5L, .85, "paired", my_dst)
res_prx <- pif(ir, 5L, 18L, 5L, .85, "pivotal", my_dst)
# Count the anomalies identified with the "paired" algorithm
print(sum(attr(res_prd, "flag")))
## End(Not run)