IsolationForest implements Isolation Forests and Extended Isolation Forests in R, optionally parallelized (for speed) using the future framework.
This package started as a clone of Zelazny7/isofor on GitHub. In addition, it implements the Extended Isolation Forests of Hariri et al. (2019). There is also support for optional encoding of categorical variables using the categoryEncodings package, and missing values can be handled via Missingness Incorporated in Attributes splitting (Twala et al. (2008)).
NOTE: The package does not explicitly handle factor splitting, since encoding factors seems to be a more reasonable approach for trees; see 'Sufficient Representations for Categorical Variables' by Johannemann et al. (2019). The package currently supports these encodings through categoryEncodings.
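To make the idea of encoding factors before fitting concrete, here is a minimal base R sketch. It uses plain one-hot encoding through `stats::model.matrix()` as a stand-in; it does not use the categoryEncodings API, and the toy data frame and its column names are invented for illustration.

```r
# Toy data with one numeric and one categorical column (hypothetical example data).
df <- data.frame(
  size   = c(1.2, 3.4, 2.2, 5.1),
  colour = factor(c("red", "blue", "red", "green"))
)

# One indicator column per factor level; "- 1" drops the intercept so that
# no level is absorbed into a baseline.
one_hot <- stats::model.matrix(~ colour - 1, data = df)

# Purely numeric data frame that a tree-based method can split on directly.
encoded <- cbind(df["size"], as.data.frame(one_hot))
```

The resulting `encoded` data frame contains only numeric columns, which is the form the forest expects when factors are not split on directly.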
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("JSzitas/IsolationForests")
Generate random data with anomalies:
set.seed(1071)
X <- rnorm(500, 0, 1)
Y <- rnorm(500, 0, 1)

# Replace some observations with draws from shifted distributions
replace_x <- sample(1:500, 20)
replace_y <- sample(1:500, 50)
X[replace_x] <- rnorm(20, mean = 3, sd = 2)
Y[replace_y] <- rnorm(50, mean = -4, sd = 1.5)

# Mark the replaced observations as anomalies
anomaly_indicator <- rep(0, 500)
anomaly_indicator[replace_x] <- 1
anomaly_indicator[replace_y] <- 1
anomaly_indicator <- as.factor(anomaly_indicator)

test_data <- data.frame(X, Y, anomaly_indicator)

ggplot2::ggplot(
  data = test_data,
  ggplot2::aes(x = X, y = Y, colour = anomaly_indicator, shape = anomaly_indicator)
) +
  ggplot2::geom_point(size = 1.9) +
  ggplot2::scale_colour_manual(name = "Anomaly", values = c("#2554C7", "#E42217")) +
  ggplot2::scale_shape_manual(name = "Anomaly", values = c(15, 17))
Count the anomalous values and keep the result in mind for later (it is at most 70, since the two sets of replaced indices can overlap):
total_anomalous <- sum(anomaly_indicator == 1)
Now try fitting an Isolation Forest:
library(IsolationForest)
fit <- isolationForest(
  X = test_data[, 1:2],          # we don't want column 3 here
  n_trees = 1000,
  Phi = 64,                      # subsampling rate for individual trees
  parallel = TRUE,               # defaults to future::plan("multiprocess")
  future_plan = "multiprocess",  # change this argument to change the plan
  extension_level = 1,           # how 'extended' should the trees be?
  vanilla = FALSE                # whether to fit an unextended, original isolation forest
)
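The `extension_level` argument controls how many features take part in each split. As a rough illustration of the idea in Hariri et al. (2019), and not of this package's internals, the sketch below draws one extended split: a random hyperplane through a random point in the data range. `extended_split` is a hypothetical helper written for this example, and the convention that level 0 gives an axis-parallel split while level `d - 1` uses all `d` features is assumed from the paper.

```r
# Draw a single random hyperplane split for a numeric matrix or data frame X.
extended_split <- function(X, extension_level) {
  X <- as.matrix(X)
  d <- ncol(X)
  stopifnot(extension_level >= 0, extension_level <= d - 1)

  # Random normal vector of the hyperplane.
  normal <- stats::rnorm(d)

  # Zero out d - extension_level - 1 coordinates; at level 0 only one
  # coordinate survives, which reduces to an ordinary axis-parallel split.
  n_zero <- d - extension_level - 1
  if (n_zero > 0) {
    normal[sample.int(d, n_zero)] <- 0
  }

  # Random intercept point drawn uniformly within the range of each feature.
  intercept <- stats::runif(d, min = apply(X, 2, min), max = apply(X, 2, max))

  # Observations with (x - intercept) . normal < 0 go to the left branch.
  left <- as.vector(sweep(X, 2, intercept) %*% normal) < 0

  list(normal = normal, intercept = intercept, left = left)
}

set.seed(1)
toy <- matrix(stats::rnorm(200), ncol = 2)
axis_parallel <- extended_split(toy, extension_level = 0)
fully_extended <- extended_split(toy, extension_level = 1)
```

With two features, as in the data above, `extension_level = 1` is the fully extended case: both coordinates of the normal vector are non-zero, so the split line can take any slope.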
To get the anomaly scores, call:
scored_data <- predict.isolationForest(fit, test_data[,1:2])
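For intuition about what an anomaly score means, the sketch below follows the standard definition from the original Isolation Forest paper (Liu et al., 2008): the average depth at which a point is isolated is normalised by the expected path length for a sample of size `n`. Both function names are hypothetical, and whether this package computes its scores in exactly this way is an assumption.

```r
# Expected path length of an unsuccessful binary search tree lookup among
# n points, used as the normalising constant c(n).
average_path_length <- function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649  # approximation of H(n - 1)
  2 * harmonic - 2 * (n - 1) / n
}

# Score in (0, 1]: values near 1 mean the point is isolated after very few
# splits; a value of 0.5 corresponds to an average isolation depth.
anomaly_score <- function(mean_depth, n) {
  2^(-mean_depth / average_path_length(n))
}

# Example with a sub-sample size of 64 (assumed to correspond to Phi above).
shallow <- anomaly_score(mean_depth = 2, n = 64)
deep    <- anomaly_score(mean_depth = 12, n = 64)
```

Points that are isolated after few splits (`shallow`) score higher than points buried deep in the trees (`deep`).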
We can additionally generate two-dimensional contour plots by calling
anomaly_plot(
  x = "X",
  y = "Y",
  forest = fit,
  data = test_data[, 1:2],
  contour = TRUE
)
Or we can plot the individual points, classified as anomalous or not
anomaly_plot(
  x = "X",
  y = "Y",
  forest = fit,
  data = test_data[, 1:2],
  contour = FALSE,
  contamination = 0.15  # we have contaminated total_anomalous/nrow(test_data) of the observations
)
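A contamination rate can also be applied by hand to a vector of scores. The sketch below is a hypothetical helper, not part of the package: it assumes the scores are a plain numeric vector in which higher values are more anomalous, and it flags roughly the top `contamination` share of observations.

```r
# Flag observations whose score lies above the (1 - contamination) quantile.
flag_anomalies <- function(scores, contamination = 0.15) {
  stopifnot(is.numeric(scores), contamination > 0, contamination < 1)
  threshold <- stats::quantile(scores, probs = 1 - contamination, names = FALSE)
  scores > threshold
}

# Toy scores standing in for real model output.
toy_scores <- as.numeric(1:100)
flagged <- flag_anomalies(toy_scores, contamination = 0.1)
```

If the output of the scoring step has a different shape, for example a data frame, the relevant column would need to be extracted first. The flags can then be cross-tabulated against `anomaly_indicator` with `table()` to see how many of the planted anomalies were recovered.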