knitr::opts_chunk$set(echo = TRUE, fig.width=5, fig.height=5, fig.align='center')
library(isofor)

Isolation Forest

An Isolation Forest is an ensemble of completely random decision trees. At each split a random feature and a random split point is chosen. Anomalies are isolated if they end up in a partition far away from the rest of the data. In decision tree terms, this corresponds to a record that has a short "path length". The path length is the number of nodes that a record passes through before terminating in a leaf node. Records with short average path lengths through the entire ensemble are considered anomalies.

An analogy

Describing the location of a country home takes many fewer directions than describing the location of a brownstone in Brooklyn. The country home might be described as "the only house on the south shore of Lake Woebegon". While directions to the brownstone must be qualified with much more detail: "Go north on 5th Street for 12 blocks, take a left on Van Buren, etc.."

Isolated | Dense :-------------------------:|:-------------------------: |

The country house in this example is a literal outlier. It is off by itself away from most other homes. Similarly, records that can be described succinctly are also outliers.

Example

Here we create two random, normal vectors and add some outliers. The majority of the data points are centered around (0, 0) with a standard deviation of 1/2. 50 outliers are introduced and are centered around (-1.5, 1.5) with a standard deviation of 1. This is to encourage some co-mingling of outliers with the bulk of the data.

N = 1e3
x = c(rnorm(N, 0, 0.5), rnorm(N*0.05, -1.5, 1))
y = c(rnorm(N, 0, 0.5), rnorm(N*0.05,  1.5, 1))
ol = c(rep(0, N), rep(1, (0.05*N))) + 2
data = data.frame(x, y)
plot(data, pch=ol)
title("Dummy data with outliers")

The code below builds an Isolation Forest by passing in the dummy data, the number of trees requested (100) and the number of records to subsample for each tree (32). The records that exceed the 95% percentile of the anomaly score should flag the most anomalous records. By coloring such records as red and plotting the results the effectiveness of the Isolation Forest can be viewed.

mod = iForest(X = data, 100, 32)
p = predict(mod, data)
col = ifelse(p > quantile(p, 0.95), "red", "blue")
plot(x, y, col=col, pch=ol)

Knowing there are two populations, the Kmeans algorithm seems like a good fit for identifying the two clusters. However, we can see that it picks cluster centers that do not do a good job of separating the data.

km = kmeans(data, 2)
plot(x, y, col=km$cluster+1, pch=ol)

Node Membership

In addition to predicting an anomaly score, a node membership matrix can also be predicted. There are two ways to produce a nod membership matrix. The first returns a matrix of size n_obs x n_trees and contains the terminal node ID for each observation:

node_ids = predict(mod, data, nodes=TRUE)
head(node_ids[,1:10])

THe second uses the Matrix library to return a sparse matrix of 1s and 0s indicating node membership. There are as many ones as there are n_obs x n_trees but the dimension is much larger at n_obs x n_terminal_nodes:

library(Matrix)
nodes = predict(mod, data, sparse=TRUE)
head(nodes[,1:10])


Zelazny7/isofor documentation built on Aug. 28, 2019, 7:12 p.m.