isolationForest: Fit an Isolation Forest


Description

The 'solitude' class implements the isolation forest method introduced in the paper Isolation-Based Anomaly Detection (Liu, Ting and Zhou <doi:10.1145/2133360.2133363>). The extremely randomized trees (extratrees) required to build the isolation forest are grown using the ranger function from the ranger package.

Design

$new() initializes a new 'solitude' object. Its arguments and their defaults are shown in the usage under Method new().

$fit() fits an isolation forest for the given data frame or sparse matrix, computes the depths of the terminal nodes of each tree, and stores the anomaly scores and average depth values in the $scores object as a data.table.

$predict() returns anomaly scores for new data as a data.table.
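
To make the link between average depth and anomaly score concrete, here is a minimal sketch of the scoring rule from the isolation forest paper, s = 2^(-average_depth / c(n)), where c(n) is the average path length of an unsuccessful search in a binary search tree built on n points. The helper names avg_path_length and anomaly_score are hypothetical and not part of the package API, and approximating the harmonic number by log plus the Euler-Mascheroni constant is an assumption about the exact arithmetic used internally. With the default sample_size of 256, the sketch is consistent with the example output on this page, where an average depth of 8.00 corresponds to a score of about 0.582.

```r
# Hypothetical helpers illustrating the scoring rule; not part of solitude's API.

# Average path length c(n) of an unsuccessful search in a binary search tree
# built on n points, with H(i) approximated by log(i) + Euler-Mascheroni constant.
avg_path_length <- function(n) {
  euler_gamma <- 0.5772156649
  2 * (log(n - 1) + euler_gamma) - 2 * (n - 1) / n
}

# Anomaly score: shallower average depth gives a score closer to 1.
anomaly_score <- function(average_depth, sample_size = 256) {
  2^(-average_depth / avg_path_length(sample_size))
}

# Two average depths taken from the example output on this page.
scores <- anomaly_score(c(4.93, 8.00))
print(round(scores, 4))
```

Observations that are isolated after fewer splits have a smaller average depth and therefore a score closer to 1.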

Details

Methods

Public methods


Method new()

Usage
isolationForest$new(
  sample_size = 256,
  num_trees = 100,
  replace = FALSE,
  seed = 101,
  nproc = NULL,
  respect_unordered_factors = NULL,
  max_depth = ceiling(log2(sample_size))
)
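
As a quick illustration of the default max_depth expression in the signature above, the sketch below evaluates ceiling(log2(sample_size)) for a few sample sizes in base R. For the default sample_size of 256 it gives 8, which fits the example output on this page, where the largest average depth is 8.00. The sample sizes other than 256 are arbitrary illustrative values.

```r
# Evaluate the default depth cap for a few sample sizes (values other than
# 256 are arbitrary illustrations).
sample_sizes <- c(64, 256, 1000)
default_max_depth <- ceiling(log2(sample_sizes))
names(default_max_depth) <- sample_sizes
print(default_max_depth)
```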

Method fit()

Usage
isolationForest$fit(dataset)

Method predict()

Usage
isolationForest$predict(data)
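
A minimal sketch of the fit() and predict() calls on a small built-in dataset follows. It assumes the solitude package is installed. The choice of mtcars is illustrative, and sample_size is set to the number of rows because the default of 256 is larger than this dataset.

```r
library(solitude)

# Illustrative data: an all-numeric built-in dataset with 32 rows.
dataset <- mtcars

# Use every row for each tree, since the default sample_size of 256
# is larger than this dataset.
iso <- isolationForest$new(sample_size = nrow(dataset), seed = 101)
iso$fit(dataset)

# Score the same rows; the result has one row per observation.
scores <- iso$predict(dataset)
print(head(scores[order(-scores$anomaly_score), ]))
```

The returned table carries the id, average_depth and anomaly_score columns seen in the example output on this page.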

Method clone()

The objects of this class are cloneable with this method.

Usage
isolationForest$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

## Not run: 
library("solitude")
library("tidyverse")
library("mlbench")

data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes

splitter   = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test  = rsample::testing(splitter)

iso = isolationForest$new()
iso$fit(pima_train)

scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_train

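# Note: setNames() on a matrix sets its names attribute, not its column names.
# The V1/V2 columns come from as_tibble()'s compatibility name repair, which
# also emits the warning shown in the example output.
umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  setNames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))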

umap_train

umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))

scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_test

## End(Not run)

Example output

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
# A tibble: 768 x 9
   pregnant glucose pressure triceps insulin  mass pedigree   age diabetes
      <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl> <fct>   
 1        6     148       72      35       0  33.6    0.627    50 pos     
 2        1      85       66      29       0  26.6    0.351    31 neg     
 3        8     183       64       0       0  23.3    0.672    32 pos     
 4        1      89       66      23      94  28.1    0.167    21 neg     
 5        0     137       40      35     168  43.1    2.29     33 pos     
 6        5     116       74       0       0  25.6    0.201    30 neg     
 7        3      78       50      32      88  31      0.248    26 pos     
 8       10     115        0       0       0  35.3    0.134    29 neg     
 9        2     197       70      45     543  30.5    0.158    53 pos     
10        8     125       96       0       0   0      0.232    54 pos     
# … with 758 more rows
INFO  [00:54:28.436] Building Isolation Forest ...  
INFO  [00:54:30.004] done 
INFO  [00:54:30.018] Computing depth of terminal nodes ...  
INFO  [00:54:30.798] done 
INFO  [00:54:30.836] Completed growing isolation forest 
      id average_depth anomaly_score
  1: 229          4.93     0.7163710
  2: 296          5.26     0.7005536
  3:  96          5.67     0.6813873
  4: 181          5.69     0.6804659
  5: 196          6.35     0.6507483
 ---                                
380: 349          8.00     0.5820092
381: 360          8.00     0.5820092
382: 361          8.00     0.5820092
383: 362          8.00     0.5820092
384: 383          8.00     0.5820092
Warning message:
The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
# A tibble: 384 x 5
   rowid     V1     V2 average_depth anomaly_score
   <int>  <dbl>  <dbl>         <dbl>         <dbl>
 1     1 -1.78  -1.72           7.96         0.584
 2     2 -1.15  -1.41           7.98         0.583
 3     3 -1.85   0.753          6.71         0.635
 4     4 -0.985 -5.18           7.53         0.601
 5     5  3.13   0.564          7.84         0.588
 6     6 -0.934 -5.14           7.62         0.597
 7     7 -2.62  -0.288          7.56         0.600
 8     8 -2.09   0.176          8            0.582
 9     9 -2.09   1.06           7.9          0.586
10    10  3.72   0.876          7.92         0.585
# … with 374 more rows
      id average_depth anomaly_score
  1:  34          5.70     0.6800056
  2: 166          5.86     0.6726840
  3: 252          5.94     0.6690528
  4:  83          6.51     0.6437417
  5: 109          6.52     0.6433063
 ---                                
380: 271          8.00     0.5820092
381: 273          8.00     0.5820092
382: 322          8.00     0.5820092
383: 323          8.00     0.5820092
384: 349          8.00     0.5820092
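
One way to act on scores like those printed above is to flag observations whose anomaly_score exceeds a cutoff. The sketch below uses a handful of (id, anomaly_score) pairs copied from the test-set output above. The 0.65 cutoff and the flag_anomalies name are hypothetical choices for illustration; a suitable cutoff depends on the data.

```r
# A few rows copied from the test-set example output above.
scores <- data.frame(
  id            = c(34, 166, 252, 83, 109, 271),
  anomaly_score = c(0.6800056, 0.6726840, 0.6690528,
                    0.6437417, 0.6433063, 0.5820092)
)

# Hypothetical helper: keep rows scoring above a user-chosen cutoff.
flag_anomalies <- function(scores, cutoff = 0.65) {
  scores[scores$anomaly_score > cutoff, , drop = FALSE]
}

flagged <- flag_anomalies(scores)
print(flagged$id)
```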

solitude documentation built on July 30, 2021, 1:07 a.m.