The 'solitude' class implements the isolation forest method
introduced in the paper Isolation-Based Anomaly Detection (Liu, Ting and Zhou
<doi:10.1145/2133360.2133363>). The extremely randomized trees (extratrees)
required to build the isolation forest are grown using the
ranger
function from the ranger package.
$new()
initiates a new 'solitude' object. The
possible arguments are:
sample_size
: (positive integer, default = 256) Number of
observations in the dataset used to build each tree in the forest
num_trees
: (positive integer, default = 100) Number of trees
to be built in the forest
replace
: (boolean, default = FALSE) Whether the sample of
observations should be chosen with replacement when sample_size is less
than the number of observations in the dataset
seed
: (positive integer, default = 101) Random seed for the
forest
nproc
: (NULL or a positive integer, default: NULL, means use
all resources) Number of parallel threads to be used by ranger
respect_unordered_factors
: (string, default: "partition") See
respect.unordered.factors argument in ranger
max_depth
: (positive number, default:
ceiling(log2(sample_size))) See max.depth argument in
ranger
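As an illustration of the arguments listed above, the following sketch constructs a forest with non-default settings. The particular values are arbitrary and chosen only for the example; every argument name is taken from the list above, and nothing here is meant as a recommended configuration.

```r
library("solitude")

# Arbitrary illustrative values; all argument names are documented above.
iso <- isolationForest$new(
  sample_size = 128,   # observations drawn for each tree
  num_trees   = 200,   # number of trees in the forest
  replace     = FALSE, # sample observations without replacement
  seed        = 42,    # random seed for the forest
  nproc       = 2      # parallel threads used by ranger
)

# Depth limit implied by the documented default for this sample_size
ceiling(log2(128))  # 7
```

Leaving max_depth unset keeps the documented default, ceiling(log2(sample_size)), so a smaller sample_size also yields shallower trees.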
$fit()
fits an isolation forest for the given dataframe or sparse matrix, computes
the depths of the terminal nodes of each tree, and stores the anomaly scores and
average depth values in the $scores
object as a data.table
$predict()
returns anomaly scores for new data as a data.table
Parallelization: ranger
is parallelized and by
default uses all available resources. This is the behaviour when nproc is set to
NULL. The process of obtaining the depths of terminal nodes (which is executed
when $fit()
is called) may be parallelized separately by setting up
a future backend.
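A minimal sketch of setting up a future backend before fitting is shown below. It assumes the future package is installed; the multisession plan, the worker count of 2, and the use of the built-in iris data are choices made for this illustration rather than anything the package prescribes.

```r
library("solitude")

# Assumption: a multisession backend with 2 workers; any future plan could be used.
future::plan(future::multisession, workers = 2)

# sample_size must not exceed the number of rows (iris has 150).
iso <- isolationForest$new(sample_size = 100)
iso$fit(iris[, 1:4])   # terminal-node depths are computed during $fit()

head(iso$scores)       # anomaly scores and average depths stored by $fit()

# Return to sequential processing once done.
future::plan(future::sequential)
```

Whether a parallel backend helps depends on the size of the data; for small datasets the overhead of starting workers may outweigh any gain.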
new()
isolationForest$new(
  sample_size = 256,
  num_trees = 100,
  replace = FALSE,
  seed = 101,
  nproc = NULL,
  respect_unordered_factors = NULL,
  max_depth = ceiling(log2(sample_size))
)
fit()
isolationForest$fit(dataset)
predict()
isolationForest$predict(data)
clone()
The objects of this class are cloneable with this method.
isolationForest$clone(deep = FALSE)
deep
Whether to make a deep clone.
## Not run:
library("solitude")
library("tidyverse")
library("mlbench")
data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes
splitter = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test = rsample::testing(splitter)
iso = isolationForest$new()
iso$fit(pima_train)
scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))
scores_train
umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  setNames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))
umap_train
umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))
scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))
scores_test
## End(Not run)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2 ✔ purrr 0.3.4
✔ tibble 3.0.4 ✔ dplyr 1.0.2
✔ tidyr 1.1.2 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
# A tibble: 768 x 9
pregnant glucose pressure triceps insulin mass pedigree age diabetes
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 0 33.6 0.627 50 pos
2 1 85 66 29 0 26.6 0.351 31 neg
3 8 183 64 0 0 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 0 0 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 0 0 0 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 0 0 0 0.232 54 pos
# … with 758 more rows
INFO [00:54:28.436] Building Isolation Forest ...
INFO [00:54:30.004] done
INFO [00:54:30.018] Computing depth of terminal nodes ...
INFO [00:54:30.798] done
INFO [00:54:30.836] Completed growing isolation forest
id average_depth anomaly_score
1: 229 4.93 0.7163710
2: 296 5.26 0.7005536
3: 96 5.67 0.6813873
4: 181 5.69 0.6804659
5: 196 6.35 0.6507483
---
380: 349 8.00 0.5820092
381: 360 8.00 0.5820092
382: 361 8.00 0.5820092
383: 362 8.00 0.5820092
384: 383 8.00 0.5820092
Warning message:
The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# A tibble: 384 x 5
rowid V1 V2 average_depth anomaly_score
<int> <dbl> <dbl> <dbl> <dbl>
1 1 -1.78 -1.72 7.96 0.584
2 2 -1.15 -1.41 7.98 0.583
3 3 -1.85 0.753 6.71 0.635
4 4 -0.985 -5.18 7.53 0.601
5 5 3.13 0.564 7.84 0.588
6 6 -0.934 -5.14 7.62 0.597
7 7 -2.62 -0.288 7.56 0.600
8 8 -2.09 0.176 8 0.582
9 9 -2.09 1.06 7.9 0.586
10 10 3.72 0.876 7.92 0.585
# … with 374 more rows
id average_depth anomaly_score
1: 34 5.70 0.6800056
2: 166 5.86 0.6726840
3: 252 5.94 0.6690528
4: 83 6.51 0.6437417
5: 109 6.52 0.6433063
---
380: 271 8.00 0.5820092
381: 273 8.00 0.5820092
382: 322 8.00 0.5820092
383: 323 8.00 0.5820092
384: 349 8.00 0.5820092
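The average_depth and anomaly_score columns printed above appear to be related by the score definition in the cited paper, s = 2^(-E[h] / c(n)), where c(n) is the average path length normaliser for a sample of size n. The base R sketch below is an assumption about how the two columns relate, not code taken from the package: it approximates the harmonic number by log(n - 1) plus Euler's constant, and the function names average_path_length and anomaly_score are hypothetical helpers introduced for illustration. With the default sample_size of 256 it reproduces the printed values to about four decimal places.

```r
# Normalising constant c(n) from Liu, Ting and Zhou.
# Assumption: H(n - 1) is approximated by log(n - 1) + Euler's constant.
average_path_length <- function(n) {
  euler_gamma <- 0.5772156649
  2 * (log(n - 1) + euler_gamma) - 2 * (n - 1) / n
}

# Anomaly score for a given average depth: shallower points score higher.
anomaly_score <- function(average_depth, sample_size = 256) {
  2^(-average_depth / average_path_length(sample_size))
}

# Average depths taken from the first and last rows of the training output.
anomaly_score(c(4.93, 8.00))  # approximately 0.7164 and 0.5820
```

An average depth of 8 equals the default max_depth, ceiling(log2(256)), which is consistent with many rows sharing the same lowest score of about 0.582.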