```r
knitr::opts_chunk$set(collapse = TRUE)
library("dsos")
```
Please see the arXiv paper for details. We denote the R package as `dsos` to avoid confusion with D-SOS, the method.
We show how easy it is to implement D-SOS for a particular notion of
outlyingness. Suppose we want to test for no adverse shift based on isolation
scores in the context of multivariate two-sample comparison. To do so, we need
two main ingredients: a score function and a method to compute the $p$-value.
First, the scores are obtained from the predictions of an isolation forest fit with the
`isotree` package [@isotree]. Isolation forest detects isolated points,
instances that are typically out-of-distribution relative to the high-density
regions of the data distribution. Naturally, any performant method for
density-based out-of-distribution detection can be used to achieve
the same goal; isolation forest just happens to be a convenient way to do this.
The internal function `outliers_no_split` shows the implementation of one such
score function in the `dsos` package.
```r
dsos:::outliers_no_split
```
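For intuition, here is a minimal sketch of what such a score function might look like when written directly against the `isotree` API. The function name, the choice to fit on the training sample only, and the `ntrees` value are illustrative assumptions and may differ from what `outliers_no_split` actually does.

```r
# A hedged sketch of an isolation-forest score function; fitting on the
# training sample only and ntrees = 500 are illustrative choices.
library(isotree)

scores_from_iforest <- function(x_train, x_test, ntrees = 500L) {
  # Fit the isolation forest on the training sample
  iso_fit <- isotree::isolation.forest(x_train, ntrees = ntrees)
  # Higher scores correspond to more isolated (outlying) observations
  list(
    train = predict(iso_fit, newdata = x_train),
    test  = predict(iso_fit, newdata = x_test)
  )
}
```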
Second, we estimate the empirical null distribution for the $p$-value via
permutations. For speed, this is implemented as a sequential Monte Carlo test
with the `simctest` package [@simctest]. Permutations spare us from having to
derive the asymptotic null distribution of the test statistic. The function
`od_pt` in the `dsos` package combines the scoring with the inference. The
prefix `od` stands for outlier detection and the suffix `pt`, for permutations.
`dsos` also provides sample-splitting and out-of-bag variants as alternatives
for computing $p$-values. Both variants use the asymptotic null distribution of
the test statistic and, as a result, can be appreciably faster than inference
based on permutations. The code for `od_pt` is relatively straightforward.
```r
dsos::od_pt
```
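To make the permutation idea concrete, the sketch below computes an empirical $p$-value for a generic statistic by reshuffling the pooled samples. It is a plain, non-sequential Monte Carlo test; `od_pt` relies on `simctest` for sequential early stopping, which this sketch omits. The names `permutation_pvalue`, `stat_fn`, and `n_pt` are hypothetical.

```r
# An illustrative (non-sequential) permutation test for a generic statistic
# computed from two samples; stat_fn and n_pt are hypothetical names.
permutation_pvalue <- function(x_train, x_test, stat_fn, n_pt = 200L) {
  observed <- stat_fn(x_train, x_test)
  pooled <- rbind(x_train, x_test)
  n_train <- nrow(x_train)
  # Recompute the statistic on random splits of the pooled sample
  null_stats <- replicate(n_pt, {
    idx <- sample.int(nrow(pooled), n_train)
    stat_fn(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })
  # Empirical p-value with the +1 correction for finite permutations
  (1 + sum(null_stats >= observed)) / (1 + n_pt)
}
```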
Take the `iris` dataset, for example. When the training set consists only of
Iris setosa (one flower species) and the test set only of Iris versicolor, the
data are incompatible with the null of no adverse shift. In other words, we have
strong evidence that the test set contains a disproportionate number of
outliers, if the training set is the reference distribution.
```r
set.seed(12345)
data(iris)
x_train <- iris[1:50, 1:4]   # Training sample: Species == 'setosa'
x_test <- iris[51:100, 1:4]  # Test sample: Species == 'versicolor'
iris_test <- od_pt(x_train, x_test)
plot(iris_test)
```
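As a sanity check, one would typically expect the opposite conclusion, a large $p$-value, when both samples come from the same distribution, for instance by splitting the setosa observations at random:

```r
# Both samples drawn from the same species: no adverse shift expected,
# so the p-value should typically be large.
set.seed(12345)
setosa <- iris[1:50, 1:4]
half <- sample.int(50, 25)
null_test <- od_pt(setosa[half, ], setosa[-half, ])
plot(null_test)
```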
You can plug your own scores into this framework. Those already implemented in
the package can be useful, but they are by no means the only ones. If you favor a
different method for out-of-distribution (outlier) detection, want to tune
the hyperparameters, or prefer a different notion of outlyingness altogether,
`dsos` provides the building blocks to build your own. The workhorse function,
powering the approach behind the scenes, is a way to calculate the test
statistic, the WAUC, from the outlier scores (see `wauc_from_os`).
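As an illustration of swapping in a different notion of outlyingness, the sketch below scores observations with the Mahalanobis distance to the training sample and feeds the scores to `wauc_from_os`. The Mahalanobis choice and the argument order of `wauc_from_os` (training scores first, then test scores) are assumptions for this example, not verified against the package documentation.

```r
# A hedged sketch: Mahalanobis distance as a custom notion of outlyingness.
# The argument order of wauc_from_os is assumed (training scores, test scores).
mahalanobis_scores <- function(x_train, x_test) {
  center <- colMeans(x_train)
  sigma <- stats::cov(x_train)
  list(
    train = stats::mahalanobis(x_train, center, sigma),
    test  = stats::mahalanobis(x_test, center, sigma)
  )
}

os <- mahalanobis_scores(x_train, x_test)
wauc_stat <- dsos::wauc_from_os(os$train, os$test)
```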