outForest: Multivariate Outlier Detection and Replacement by Random...

Description Usage Arguments Details Value References See Also Examples

View source: R/outForest.R

Description

This function provides a random forest based implementation of the method described in Chapter 7.1.2 ("Regression Model Based Anomaly detection") of Chandola et al. Each numeric variable to be checked for outliers is regressed onto all other variables using a random forest. If the scaled absolute difference between observed value and out-of-bag prediction is larger than some predefined threshold (default is 3), then a value is considered an outlier, see Details below. After identification of outliers, they can be replaced e.g. by predictive mean matching from the non-outliers. Since the random forest algorithm 'ranger' does not allow for missing values, any missing value is first being imputed by chained random forests. The method can be viewed as a multivariate extension of a basic univariate outlier detection method where a value is considered an outlier if it is more than e.g. three times the standard deviation away from its mean. In the multivariate case, instead of comparing a value with the overall mean, rather the difference to the conditional mean is considered. The 'outForest' function estimates this conditional mean by a random forest. If the method is trained on a reference data with option allow_predictions, it can be applied to new data.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
outForest(
  data,
  formula = . ~ .,
  replace = c("pmm", "predictions", "NA", "no"),
  pmm.k = 3,
  threshold = 3,
  max_n_outliers = Inf,
  max_prop_outliers = 1,
  min.node.size = 40,
  allow_predictions = FALSE,
  impute_multivariate = TRUE,
  impute_multivariate_control = list(pmm.k = 3, num.trees = 50, maxiter = 3L),
  seed = NULL,
  verbose = 1,
  ...
)

Arguments

data

A data.frame to be assessed for numeric outliers.

formula

A two-sided formula specifying variables to be checked (left hand side) and variables used to check (right hand side). Defaults to . ~ ., i.e. use all variables to check all (numeric) variables.

replace

Should outliers be replaced by predicting mean matching on the OOB predictions ("pmm", the default), by OOB predictions ("predictions"), by NA ("NA"). Use "no" to keep outliers as they are.

pmm.k

For replace = "pmm", how many nearest prediction neighbours (without outliers) be considered to sample observed values from?

threshold

Threshold above which an outlier score is considered an outlier. The default is 3.

max_n_outliers

Maximal number of outliers to identify. Will be used in combination with threshold and max_prop_outliers.

max_prop_outliers

Maximal relative count of outliers. Will be used in combination with threshold and max_n_outliers.

min.node.size

Minimal node size of the random forests. With 40, the value is relatively high. This reduces the impact of outliers.

allow_predictions

Should the resulting outForest be used on new data? Default is FALSE as fitted random forests can be huge.

impute_multivariate

If TRUE (default), missing values are imputed by missRanger::missRanger. Otherwise, by univariate sampling.

impute_multivariate_control

Parameters passed to missRanger::missRanger if data contains missing values.

seed

Integer random seed.

verbose

Controls how much outliers is printed to screen. 0 to print nothing, 1 prints information.

...

Arguments passed to ranger. If the data set is large, use less trees (e.g. num.trees = 20) and/or a low value of mtry.

Details

The outlier score of the i-th value x_ij of the j-th variable is defined as s_ij = (x_ij - pred_ij) / rmse_j, where pred_ij is the corresponding out-of-bag prediction of the j-th random forest and rmse_j its RMSE. If |s_ij| > L with threshold L, then x_ij is considered an outlier. For large data sets, just by chance, many values can surpass the default threshold of 3. To reduce the number of outliers, the threshold can be increased. Alternatively, the number of outliers can be limited by the two arguments max_n_outliers and max_prop_outliers. E.g. if at most ten outliers are to be identified, set max_n_outliers = 10.

Value

An object of type 'outForest' and a list with the following elements.

References

  1. Chandola V., Banerjee A., and Kumar V. (2009). Anomaly detection: A survey. ACM Comput. Surv. 41, 3, Article 15 <dx.doi.org/10.1145/1541880.1541882>.

  2. Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. <arxiv.org/abs/1508.04409>.

See Also

outliers, Data, plot.outForest, summary.outForest, predict.outForest.

Examples

1
2
3
4
5
6
head(irisWithOut <- generateOutliers(iris, seed = 345))
(out <- outForest(irisWithOut))
outliers(out)
head(Data(out))
plot(out)
plot(out, what = "scores")

Example output

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Outlier identification by random forests

  Variables to check:		Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
  Variables used to check:	Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species

  Checking: Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  I am an object of class(es) outForest and list 

The following number of outliers have been identified:

             Number of outliers
Sepal.Length                  4
Sepal.Width                   3
Petal.Length                  4
Petal.Width                   2
   row          col   observed  predicted     rmse     score threshold
5   34  Sepal.Width  -8.049668 3.52262141 1.236004 -9.362667         3
13 137  Petal.Width -11.414393 1.92692803 1.477561 -9.029287         3
8   14 Petal.Length  20.118421 0.44364273 3.036131  6.480213         3
4  144 Sepal.Length  -7.694186 6.03453851 2.265894 -6.058855         3
1   16 Sepal.Length  18.192743 5.02003503 2.265894  5.813470         3
10 144 Petal.Length -11.435635 5.54842165 3.036131 -5.593980         3
2   40 Sepal.Length  -7.414174 5.00302348 2.265894 -5.480043         3
3  135 Sepal.Length  18.474192 6.20237447 2.265894  5.415883         3
11 145 Petal.Length -10.274621 5.27119073 3.036131 -5.120270         3
12   7  Petal.Width  -7.038319 0.04314408 1.477561 -4.792670         3
9   33 Petal.Length  -9.622427 2.00996334 3.036131 -3.831320         3
7  145  Sepal.Width   6.519407 2.39920688 1.236004  3.333485         3
6   67  Sepal.Width   6.839742 2.83756540 1.236004  3.237998         3
   replacement
5     3.700000
13    1.900000
8     1.500000
4     6.300000
1     4.900000
10    5.300000
2     4.900000
3     6.200000
11    5.100000
12    0.200000
9    -6.769464
7     2.500000
6     2.900000
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

outForest documentation built on Jan. 7, 2021, 9:10 a.m.