outlier.tree: Outlier Tree

View source: R/outliertree.R

outlier.tree    R Documentation

Outlier Tree

Description

Fit an Outlier Tree model to normal data that might contain some outliers.

Usage

outlier.tree(
  df,
  max_depth = 4L,
  min_gain = 0.01,
  z_norm = 2.67,
  z_outlier = 8,
  pct_outliers = 0.01,
  min_size_numeric = 25L,
  min_size_categ = 50L,
  categ_split = "binarize",
  categ_outliers = "tail",
  numeric_split = "raw",
  cols_ignore = NULL,
  follow_all = FALSE,
  gain_as_pct = TRUE,
  save_outliers = FALSE,
  outliers_print = 10L,
  min_decimals = 2L,
  nthreads = parallel::detectCores()
)

Arguments

df

Data Frame with regular (i.e. non-outlier) data that might contain some outliers. See details for allowed column types.

max_depth

Maximum depth of the trees to grow. Can also pass zero, in which case it will only look for outliers with no conditions (i.e. takes each column as a 1-d distribution and looks for outliers in there independently of the values in other columns).
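As a quick sketch of the depth-zero case, using the 'hypothyroid' dataset bundled with the package (as in the Examples below):

```r
library(outliertree)
data(hypothyroid)

## With max_depth = 0L, no splitting conditions are generated:
## each column is treated as an independent 1-d distribution
## and scanned for outliers on its own
model_flat <- outlier.tree(hypothyroid, max_depth = 0L,
                           outliers_print = 5L, nthreads = 1L)
```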

min_gain

Minimum gain that a split has to produce in order to consider it (both in terms of looking for outliers in each branch, and in considering whether to continue branching from them). Note that the default value for GritBot is 1e-6, with 'gain_as_pct' = 'FALSE', but it's recommended to pass higher values (e.g. 1e-1) when using 'gain_as_pct' = 'FALSE'.

z_norm

Maximum Z-value (from standard normal distribution) that can be considered as a normal observation. Note that simply having values above this will not automatically flag observations as outliers, nor does it assume that columns follow normal distributions. Also used for categorical and ordinal columns for building approximate confidence intervals of proportions.

z_outlier

Minimum Z-value that can be considered as an outlier. There must be a large gap in the Z-value of the next observation in sorted order to consider it as outlier, given by (z_outlier - z_norm). Decreasing this parameter is likely to result in more observations being flagged as outliers. Ignored for categorical and ordinal columns.

pct_outliers

Approximate max percentage of outliers to expect in a given branch.

min_size_numeric

Minimum size that branches need to have when splitting a numeric column. In order to look for outliers in a given branch for a numeric column, it must have a minimum of twice this number of observations.

min_size_categ

Minimum size that branches need to have when splitting a categorical or ordinal column. In order to look for outliers in a given branch for a categorical, ordinal, or boolean column, it must have a minimum of twice this number of observations.

categ_split

How to produce categorical-by-categorical splits. Options are:

  • '"binarize"' : Will binarize the target variable according to whether it's equal to each present category within it (greater/less for ordinal), and split each binarized variable separately.

  • '"bruteforce"' : Will evaluate each possible binary split of the categories (that is, it evaluates 2^n potential splits every time). Note that trying this when there are many categories in a column will result in exponential computation time that might never finish.

  • '"separate"' : Will create one branch per category of the splitting variable (this is how GritBot handles them).

categ_outliers

How to look for outliers in categorical variables. Options are:

  • '"tail"' : Will try to flag outliers if there is a large gap between proportions in sorted order, and this gap is unexpected given the prior probabilities. Such a criterion sometimes tends to flag too many uninteresting outliers, but it is able to detect more cases and to recognize outliers when there is no single dominant category.

  • '"majority"' : Will calculate an equivalent to z-value according to the number of observations that do not belong to the majority class, according to formula '(n-n_maj)/(n * p_prior) < 1/z_outlier^2'. Such a criterion tends to miss many interesting outliers and will only be able to flag outliers in large sample sizes. This is the approach used by GritBot.
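To illustrate the '"majority"' criterion, the inequality above can be evaluated directly in R (the variable names here are illustrative only, not part of the package API):

```r
## Hypothetical branch: n observations, n_maj of them in the
## majority category, p_prior = prior probability of the
## non-majority categories
n         <- 1000
n_maj     <- 996
p_prior   <- 0.30
z_outlier <- 8

## Non-majority observations get flagged when their observed
## fraction is far below what the prior would suggest:
## 4/300 = 0.0133... < 1/64 = 0.015625, so this evaluates to TRUE
(n - n_maj) / (n * p_prior) < 1 / z_outlier^2
```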

numeric_split

How to determine the split point in numeric variables. Options are:

  • '"mid"' : Will calculate the midpoint between the largest observation that goes to the '<=' branch and the smallest observation that goes to the '>' branch.

  • '"raw"' : Will set the split point as the value of the largest observation that goes to the '<=' branch.

This doesn't affect how outliers are determined in the training data passed in 'df', but it does affect the way in which they are presented and the way in which new outliers are detected when using 'predict'. '"mid"' is recommended for continuous-valued variables, while '"raw"' will provide more readable explanations for counts data at the expense of perhaps slightly worse generalizability to unseen data.
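For continuous data, '"mid"' can be requested at fit time; a minimal sketch, again using the bundled 'hypothyroid' dataset:

```r
library(outliertree)
data(hypothyroid)

## 'mid' split points: reported thresholds are midpoints between
## the two branches, which tends to read better for continuous
## measurements than the raw '<=' boundary value
model_mid <- outlier.tree(hypothyroid, numeric_split = "mid",
                          outliers_print = 5L, nthreads = 1L)
```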

cols_ignore

Vector containing columns which will not be split, but will be evaluated for usage in splitting other columns. Can pass either a logical (boolean) vector with the same number of columns as 'df', or a character vector of column names (must match with those of 'df'). Pass 'NULL' to use all columns.
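Both accepted forms of 'cols_ignore' can be sketched as follows (this assumes 'hypothyroid' has a column named 'age', which is not guaranteed here):

```r
library(outliertree)
data(hypothyroid)

## Character form: 'age' may still appear in the conditions used
## to split other columns, but no outliers are flagged in it
model_ign <- outlier.tree(hypothyroid, cols_ignore = "age",
                          outliers_print = 5L, nthreads = 1L)

## Equivalent logical form: one entry per column of the data frame
model_ign2 <- outlier.tree(hypothyroid,
                           cols_ignore = names(hypothyroid) == "age",
                           outliers_print = 0L, nthreads = 1L)
```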

follow_all

Whether to continue branching from each split that meets the size and gain criteria. This will produce exponentially many more branches, and if depth is large, might take forever to finish. Will also produce a lot more spurious outliers. Not recommended.

gain_as_pct

Whether the minimum gain above should be taken in absolute terms, or as a percentage of the standard deviation (for numerical columns) or Shannon entropy (for categorical columns). Taking it in absolute terms will prefer making more splits on columns that have a large variance, while taking it as a percentage might be more restrictive on them and might create deeper trees in some columns. For GritBot this parameter would always be 'FALSE'. Recommended to pass higher values for 'min_gain' when passing 'FALSE' here. Note that when 'gain_as_pct' = 'FALSE', the results will be sensitive to the scales of variables.

save_outliers

Whether to store outliers detected in 'df' in the object that is returned. These outliers can then be extracted from the returned object through function 'extract.training.outliers'.

outliers_print

Maximum number of flagged outliers in the training data to print after fitting the model. Pass zero or 'NULL' to avoid printing any. Outliers can be printed from the resulting data frame afterwards through the 'predict' method, or through the 'print' method (on the extracted outliers, not on the model object) if passing 'save_outliers' = 'TRUE'.

min_decimals

Minimum number of decimals to use when printing numeric values for the flagged outliers. The number of decimals will be dynamically increased according to the relative magnitudes of the values being reported. Ignored when passing 'outliers_print=0' or 'outliers_print=FALSE'.

nthreads

Number of parallel threads to use. When fitting the model, it will only use up to one thread per column, while for prediction it will use up to one thread per row. The more threads that are used, the more memory will be required and allocated, so using more threads will not always lead to better speed. Can be changed after the object is already initialized.

Details

Explainable outlier detection through decision-tree grouping. Tries to detect outliers by generating decision trees that attempt to "predict" the values of each column based on each other column, testing in each branch of every tried split (if it meets some minimum criteria) whether there are observations that seem too distant from the others in a 1-D distribution for the column that the split tries to "predict" (unlike other methods, this will not generate a score for each observation).

Splits are based on gain, while outlierness is based on confidence intervals. Similar in spirit to the GritBot software developed by RuleQuest Research.

Supports columns of types numeric (either as type 'numeric' or 'integer'), categorical (either as type 'character' or 'factor' with unordered levels), boolean (as type 'logical'), and ordinal (as type 'factor' with ordered levels as checked by 'is.ordered'). Can handle missing values in any of them. Can also pass dates/timestamps that will get converted to numeric but shown as dates/timestamps in the output. Offers option to set columns to be used only for generating conditions without looking at outliers in them.

Infinite values will be taken into consideration when the column is used to split another column (that is, +inf will go into the branch that is greater than something, -inf into the other branch), but when a column is the target of the split, they will be taken as missing - that is, it will not report infinite values as outliers.
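A small toy example of this behavior (synthetic data, not from the package):

```r
library(outliertree)

## 'x' contains an infinite entry: it can still be used when
## splitting 'y' (+Inf falls on the '>' side of any threshold),
## but when 'x' itself is the target of a split, the Inf is
## treated as missing and is never reported as an outlier
set.seed(1)
df_inf <- data.frame(x = c(rnorm(100), Inf),
                     y = c(rnorm(100), 1))
model_inf <- outlier.tree(df_inf, outliers_print = 0L, nthreads = 1L)
```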

Value

An object with the fitted model that can be used to detect more outliers in new data, and from which outliers in the training data can be extracted (when passing 'save_outliers' = 'TRUE').

See Also

predict.outliertree extract.training.outliers hypothyroid

Examples

library(outliertree)

### example dataset with interesting outliers
data(hypothyroid)

### fit the model and get a print of outliers
model <- outlier.tree(hypothyroid,
  outliers_print=10,
  save_outliers=TRUE,
  nthreads=1)

### extract outlier info as R list
outliers <- extract.training.outliers(model)
summary(outliers)

### information for row 745 (list of lists)
outliers[[745]]

### outliers can be sliced too
outliers[700:1000]

### use custom row names
df.w.names <- hypothyroid
row.names(df.w.names) <- paste0("rownum", 1:nrow(hypothyroid))
outliers.w.names <- predict(model, df.w.names, return_outliers=TRUE, nthreads=1)
outliers.w.names[["rownum745"]]

outliertree documentation built on Nov. 22, 2023, 1:08 a.m.