flag: Flag outliers


View source: R/outlier_detection.R

Description

Flag outliers in a multidimensional data set

Usage

flag(x, level = 0.1, nmax = NULL, side = NULL, crit = "lof",
  asInt = TRUE, k = 5, metric = "euclidean", q = 3,
  na.propagate = FALSE)

Arguments

x

a matrix, data frame or vector of data points (a vector is treated as 1D data, equivalent to a 1-column matrix). Each row is a data point and each column is a dimension. NA values are allowed; the flag assigned to data points containing NA is controlled by na.propagate.

level

threshold for finding outliers, between 0 and 1. Smaller values set a higher bar for outliers and so typically flag fewer of them. For probabilistic methods (such as those based on the normal distribution), it is the significance level. For LOF, the LOF threshold is set to 1/level - 1. level=0 flags no outliers. The LOF method with level=1 flags all points as outliers (or as many as nmax allows). Note, however, that the Grubbs method may still leave some points unflagged even with level=1.
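
The mapping from level to the LOF threshold stated above can be checked with a few lines of base R. This is only an illustration of the formula 1/level - 1; the helper name lof_threshold is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): LOF threshold implied by `level`,
# following the formula 1/level - 1 given in the argument description.
lof_threshold <- function(level) {
  stopifnot(is.numeric(level), level >= 0, level <= 1)
  1 / level - 1
}

lof_threshold(0.1)  # default level: threshold 9
lof_threshold(0.5)  # threshold 1
lof_threshold(1)    # threshold 0: every positive LOF value exceeds it
lof_threshold(0)    # threshold Inf: no LOF value can exceed it
```

Since LOF values are positive, a threshold of 0 is exceeded by every point and a threshold of Inf by none, which matches the behavior described for level=1 and level=0.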

nmax

the maximum number of outliers to flag. If NULL, no limit is applied.

side

if set to 'left', 'right' or 'both' (case-insensitive; can be abbreviated to one letter), only the outliers on the left, right or both ends of the 1D distribution will be flagged. If NULL, all outliers will be flagged. If the data is not 1D, side is ignored. Note that for methods that only find outliers on the sides of the distribution (e.g. Chauvenet), NULL and 'both' give equivalent results.
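
A minimal sketch of how case-insensitive, abbreviated values of side could be resolved in base R. The helper normalize_side is hypothetical, and whether the package uses match.arg internally is an assumption.

```r
# Hypothetical helper (not part of the package): resolve `side` to one of
# "left", "right", "both", or keep NULL meaning "no restriction".
normalize_side <- function(side) {
  if (is.null(side)) return(NULL)
  # tolower() gives case-insensitivity; match.arg() does partial matching
  match.arg(tolower(side), c("left", "right", "both"))
}

normalize_side("L")     # "left"
normalize_side("Bo")    # "both"
normalize_side("right") # "right"
normalize_side(NULL)    # NULL
```

An unrecognized value such as "x" makes match.arg raise an error rather than silently falling back to a default.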

crit

criterion to use for identifying outliers. Currently, can be either 'LOF' or 'Grubbs'. Any unambiguous substring can be given, case insensitive. If 'Grubbs', the 1D Grubbs method will be applied along each principal axis of the data and points deemed outliers along at least one axis will be flagged.
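
The idea of applying a 1D Grubbs test along each principal axis can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: it runs a single two-sided Grubbs test per axis (the package may iterate or handle side and nmax differently), uses prcomp for the principal axes, and the function name grubbs_axis_flags is hypothetical.

```r
# Hypothetical sketch: one two-sided Grubbs test per principal axis.
# A point is flagged if it is the rejected extreme on at least one axis.
grubbs_axis_flags <- function(m, level = 0.1) {
  m <- as.matrix(m)
  n <- nrow(m)
  stopifnot(n > 2)
  flagged <- rep(FALSE, n)
  if (level <= 0) return(flagged)        # level = 0 flags nothing
  scores <- prcomp(m)$x                  # coordinates along principal axes
  # Two-sided Grubbs critical value at significance `level`
  t2 <- qt(1 - level / (2 * n), df = n - 2)^2
  g_crit <- (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
  for (j in seq_len(ncol(scores))) {
    s <- scores[, j]
    s_sd <- sd(s)
    if (!is.finite(s_sd) || s_sd == 0) next  # skip degenerate axes
    z <- abs(s - mean(s)) / s_sd
    i <- which.max(z)                    # most extreme point on this axis
    if (z[i] > g_crit) flagged[i] <- TRUE
  }
  flagged
}

# Ten points near the line y = x plus one point far from the rest
m <- cbind(c(1:10, 60), c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9, -40))
flags <- grubbs_axis_flags(m, level = 0.1)
which(flags)
```

In this toy data set the last point dominates the first principal axis, so it is the one the test rejects there.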

asInt

if TRUE, the flag values will be integers (1 for outlier and 0 otherwise). If FALSE, they will be logical (TRUE for outlier and FALSE otherwise).

k

number of nearest neighbors for the LOF calculation
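
To show what role k plays, here is a compact base-R sketch of the standard local outlier factor computation. It is not the package's code: the function name lof_scores is hypothetical, ties at the k-th neighbor distance are broken by row order instead of enlarging the neighborhood, and duplicated points (zero distances) are not handled.

```r
# Hypothetical sketch of LOF with exactly k neighbors per point.
lof_scores <- function(x, k = 5, metric = "euclidean", q = 3) {
  d <- as.matrix(dist(x, method = metric, p = q))  # pairwise distances
  n <- nrow(d)
  stopifnot(k >= 1, k < n)
  diag(d) <- Inf                                   # a point is not its own neighbor
  # indices of the k nearest neighbors of each point
  nn <- lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])
  # distance from each point to its k-th nearest neighbor
  k_dist <- vapply(seq_len(n), function(i) d[i, nn[[i]][k]], numeric(1))
  # local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    reach <- pmax(k_dist[nn[[i]]], d[i, nn[[i]]])
    1 / mean(reach)
  }, numeric(1))
  # LOF: mean density of the neighbors relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[[i]]]) / lrd[i], numeric(1))
}

x <- c(0:9, 50)                 # evenly spaced points plus one distant point
lof <- lof_scores(x, k = 3)
flagged <- lof > (1 / 0.1 - 1)  # threshold 1/level - 1 with level = 0.1
which(flagged)
```

Points inside the evenly spaced run get LOF values close to 1, while the distant point gets a value well above the threshold of 9.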

metric

distance metric to use. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given, case insensitive.

q

the power of the Minkowski distance (relevant when metric is "minkowski").
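
The metric names listed above coincide with the methods accepted by stats::dist in base R, where the Minkowski power is passed as p. Whether flag() calls dist internally is an assumption; the sketch below only illustrates how the metrics differ on a simple pair of points.

```r
# Two points in 2D: coordinate differences are 3 and 4
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_euclid <- as.numeric(dist(pts, method = "euclidean"))         # sqrt(3^2 + 4^2) = 5
d_manhat <- as.numeric(dist(pts, method = "manhattan"))         # 3 + 4 = 7
d_max    <- as.numeric(dist(pts, method = "maximum"))           # max(3, 4) = 4
d_mink3  <- as.numeric(dist(pts, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3)

# dist() itself accepts unambiguous substrings of the method name
d_abbrev <- as.numeric(dist(pts, method = "man"))               # same as "manhattan"
```

With q = 2 the Minkowski distance reduces to the Euclidean one, and with q = 1 to the Manhattan one.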

na.propagate

logical that determines the flag for data points of x that contain NA. If TRUE, the flag will be NA; otherwise the point is flagged as a non-outlier.

Value

a logical or integer vector (depending on asInt) with one element per data point, containing 1 (TRUE) if the data point is an outlier and 0 (FALSE) if it is not. Depending on na.propagate, NA data points get the flag value NA or 0 (FALSE).
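
The way asInt and na.propagate shape the returned vector can be sketched in base R. The helper format_flags and the use of complete.cases to detect NA rows are assumptions made for illustration, not the package's code.

```r
# Hypothetical helper: turn raw outlier decisions into the documented output form.
format_flags <- function(is_outlier, has_na, asInt = TRUE, na.propagate = FALSE) {
  out <- as.logical(is_outlier)
  # rows containing NA become NA or "not an outlier", depending on na.propagate
  out[has_na] <- if (na.propagate) NA else FALSE
  if (asInt) as.integer(out) else out
}

x <- cbind(c(1, 2, NA), c(4, 5, 6))
has_na <- !complete.cases(x)          # TRUE for rows with at least one NA
raw <- c(TRUE, FALSE, NA)             # pretend decisions for the three rows

format_flags(raw, has_na)                       # 1 0 0
format_flags(raw, has_na, na.propagate = TRUE)  # 1 0 NA
format_flags(raw, has_na, asInt = FALSE)        # TRUE FALSE FALSE
```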


rushkin/outlier documentation built on Oct. 13, 2018, 10:48 a.m.