HDoutliers: Leland Wilkinson's _hdoutliers_ Algorithm for Outlier...


View source: R/HDoutliers.R

Description

Detects outliers using a probability model: an exponential distribution fitted to the upper tail of the nearest-neighbor distances.

Usage

HDoutliers(data, maxrows=10000, radius=NULL, alpha=0.05, transform=TRUE) 

Arguments

data

A vector, matrix, or data frame consisting of numeric and/or categorical variables.

maxrows

If the number of observations is greater than maxrows, HDoutliers reduces the number used in nearest-neighbor computations to a set of exemplars. The default value is 10000.

radius

Threshold for determining membership in the exemplars' lists (used only when the number of observations is greater than maxrows). An observation is added to an exemplar's list if its distance to that exemplar is less than radius. The default value is 0.1/(log n)^(1/p), where n is the number of observations and p is the dimension of the data.

alpha

Threshold for determining the cutoff for outliers. Observations are considered outliers if they fall in the (1 - alpha) tail of the distribution of the nearest-neighbor distances between exemplars.

transform

A logical variable indicating whether or not the data needs to be transformed to conform to Wilkinson's specifications before outlier detection. The default is to transform the data using function dataTrans.
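
The default radius formula can be evaluated directly to get a feel for its scale. The snippet below is a minimal sketch: default_radius is a hypothetical helper, not a function exported by the package, and it assumes that "log" in the formula is the natural logarithm (R's log()).

```r
# Hypothetical helper, not part of the HDoutliers package.
# Evaluates the documented default: 0.1 / (log n)^(1/p),
# assuming the natural logarithm.
default_radius <- function(n, p) {
  0.1 / log(n)^(1 / p)
}

# Radius for 100000 observations in 2 dimensions
r2 <- default_radius(n = 100000, p = 2)

# With n fixed, the default radius grows with the dimension p,
# because (log n)^(1/p) shrinks toward 1 when log n > 1
r10 <- default_radius(n = 100000, p = 10)
```

Since the radius is used only when the number of observations exceeds maxrows, this value has no effect on smaller data sets.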

Details

Wilkinson replaces categorical variables with the leading component from correspondence analysis, and maps the data to the unit square. This is done as a preprocessing step if transform = TRUE (the default).
If the number of observations exceeds maxrows, the data is first partitioned into lists associated with exemplars and their members within radius of each exemplar, to reduce the number of nearest-neighbor computations required for outlier detection.
An exponential distribution is then fitted to the upper tail of the nearest-neighbor distances between exemplars. Observations are considered outliers if they fall in the (1 - alpha) tail of the fitted CDF.
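
The partition into exemplars and member lists can be illustrated with a leader-style pass over the rows. This is a simplified sketch in base R, not the package's implementation: leader_partition is a hypothetical function, it assumes a numeric matrix already scaled to [0, 1] in each column, and it assigns each row to its nearest exemplar when that exemplar is closer than radius.

```r
# Simplified leader-style partition; NOT the package's own code.
# Assumes `x` is a numeric matrix with columns scaled to [0, 1].
leader_partition <- function(x, radius) {
  exemplars <- integer(0)  # row indexes chosen as exemplars
  members <- list()        # members[[k]]: rows assigned to exemplar k
  for (i in seq_len(nrow(x))) {
    if (length(exemplars) > 0) {
      # Euclidean distance from row i to every current exemplar
      diffs <- x[exemplars, , drop = FALSE] -
        matrix(x[i, ], nrow = length(exemplars), ncol = ncol(x), byrow = TRUE)
      d <- sqrt(rowSums(diffs^2))
      k <- which.min(d)
      if (d[k] < radius) {
        # Close enough: add row i to the nearest exemplar's list
        members[[k]] <- c(members[[k]], i)
        next
      }
    }
    # No exemplar within radius: row i becomes a new exemplar
    exemplars <- c(exemplars, i)
    members[[length(exemplars)]] <- i
  }
  list(exemplars = exemplars, members = members)
}

set.seed(1)
x <- matrix(runif(400), ncol = 2)
part <- leader_partition(x, radius = 0.1)
length(part$exemplars)  # number of exemplars, never more than nrow(x)
```

Each row ends up in exactly one list, and any two exemplars are at least radius apart, so nearest-neighbor distances need to be computed only between exemplars rather than between all rows.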

Value

The indexes of the observations determined to be outliers.
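
When using the returned indexes to drop outliers, negative indexing needs care. The sketch below uses a hypothetical result vector out rather than a real call to HDoutliers, and assumes that an empty vector is returned when no outliers are found.

```r
x <- matrix(1:10, ncol = 2)

# Hypothetical result: no observations flagged as outliers
out <- integer(0)

# Negative indexing with an empty vector selects nothing in R
dropped_all <- x[-out, , drop = FALSE]
nrow(dropped_all)  # 0

# Safer: compute the rows to keep explicitly
keep <- setdiff(seq_len(nrow(x)), out)
clean <- x[keep, , drop = FALSE]
nrow(clean)  # 5
```

The setdiff form behaves the same whether out is empty or not.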

References

Wilkinson, L. (2016). Visualizing Outliers.

See Also

getHDmembers, getHDoutliers, dataTrans

Examples

data(dots)
out.W <- HDoutliers(dots$W)
## Not run: 
plotHDoutliers(dots$W,out.W)
## End(Not run)

data(ex2D)
out.ex2D <- HDoutliers(ex2D)
## Not run: 
plotHDoutliers(ex2D,out.ex2D)
## End(Not run)

## Not run: 
n <- 100000 # number of observations
set.seed(3)
x <- matrix(rnorm(2*n),n,2)
nout <- 10 # number of outliers
x[sample(1:n,size=nout),] <- 10*runif(2*nout,min=-1,max=1)

out.x <- HDoutliers(x)
## End(Not run)

HDoutliers documentation built on Feb. 11, 2022, 5:10 p.m.