Compute (Skewness-adjusted) Multivariate Outlyingness

Share:

Description

For an n * p data matrix (or data frame) x, compute the “outlyingness” of all n observations. Outlyingness here is a generalization of the Donoho-Stahel outlyingness measure, where skewness is taken into account via the medcouple, mc().

Usage

1
2
3
4
adjOutlyingness(x, ndir = 250, clower = 4, cupper = 3,
                alpha.cutoff = 0.75, coef = 1.5,
                qr.tol = 1e-12, keep.tol = 1e-12,
                only.outlyingness = FALSE)

Arguments

x

a numeric matrix or data.frame.

ndir

positive integer specifying the number of directions that should be searched.

clower, cupper

the constant to be used for the lower and upper tails, in order to transform the data towards symmetry. You can set clower = 0, cupper = 0 to get the non-adjusted, i.e., classical (“central” or “symmetric”) outlyingness. In that case, mc() is not used.

alpha.cutoff

number in (0,1) specifying the quantiles (α, 1-α) which determine the “outlier” cutoff. The default, using quartiles, corresponds to the definition of the medcouple (mc), but there is no stringent reason for using the same alpha for the outlier cutoff.

coef

positive number specifying the factor with which the interquartile range (IQR) is multiplied to determine ‘boxplot hinges’-like upper and lower bounds.

qr.tol

positive tolerance to be used for qr and solve.qr for determining the ndir directions, each determined by a random sample of p (out of n) observations.

keep.tol

positive tolerance to determine which of the sample direction should be kept, namely only those for which ||x|| * ||B|| is larger than keep.tol.

only.outlyingness

logical indicating if the final outlier determination should be skipped. In that case, a vector is returned, see ‘Value:’ below.

Details

FIXME: Details in the comment of the Matlab code; also in the reference(s).

The method as described can be useful as preprocessing in FASTICA (http://www.cis.hut.fi/projects/ica/fastica/; see also the R package fastICA.

Value

If only.outlyingness is true, a vector adjout, otherwise, as by default, a list with components

adjout

numeric of length(n) giving the adjusted outlyingness of each observation.

cutoff

cutoff for “outlier” with respect to the adjusted outlyingnesses, and depending on alpha.cutoff.

nonOut

logical of length(n), TRUE when the corresponding observation is non-outlying with respect to the cutoff and the adjusted outlyingnesses.

Note

The result is random as it depends on the sample of ndir directions chosen.

Till Aug/Oct. 2014, the default values for clower and cupper were accidentally reversed, and the signs inside exp(.) where swapped in the (now corrected) two expressions

1
2
 tup <- Q3 + coef * IQR * exp(.... + clower * tmc * (tmc < 0))
 tlo <- Q1 - coef * IQR * exp(.... - cupper * tmc * (tmc < 0))

already in the code from Antwerpen (‘mcrsoft/adjoutlingness.R’), contrary to the published reference.

Further, the original algorithm had not been scale-equivariant in the direction construction, which has been amended in 2014-10 as well.

The results, including diagnosed outliers, therefore have changed, typically slightly, since robustbase version 0.92-0.

Author(s)

Guy Brys; help page and improvements by Martin Maechler

References

Brys, G., Hubert, M., and Rousseeuw, P.J. (2005) A Robustification of Independent Component Analysis; Journal of Chemometrics, 19, 1–12.

For the up-to-date reference, please consult http://wis.kuleuven.be/stat/robust

See Also

the adjusted boxplot, adjbox and the medcouple, mc.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
## An Example with bad condition number and "border case" outliers

dim(longley)
set.seed(1) ## result is random!
ao1 <- adjOutlyingness(longley)
## which are outlying ?
which(!ao1$nonOut) ## one: "1948" - for this seed! (often: none)
stopifnot(all(ao1$nonOut[-2]))

## An Example with outliers :

dim(hbk)
set.seed(1)
ao.hbk <- adjOutlyingness(hbk)
str(ao.hbk)
hist(ao.hbk $adjout)## really two groups
table(ao.hbk$nonOut)## 14 outliers, 61 non-outliers:
## outliers are :
which(! ao.hbk$nonOut) # 1 .. 14   --- but not for all random seeds!

## here, they are the same as found by (much faster) MCD:
cc <- covMcd(hbk)
stopifnot(all(cc$mcd.wt == ao.hbk$nonOut))

## This is revealing (about 1--2 cases, where outliers are *not* == 1:14
##  but needs almost 1 [sec] per call:
if(interactive()) {
  for(i in 1:30) {
    print(system.time(ao.hbk <- adjOutlyingness(hbk)))
    if(!identical(iout <- which(!ao.hbk$nonOut), 1:14)) {
	 cat("Outliers:\n"); print(iout)
    }
  }
}

## "Central" outlyingness: *not* calling mc()  anymore, since 2014-12-11:
trace(mc)
out <- capture.output(
  oo <- adjOutlyingness(hbk, clower=0, cupper=0)
)
untrace(mc)
stopifnot(length(out) == 0)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.