View source: R/adjoutlyingness.R
adjOutlyingness  R Documentation 
For an n \times p
data matrix (or data frame) x
,
compute the “outlyingness” of all n
observations.
Outlyingness here is a generalization of the DonohoStahel
outlyingness measure, where skewness is taken into account via the
medcouple, mc()
.
adjOutlyingness(x, ndir = 250, p.samp = p, clower = 4, cupper = 3,
IQRtype = 7,
alpha.cutoff = 0.75, coef = 1.5,
qr.tol = 1e12, keep.tol = 1e12,
only.outlyingness = FALSE, maxit.mult = max(100, p),
trace.lev = 0,
mcReflect = n <= 100, mcScale = TRUE, mcMaxit = 2*maxit.mult,
mcEps1 = 1e12, mcEps2 = 1e15,
mcTrace = max(0, trace.lev1))
x 
a numeric 
ndir 
positive integer specifying the number of directions that should be searched. 
p.samp 
the sample size to use for finding good random
directions, must be at least 
clower, cupper 
the constant to be used for the lower and upper
tails, in order to transform the data towards symmetry. You can set

IQRtype 
a number from 
alpha.cutoff 
number in (0,1) specifying the quantiles

coef 
positive number specifying the factor with which the
interquartile range ( 
qr.tol 
positive tolerance to be used for 
keep.tol 
positive tolerance to determine which of the sample
direction should be kept, namely only those for which

only.outlyingness 
logical indicating if the final outlier determination should be skipped. In that case, a vector is returned, see ‘Value:’ below. 
maxit.mult 
integer factor; 
trace.lev 
an integer, if positive allows to monitor the direction search. 
mcReflect 
passed as 
mcScale 
passed as 
mcMaxit 
passed as 
mcEps1 
passed as 
mcEps2 
passed as 
mcTrace 
passed as 
FIXME: Details in the comment of the Matlab code; also in the reference(s).
The method as described can be useful as preprocessing in FASTICA (http://research.ics.aalto.fi/ica/fastica/ see also the R package fastICA.
If only.outlyingness
is true, a vector adjout
,
otherwise, as by default, a list with components
adjout 
numeric of 
cutoff 
cutoff for “outlier” with respect to the adjusted
outlyingnesses, and depending on 
nonOut 
logical of 
If there are too many degrees of freedom for the projections, i.e., when
n \le 4p
, the current definition of adjusted outlyingness
is illposed, as one of the projections may lead to a denominator
(quartile difference) of zero, and hence formally an adjusted
outlyingness of infinity.
The current implementation avoids Inf
results, but will return
seemingly random adjout
values of around 10^{14}  10^{15}
which may
be completely misleading, see, e.g., the longley
data example.
The result is random as it depends on the sample of
ndir
directions chosen; specifically, to get sub samples the algorithm uses
sample.int(n, p.samp)
which from R version 3.6.0 depends on
RNGkind(*, sample.kind)
. Exact reproducibility of results
from R versions 3.5.3 and earlier, requires setting
RNGversion("3.5.0")
.
In any case, do use set.seed()
yourself
for reproducibility!
Till Aug/Oct. 2014, the default values for clower
and cupper
were
accidentally reversed, and the signs inside exp(.)
where swapped
in the (now corrected) two expressions
tup < Q3 + coef * IQR * exp(.... + clower * tmc * (tmc < 0)) tlo < Q1  coef * IQR * exp(....  cupper * tmc * (tmc < 0))
already in the code from Antwerpen (‘mcrsoft/adjoutlingness.R’), contrary to the published reference.
Further, the original algorithm had not been scaleequivariant in the direction construction, which has been amended in 201410 as well.
The results, including diagnosed outliers, therefore have changed, typically slightly, since robustbase version 0.920.
Guy Brys; help page and improvements by Martin Maechler
Brys, G., Hubert, M., and Rousseeuw, P.J. (2005) A Robustification of Independent Component Analysis; Journal of Chemometrics, 19, 1–12.
Hubert, M., Van der Veeken, S. (2008) Outlier detection for skewed data; Journal of Chemometrics 22, 235–246; \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1002/cem.1123")}.
For the uptodate reference, please consult https://wis.kuleuven.be/statdatascience/robust
the adjusted boxplot, adjbox
and the medcouple,
mc
.
## An Example with bad condition number and "border case" outliers
dim(longley) # 16 x 7 // set seed, as result is random :
set.seed(31)
ao1 < adjOutlyingness(longley, mcScale=FALSE)
## which are outlying ?
which(!ao1$nonOut) ## for this seed, two: "1956", "1957"; (often: none)
## For seeds 1:100, we observe (Linux 64b)
if(FALSE) {
adjO < sapply(1:100, function(iSeed) {
set.seed(iSeed); adjOutlyingness(longley)$nonOut })
table(nrow(longley)  colSums(adjO))
}
## #{outl.}: 0 1 2 3
## #{cases}: 74 17 6 3
## An Example with outliers :
dim(hbk)
set.seed(1)
ao.hbk < adjOutlyingness(hbk)
str(ao.hbk)
hist(ao.hbk $adjout)## really two groups
table(ao.hbk$nonOut)## 14 outliers, 61 nonoutliers:
## outliers are :
which(! ao.hbk$nonOut) # 1 .. 14  but not for all random seeds!
## here, they are(*) the same as found by (much faster) MCD:
## *) only "almost", since the 202305 change to covMcd()
cc < covMcd(hbk)
table(cc = cc$mcd.wt, ao = ao.hbk$nonOut)# one differ..:
stopifnot(sum(cc$mcd.wt != ao.hbk$nonOut) <= 1)
## This is revealing: About 12 cases, where outliers are *not* == 1:14
## (2023: ~ 1/8 [sec] per call)
if(interactive()) {
for(i in 1:30) {
print(system.time(ao.hbk < adjOutlyingness(hbk)))
if(!identical(iout < which(!ao.hbk$nonOut), 1:14)) {
cat("Outliers:\n"); print(iout)
}
}
}
## "Central" outlyingness: *not* calling mc() anymore, since 20141211:
trace(mc)
out < capture.output(
oo < adjOutlyingness(hbk, clower=0, cupper=0)
)
untrace(mc)
stopifnot(length(out) == 0)
## A rankdeficient case
T < tcrossprod(data.matrix(toxicity))
try(adjOutlyingness(T, maxit. = 20, trace.lev = 2)) # fails and recommends:
T. < fullRank(T)
aT < adjOutlyingness(T.)
plot(sort(aT$adjout, decreasing=TRUE), log="y")
plot(T.[,9:10], col = (1:2)[1 + (aT$adjout > 10000)])
## .. (not conclusive; directions are random, more 'ndir' makes a difference!)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.