mvBACON: BACON: Blocked Adaptive Computationally-Efficient Outlier...


Description

This function applies an outlier identification algorithm to the data in the x matrix [n x p], along the lines described by Hadi et al. for their BACON outlier procedure (see References).

Usage

mvBACON(x, collect = 4, m = min(collect * p, n * 0.5), alpha = 0.95,
        init.sel = c("Mahalanobis", "dUniMedian", "random", "manual"),
        man.sel, maxsteps = 100, allowSingular = FALSE, verbose = TRUE)
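
As a small illustration of the default for m shown in the signature above, the following sketch evaluates min(collect * p, n * 0.5) for the two data sets used in the Examples. The helper name default_m is hypothetical and not part of robustX; the dimensions (n = 47, p = 2 for starsCYG; n = 20, p = 6 for the six coleman columns) are taken from the Examples and their output.

```r
## Hypothetical helper (not part of robustX): evaluates the default
## expression for 'm' given in the Usage section.
default_m <- function(n, p, collect = 4) min(collect * p, n * 0.5)

default_m(n = 47, p = 2)  # starsCYG: min(8, 23.5) -> 8
default_m(n = 20, p = 6)  # coleman:  min(24, 10)  -> 10
```

These values agree with the "MV-BACON (subset no. 1)" lines of the example output (8 of 47 and 10 of 20).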

Arguments

x

numeric matrix (of dimension [n x p]), not supposed to contain missing values.

collect

a multiplication factor c used, when init.sel is not "manual", to define m, the size of the initial basic subset, as c * p; in practice, m <- min(p * collect, n/2).

m

integer in 1:n specifying the size of the initial basic subset; used only when init.sel is not "manual".

alpha

significance level for the chi-squared cutoff, used to define the next iteration's basic subset.

init.sel

character string, specifying the initial selection mode; implemented modes are:

"Mahalanobis"

based on Mahalanobis distances (default); the version V1 of the reference; affine invariant but not robust.

"dUniMedian"

based on the distances from the univariate medians; the version V2 of the reference; robust but not affine invariant.

"random"

based on a random selection, i.e., reproducible only via set.seed().

"manual"

based on manual selection; in this case, a vector man.sel containing the indices of the selected observations must be specified.

"Mahalanobis" and "dUniMedian" were proposed by Hadi and the other authors in the reference as versions ‘V_1’ and ‘V_2’, and "manual" was proposed there as well; "random" is provided in order to study the behaviour of BACON.

man.sel

only when init.sel == "manual", the indices of observations determining the initial basic subset (and m <- length(man.sel)).

maxsteps

maximal number of iteration steps.

allowSingular

logical indicating whether a solution should be sought even when no matrix of rank p is found.

verbose

logical indicating if messages are printed which trace progress of the algorithm.
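
The "dUniMedian" start described under init.sel can be sketched roughly as follows: compute each observation's Euclidean distance from the vector of coordinate-wise medians and keep the m closest observations. This is only a minimal sketch under that assumption; the helper init_dunimedian is hypothetical, and the actual mvBACON code may differ in details such as scaling or the handling of ties.

```r
## Hypothetical helper (not part of robustX): pick an initial basic
## subset of size m from the distances to the coordinate-wise medians.
init_dunimedian <- function(x, m) {
  x <- as.matrix(x)
  med <- apply(x, 2, median)              # univariate medians, one per column
  d <- sqrt(rowSums(sweep(x, 2, med)^2))  # Euclidean distance of each row to 'med'
  seq_len(nrow(x)) %in% order(d)[seq_len(m)]  # logical vector: TRUE = selected
}

## Toy data: four points on a line and one gross outlier (row 5)
x <- cbind(c(0, 1, 2, 3, 100), c(0, 1, 2, 3, -50))
init_dunimedian(x, m = 3)
## [1] FALSE  TRUE  TRUE  TRUE FALSE
```

Because medians are used, the gross outlier in row 5 does not pull the starting point towards itself, which is the sense in which this start is robust; since it relies on coordinate-wise quantities, it is not affine invariant.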

Value

a list with components

subset

logical vector of length n where the i-th entry is true iff the i-th observation is part of the final selection.

dis

numeric vector of length n with the (Mahalanobis) distances.

cov

p x p matrix, the corresponding robust estimate of covariance.
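
How the dis and subset components relate to the alpha cutoff can be illustrated with one simplified update step: estimate center and covariance from the current basic subset, compute Mahalanobis distances for all n observations, and keep those below a chi-squared cutoff. The helper bacon_step is hypothetical, and the cutoff used here (the square root of the upper alpha/n quantile of the chi-squared distribution with p degrees of freedom, without any small-sample correction factor) is an assumption of this sketch rather than the exact rule implemented in mvBACON.

```r
## Hypothetical helper (not part of robustX): one simplified BACON-style update.
bacon_step <- function(x, sel, alpha) {
  x <- as.matrix(x)
  n <- nrow(x); p <- ncol(x)
  xs <- x[sel, , drop = FALSE]
  center <- colMeans(xs)                    # location of the current subset
  covmat <- cov(xs)                         # covariance of the current subset
  dis <- sqrt(mahalanobis(x, center, covmat))  # distances for all n rows
  ## Assumption: plain chi-squared cutoff at upper-tail level alpha/n
  cutoff <- sqrt(qchisq(alpha / n, df = p, lower.tail = FALSE))
  list(subset = dis < cutoff, dis = dis, cov = covmat)
}

set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2), c(10, 10))  # row 101 is a planted outlier
sel <- rep(c(TRUE, FALSE), c(20, 81))                # arbitrary start: first 20 rows
res <- bacon_step(x, sel, alpha = 0.95)
which(!res$subset)                                   # rows not in the new subset
```

In the full procedure such a step would presumably be repeated until the subset no longer changes or maxsteps is reached, as the "MV-BACON (subset no. k)" trace lines in the example output suggest.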

Author(s)

Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1. Port to R, testing, etc., by Martin Maechler.

References

Billor, N., Hadi, A. S., and Velleman, P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279–298. doi: 10.1016/S0167-9473(99)00101-2

See Also

covMcd for a high-breakdown (but more computer intensive) method; BACON for a “generalization”, notably to regression.

Examples

 require(robustbase) # for example data and covMcd():
## simple 2D example :
 plot(starsCYG, main = "starsCYG  data  (n=47)")
 B.st <- mvBACON(starsCYG)
 points(starsCYG[ ! B.st$subset,], pch = 4, col = 2, cex = 1.5)
 stopifnot(identical(which(!B.st$subset), c(7L,9L,11L,14L,20L,30L,34L)))
 ## finds the clear outliers (and 3 "borderline")

 ## 'coleman' from pkg 'robustbase'
 coleman.x <- data.matrix(coleman[, 1:6])
 Cc <- covMcd (coleman.x) # truly robust
 summary(Cc) # -> 6 outliers (1,3,10,12,17,18)
 Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm??
 Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian")
 stopifnot(all.equal(Cb1, Cb2))
 Cb.r <- lapply(1:20, function(i) { set.seed(i)
                     mvBACON(coleman.x, init.sel="random", verbose=FALSE) })
 nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"]
 all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE
 ## --> BACON always  breaks down, i.e., does not see the outliers here
 ## breaks down even when manually starting with all the non-outliers:
 Cb.man <- mvBACON(coleman.x, init.sel = "manual",
                   man.sel = setdiff(1:20, c(1,3,10,12,17,18)))
 which( ! Cb.man$subset) # the outliers according to mvBACON : _none_

Example output

Loading required package: robustbase
MV-BACON (subset no. 1): 8 of 47 (17.02 %)
MV-BACON (subset no. 2): 26 of 47 (55.32 %)
MV-BACON (subset no. 3): 34 of 47 (72.34 %)
MV-BACON (subset no. 4): 41 of 47 (87.23 %)
MV-BACON (subset no. 5): 40 of 47 (85.11 %)
MV-BACON (subset no. 6): 40 of 47 (85.11 %)
Minimum Covariance Determinant (MCD) estimator approximation.
Method: Fast MCD(alpha=0.5 ==> h=13); nsamp = 500; (n,k)mini = (300,5)
Call:
covMcd(x = coleman.x)
Log(Det.):  1.558 

Robust Estimate of Location:
  salaryP   fatherWc    sstatus  teacherSc  motherLev          Y  
    2.615     43.302      2.805     24.766      6.271     34.733  
Robust Estimate of Covariance:
           salaryP  fatherWc  sstatus  teacherSc  motherLev        Y
salaryP     0.5131     9.193    2.115     1.2075     0.1364    2.586
fatherWc    9.1930  2866.913  918.993    26.4050    65.2060  621.151
sstatus     2.1152   918.993  408.775     8.0530    22.9589  266.202
teacherSc   1.2075    26.405    8.053     5.4481     0.5468   10.078
motherLev   0.1364    65.206   22.959     0.5468     1.6812   14.973
Y           2.5860   621.151  266.202    10.0776    14.9727  178.916

Eigenvalues:
[1] 3.319e+03 1.339e+02 8.687e+00 5.181e-01 2.193e-01 2.864e-02

Robust Distances: 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.8787   1.3510   1.8880  22.6200  16.6800 205.3000 
Robustness weights: 
 6 observations c(1,3,10,12,17,18) are outliers with |weight| = 0 ( < 0.005); 
 14 weights are ~= 1.
MV-BACON (subset no. 1): 10 of 20 (50 %)
MV-BACON (subset no. 2): 20 of 20 (100 %)
MV-BACON (subset no. 3): 20 of 20 (100 %)
MV-BACON (subset no. 1): 10 of 20 (50 %)
MV-BACON (subset no. 2): 20 of 20 (100 %)
MV-BACON (subset no. 3): 20 of 20 (100 %)
[1] TRUE
MV-BACON (subset no. 1): 14 of 20 (70 %)
MV-BACON (subset no. 2): 19 of 20 (95 %)
MV-BACON (subset no. 3): 20 of 20 (100 %)
MV-BACON (subset no. 4): 20 of 20 (100 %)
integer(0)

robustX documentation built on May 2, 2019, 5:16 p.m.