# mvBACON: BACON: Blocked Adaptive Computationally-Efficient Outlier...

In robustX: 'eXtra' / 'eXperimental' Functionality for Robust Statistics

## Description

This function applies an outlier identification algorithm to the data in the x array [n x p], following the BACON outlier procedure described by Hadi et al. (see References).

## Usage

```
mvBACON(x, collect = 4, m = min(collect * p, n * 0.5), alpha = 0.95,
        init.sel = c("Mahalanobis", "dUniMedian", "random", "manual"),
        man.sel, maxsteps = 100, allowSingular = FALSE, verbose = TRUE)
```

## Arguments

- `x`: numeric matrix (of dimension [n x p]), which must not contain missing values.
- `collect`: a multiplication factor c used, when `init.sel` is not `"manual"`, to define m, the size of the initial basic subset, as c * p; in practice, `m <- min(p * collect, n/2)`.
- `m`: integer in `1:n` specifying the size of the initial basic subset; used only when `init.sel` is not `"manual"`.
- `alpha`: significance level for the chisq cutoff, used to define the next iteration's basic subset.
- `init.sel`: character string specifying the initial selection mode; implemented modes are:
  - `"Mahalanobis"`: based on Mahalanobis distances (default); version V1 of the reference; affine invariant but not robust.
  - `"dUniMedian"`: based on the distances from the univariate medians; version V2 of the reference; robust but not affine invariant.
  - `"random"`: based on a random selection, i.e., reproducible only via `set.seed()`.
  - `"manual"`: based on manual selection; in this case, a vector `man.sel` containing the indices of the selected observations must be specified.

  `"Mahalanobis"` and `"dUniMedian"` were proposed by Hadi and the other authors in the reference as versions ‘V_1’ and ‘V_2’, as was `"manual"`, while `"random"` is provided in order to study the behaviour of BACON.
- `man.sel`: used only when `init.sel == "manual"`; the indices of the observations determining the initial basic subset (and `m <- length(man.sel)`).
- `maxsteps`: maximal number of iteration steps.
- `allowSingular`: logical indicating whether a solution should be sought also when no matrix of rank p is found.
- `verbose`: logical indicating whether messages tracing the progress of the algorithm are printed.
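As a rough illustration of how `collect`, `m` and the `"dUniMedian"` mode fit together, the following base R sketch picks the m observations closest to the coordinate-wise medians. The helper `init_subset_median()` is hypothetical and not part of robustX; it assumes plain Euclidean distance to the medians and ignores any special handling the package may apply (for example, for singular subsets).

```r
# Hypothetical helper, not part of robustX: picks an initial basic subset
# in the spirit of init.sel = "dUniMedian" (simplified sketch).
init_subset_median <- function(x, collect = 4) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  m <- min(collect * p, n * 0.5)           # documented default subset size
  med <- apply(x, 2, median)               # coordinate-wise medians
  d <- sqrt(rowSums(sweep(x, 2, med)^2))   # Euclidean distance to the medians
  order(d)[seq_len(floor(m))]              # indices of the m closest rows
}

set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),    # 20 regular points
           c(10, 10))                      # one gross outlier (row 21)
idx <- init_subset_median(x)
idx                                        # indices of the initial basic subset
```

Here n = 21 and p = 2, so the subset size is `min(4 * 2, 21 * 0.5)`, i.e. 8 observations. The same formula gives 8 for `starsCYG` (n = 47, p = 2) and 10 for the `coleman` data (n = 20, p = 6), which agrees with the first "subset no. 1" lines in the example output.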

## Value

a list with components

- `subset`: logical vector of length `n` where the `i`-th entry is true iff the `i`-th observation is part of the final selection.
- `dis`: numeric vector of length `n` with the (Mahalanobis) distances.
- `cov`: p x p matrix, the corresponding robust estimate of covariance.
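To make these three components concrete, here is a simplified sketch of a BACON-style iteration that returns a list of the same shape. It is not the robustX implementation: the function `bacon_sketch()` and its `init` argument are hypothetical, and the cutoff is assumed to be a plain chi-square quantile, without the correction factors used in the reference.

```r
# Simplified BACON-style iteration (sketch only; not the robustX code).
# Assumption: the cutoff is a plain chi-square quantile, without the
# correction factors described in the reference.
bacon_sketch <- function(x, init, alpha = 0.95, maxsteps = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  subset <- seq_len(n) %in% init              # logical vector of length n
  cutoff <- sqrt(qchisq(alpha, df = p))       # assumed cutoff for the distances
  for (step in seq_len(maxsteps)) {
    xs <- x[subset, , drop = FALSE]
    center <- colMeans(xs)                    # location of the basic subset
    covmat <- cov(xs)                         # covariance of the basic subset
    dis <- unname(sqrt(mahalanobis(x, center, covmat)))
    new_subset <- dis < cutoff                # next basic subset
    if (all(new_subset == subset)) break      # stop once the subset is stable
    subset <- new_subset
  }
  list(subset = subset, dis = dis, cov = covmat)
}

set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),       # 20 regular points
           c(10, 10))                         # one gross outlier (row 21)
res <- bacon_sketch(x, init = 1:8)
which(!res$subset)                            # observations flagged as outliers
```

With a result of this shape, `which(!res$subset)` lists the flagged observations, `res$dis` can be plotted to inspect how far each observation lies from the bulk, and `res$cov` is the covariance estimate based on the selected observations only.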

## Author(s)

Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1. Port to R, testing, etc., by Martin Maechler.

## References

Billor, N., Hadi, A. S., and Velleman, P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators. Computational Statistics and Data Analysis 34, 279–298. doi: 10.1016/S0167-9473(99)00101-2

## See Also

`covMcd` for a high-breakdown (but more computer intensive) method; `BACON` for a “generalization”, notably to regression.

## Examples

```
library(robustX)    # provides mvBACON()
require(robustbase) # for example data and covMcd():

## simple 2D example :
plot(starsCYG, main = "starsCYG data (n=47)")
B.st <- mvBACON(starsCYG)
points(starsCYG[ ! B.st$subset,], pch = 4, col = 2, cex = 1.5)
stopifnot(identical(which(!B.st$subset), c(7L,9L,11L,14L,20L,30L,34L)))
## finds the clear outliers (and 3 "borderline")

## 'coleman' from pkg 'robustbase'
coleman.x <- data.matrix(coleman[, 1:6])
Cc <- covMcd (coleman.x) # truly robust
summary(Cc) # -> 6 outliers (1,3,10,12,17,18)
Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm??
Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian")
stopifnot(all.equal(Cb1, Cb2))

## 20 random starts, each made reproducible via set.seed():
Cb.r <- lapply(1:20, function(i) {
    set.seed(i)
    mvBACON(coleman.x, init.sel="random", verbose=FALSE)
})
## compare all components except "steps" against the first run:
nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"]
all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE
## --> BACON always breaks down, i.e., does not see the outliers here

## breaks down even when manually starting with all the non-outliers:
Cb.man <- mvBACON(coleman.x, init.sel = "manual",
                  man.sel = setdiff(1:20, c(1,3,10,12,17,18)))
which( ! Cb.man$subset) # the outliers according to mvBACON : _none_
```

### Example output

```
Loading required package: robustbase
MV-BACON (subset no. 1): 8 of 47 (17.02 %)
MV-BACON (subset no. 2): 26 of 47 (55.32 %)
MV-BACON (subset no. 3): 34 of 47 (72.34 %)
MV-BACON (subset no. 4): 41 of 47 (87.23 %)
MV-BACON (subset no. 5): 40 of 47 (85.11 %)
MV-BACON (subset no. 6): 40 of 47 (85.11 %)
Minimum Covariance Determinant (MCD) estimator approximation.
Method: Fast MCD(alpha=0.5 ==> h=13); nsamp = 500; (n,k)mini = (300,5)
Call:
covMcd(x = coleman.x)
Log(Det.):  1.558

Robust Estimate of Location:
  salaryP   fatherWc    sstatus  teacherSc  motherLev          Y
    2.615     43.302      2.805     24.766      6.271     34.733
Robust Estimate of Covariance:
           salaryP  fatherWc  sstatus  teacherSc  motherLev        Y
salaryP     0.5131     9.193    2.115     1.2075     0.1364    2.586
fatherWc    9.1930  2866.913  918.993    26.4050    65.2060  621.151
sstatus     2.1152   918.993  408.775     8.0530    22.9589  266.202
teacherSc   1.2075    26.405    8.053     5.4481     0.5468   10.078
motherLev   0.1364    65.206   22.959     0.5468     1.6812   14.973
Y           2.5860   621.151  266.202    10.0776    14.9727  178.916

Eigenvalues:
 3.319e+03 1.339e+02 8.687e+00 5.181e-01 2.193e-01 2.864e-02

Robust Distances:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
  0.8787   1.3510   1.8880  22.6200  16.6800 205.3000
Robustness weights:
6 observations c(1,3,10,12,17,18) are outliers with |weight| = 0 ( < 0.005);
14 weights are ~= 1.
MV-BACON (subset no. 1): 10 of 20 (50 %)
MV-BACON (subset no. 2): 20 of 20 (100 %)
MV-BACON (subset no. 3): 20 of 20 (100 %)
MV-BACON (subset no. 1): 10 of 20 (50 %)
MV-BACON (subset no. 2): 20 of 20 (100 %)
MV-BACON (subset no. 3): 20 of 20 (100 %)
[1] TRUE
MV-BACON (subset no. 1): 14 of 20 (70 %)
MV-BACON (subset no. 2): 19 of 20 (95 %)
MV-BACON (subset no. 3): 20 of 20 (100 %)
MV-BACON (subset no. 4): 20 of 20 (100 %)
integer(0)
```

robustX documentation built on May 2, 2019, 5:16 p.m.