mvBACON: BACON: Blocked Adaptive Computationally-Efficient Outlier...

View source: R/BACON-alg.R

mvBACON R Documentation

BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators

Description

This function applies an outlier identification algorithm to the data in the x matrix [n x p], following the BACON outlier procedure described by Billor, Hadi and Velleman (2000).

Usage

mvBACON(x, collect = 4, m = min(collect * p, n * 0.5), alpha = 0.05,
        init.sel = c("Mahalanobis", "dUniMedian", "random", "manual", "V2"),
        man.sel, maxsteps = 100, allowSingular = FALSE, verbose = TRUE)

Arguments

x

numeric matrix (of dimension [n x p]), not supposed to contain missing values.

collect

a multiplication factor c, used when init.sel is not "manual", to define m, the size of the initial basic subset, as m := c * p; in practice, m <- min(p * collect, n/2).

m

integer in 1:n specifying the size of the initial basic subset; used only when init.sel is not "manual".

alpha

determines the cutoff value for the Mahalanobis distances (see details).

init.sel

character string, specifying the initial selection mode; implemented modes are:

"Mahalanobis"

based on Mahalanobis distances (default); the version V1 of the reference; affine invariant but not robust.

"dUniMedian"

based on the distances from the univariate medians; similar to the version V2 of the reference; robust but not affine invariant.

"random"

based on a random selection, i.e., reproducible only via set.seed().

"manual"

based on manual selection; in this case, a vector man.sel containing the indices of the selected observations must be specified.

"V2"

based on the Euclidean norm from the univariate medians; this is the version V2 of the reference; robust but not affine invariant.

"Mahalanobis" and "V2" where proposed by Hadi and the other authors in the reference as versions ‘V_1’ and ‘V_2’, as well as "manual", while "random" is provided in order to study the behaviour of BACON. Option "dUniMedian" is similar to "V2" and is due to U. Oetliker.

man.sel

only when init.sel == "manual", the indices of observations determining the initial basic subset (and m <- length(man.sel)).

maxsteps

maximal number of iteration steps.

allowSingular

logical indicating whether a solution should still be sought when no matrix of rank p is found.

verbose

logical indicating if messages are printed which trace progress of the algorithm.
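
As a rough illustration of how an initial basic subset could be formed, the following base-R sketch computes the default m from the Usage section and then picks the m observations closest (in Euclidean norm) to the coordinatewise medians, in the spirit of "V2". The helper names default_m and init_v2 are hypothetical and not part of robustX; the package's own implementation may differ in details such as rounding of m or the handling of ties.

```r
# Hypothetical helpers for illustration only; not part of robustX.

# Default size of the initial basic subset, as given in the Usage section.
default_m <- function(n, p, collect = 4) min(collect * p, n * 0.5)

# "V2"-like start: the m observations with the smallest Euclidean
# distance to the vector of coordinatewise (univariate) medians.
init_v2 <- function(x, m) {
  x <- as.matrix(x)
  med <- apply(x, 2, median)               # univariate medians
  d <- sqrt(rowSums(sweep(x, 2, med)^2))   # Euclidean norm of x_i - med
  sort(order(d)[seq_len(floor(m))])        # indices of the m closest rows
}

set.seed(1)
x <- matrix(rnorm(40 * 3), ncol = 3)
x[1, ] <- c(10, 10, 10)                    # one gross outlier
m <- default_m(nrow(x), ncol(x))           # min(4 * 3, 40 * 0.5)
idx <- init_v2(x, m)                       # candidate initial basic subset
```

Indices obtained this way could, for instance, be passed as man.sel together with init.sel = "manual" to compare different starting subsets.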

Details

Remarks on the tuning parameter alpha: Let \chi^2_p be a chi-square distributed random variable with p degrees of freedom (p is the number of variables; n is the number of observations). Denote the (1-\alpha) quantile by \chi^2_p(\alpha), e.g., \chi^2_p(0.05) is the 0.95 quantile. Following Billor et al. (2000), the cutoff value for the Mahalanobis distances is defined as \chi_p(\alpha/n) (the square root of \chi^2_p(\alpha/n)) times a correction factor c(n,p) that depends on n and p; they use \alpha = 0.05.
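
The chi-square part of this cutoff can be sketched in base R as follows. The helper chi_cutoff is hypothetical (not a robustX function), and the correction factor c(n,p) is deliberately left out, so the values below are the uncorrected cutoff only.

```r
# Chi part of the cutoff: square root of the (1 - alpha/n) quantile of
# a chi-square distribution with p degrees of freedom.
# 'chi_cutoff' is a hypothetical helper, not part of robustX; the
# correction factor c(n,p) from the reference is NOT applied here.
chi_cutoff <- function(n, p, alpha = 0.05) {
  sqrt(qchisq(alpha / n, df = p, lower.tail = FALSE))
}

n <- 47; p <- 2                       # dimensions of the starsCYG data
cut05 <- chi_cutoff(n, p)             # uncorrected cutoff for alpha = 0.05
cut01 <- chi_cutoff(n, p, alpha = 0.01)
cut01 > cut05                         # smaller alpha gives a larger cutoff
```

Because alpha is divided by n, the cutoff grows slowly with the sample size, which keeps the expected number of falsely nominated observations roughly constant.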

Value

a list with components

subset

logical vector of length n where the i-th entry is true iff the i-th observation is part of the final selection.

dis

numeric vector of length n with the (Mahalanobis) distances.

cov

p x p matrix, the corresponding robust estimate of covariance.
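
To make the relation between these components concrete, here is a minimal base-R sketch that recomputes a center, a covariance matrix and Mahalanobis distances from a logical subset vector. It assumes that plain mean and covariance estimates on the selected rows are used; subset_estimates is a hypothetical helper, and the values returned by mvBACON may be computed differently.

```r
# Hypothetical helper: estimates based on the rows flagged in 'subset',
# with distances then evaluated for ALL observations.
subset_estimates <- function(x, subset) {
  x <- as.matrix(x)
  xs <- x[subset, , drop = FALSE]
  center <- colMeans(xs)
  covmat <- cov(xs)
  dis <- sqrt(mahalanobis(x, center, covmat))  # mahalanobis() is squared
  list(center = center, cov = covmat, dis = dis)
}

set.seed(2)
x <- matrix(rnorm(30 * 2), ncol = 2)
x[30, ] <- c(8, -8)                   # planted outlier
sub <- c(rep(TRUE, 29), FALSE)        # pretend row 30 was not selected
est <- subset_estimates(x, sub)
which.max(est$dis)                    # index of the most distant observation
```

Since the planted outlier is excluded from the estimates, it cannot inflate the covariance and therefore stands out clearly in dis.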

Author(s)

Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1. Port to R, testing etc, by Martin Maechler; Init selection "V2" and correction of default alpha from 0.95 to 0.05, by Tobias Schoch, FHNW Olten, Switzerland.

References

Billor, N., Hadi, A. S., and Velleman, P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators; Computational Statistics and Data Analysis 34, 279–298. doi:10.1016/S0167-9473(99)00101-2

See Also

covMcd for a high-breakdown (but more computer intensive) method; BACON for a “generalization”, notably to regression.

Examples

 require(robustbase) # for example data and covMcd():
 ## simple 2D example :
 plot(starsCYG, main = "starsCYG  data  (n=47)")
 B.st <- mvBACON(starsCYG)
 points(starsCYG[ ! B.st$subset,], pch = 4, col = 2, cex = 1.5)
 stopifnot(identical(which(!B.st$subset), c(7L,11L,20L,30L,34L)))
 ## finds the 4 clear outliers (and 1 "borderline");
 ## it does not find obs. 14 which is an outlier according to covMcd(.)

 iniS <- setNames(, eval(formals(mvBACON)$init.sel)) # all initialization methods, incl "random"
 set.seed(123)
 Bs.st <- lapply(iniS[iniS != "manual"], function(s)
                 mvBACON(as.matrix(starsCYG), init.sel = s, verbose=FALSE))
 ii <- - match("steps", names(Bs.st[[1]]))
 Bs.s1 <- lapply(Bs.st, `[`, ii)
 stopifnot(exprs = {
    length(Bs.s1) >= 4
    length(unique(Bs.s1)) == 1 # all 4 methods give the same
 })

 ## Example where "dUniMedian" and "V2" differ :
 data(pulpfiber, package="robustbase")
 dU.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "dUniMedian")
 V2.plp <- mvBACON(as.matrix(pulpfiber), init.sel = "V2")
 (oU <- which(! dU.plp$subset))
 (o2 <- which(! V2.plp$subset))
 stopifnot(setdiff(o2, oU) %in% c(57L,58L,59L,62L))
 ## and 57, 58, 59, and 62 *are* outliers according to covMcd(.)

 ## 'coleman' from pkg 'robustbase'
 coleman.x <- data.matrix(coleman[, 1:6])
 Cc <- covMcd (coleman.x) # truly robust
 summary(Cc) # -> 6 outliers (1,3,10,12,17,18)
 Cb1 <- mvBACON(coleman.x) ##-> subset is all TRUE hmm??
 Cb2 <- mvBACON(coleman.x, init.sel = "dUniMedian")
 stopifnot(all.equal(Cb1, Cb2))
 ## try 20 different random starts:
 Cb.r <- lapply(1:20, function(i) { set.seed(i)
                     mvBACON(coleman.x, init.sel="random", verbose=FALSE) })
 nm <- names(Cb.r[[1]]); nm <- nm[nm != "steps"]
 all(eqC <- sapply(Cb.r[-1], function(CC) all.equal(CC[nm], Cb.r[[1]][nm]))) # TRUE
 ## --> BACON always  breaks down, i.e., does not see the outliers here
 
 ## breaks down even when manually starting with all the non-outliers:
 Cb.man <- mvBACON(coleman.x, init.sel = "manual",
                   man.sel = setdiff(1:20, c(1,3,10,12,17,18)))
 which( ! Cb.man$subset) # the outliers according to mvBACON : _none_

robustX documentation built on July 9, 2023, 3:07 p.m.