# BACON: BACON for Regression or Multivariate Covariance Estimation

In robustX: 'eXtra' / 'eXperimental' Functionality for Robust Statistics

## Description

BACON, short for ‘Blocked Adaptive Computationally-Efficient Outlier Nominators’, is a set of somewhat robust algorithms, implemented here for regression or multivariate covariance estimation.

`BACON()` always applies the multivariate (covariance estimation) algorithm, using `mvBACON(x)`; when `y` is not `NULL`, it adds a regression iteration phase, using the auxiliary `.lmBACON()` function.
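The two calling modes can be sketched as follows. This is only an illustration: the built-in `mtcars` data and the choice of columns are assumptions made for the example, not part of the package documentation, and the code runs only if robustX is installed.

```r
## Sketch only; the data set and column choice are hypothetical illustrations.
if (requireNamespace("robustX", quietly = TRUE)) {
  x <- as.matrix(mtcars[, c("wt", "hp")])   # n = 32 observations, p = 2 columns
  y <- mtcars$mpg

  ## Multivariate case: y is NULL, so the result of mvBACON() is returned
  fit_mv <- robustX::BACON(x, verbose = FALSE)

  ## Regression case: mvBACON() first, then the regression iteration phase
  fit_reg <- robustX::BACON(x, y, verbose = FALSE)
  str(fit_reg)
}
```

Setting `verbose = FALSE` suppresses the progress messages, which is convenient when only the returned list is of interest.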

## Usage

```
BACON(x, y = NULL, intercept = TRUE,
      m = min(collect * p, n * 0.5),
      init.sel = c("Mahalanobis", "dUniMedian", "random", "manual"),
      man.sel, init.fraction = 0, collect = 4,
      alpha = 0.95, maxsteps = 100, verbose = TRUE)

## *Auxiliary* function:
.lmBACON(x, y, intercept = TRUE,
         init.dis, init.fraction = 0, collect = 4,
         alpha = 0.95, maxsteps = 100, verbose = TRUE)
```

## Arguments

- `x`: a multivariate matrix of dimension [n x p], assumed to contain no missing values.
- `y`: the response (n vector) in the case of regression, or `NULL` for the multivariate case, where just the result of `mvBACON()` is returned.
- `intercept`: logical indicating if an intercept has to be used for the regression.
- `m`: integer in `1:n` specifying the size of the initial basic subset; used only when `init.sel` is not `"manual"`; see `mvBACON`.
- `init.sel`: character string specifying the initial selection mode; see `mvBACON`.
- `man.sel`: used only when `init.sel == "manual"`: the indices of the observations determining the initial basic subset (and `m <- length(man.sel)`).
- `init.dis`: the distances of the `x` matrix used for the initial subset, as determined by `mvBACON`.
- `init.fraction`: if this parameter is > 0, the tedious steps of selecting the initial subset are skipped and an initial subset of size n * init.fraction is chosen (the observations with the smallest distances).
- `collect`: numeric factor chosen by the user to define the size of the initial subset (p * collect).
- `alpha`: significance level.
- `maxsteps`: the maximal number of iteration steps (to prevent infinite loops).
- `verbose`: logical indicating if messages are printed which trace the progress of the algorithm.
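The sizing rules for the initial basic subset can be illustrated in base R. This is a minimal sketch on hypothetical toy data. It assumes classical Mahalanobis distances (mean and covariance of all rows) as the ordering criterion, which may differ from what `mvBACON` does internally, and the value `init.fraction = 0.25` is chosen only for illustration.

```r
## Sketch only: toy data and selection rule are assumptions, not package code.
set.seed(1)
n <- 40; p <- 2
x <- matrix(rnorm(n * p), n, p)
x[1:3, ] <- x[1:3, ] + 8            # three planted outliers

collect <- 4
m <- min(collect * p, n * 0.5)      # default of the `m` argument: min(8, 20) = 8

## Classical squared Mahalanobis distances of all rows (stats::mahalanobis)
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

## The m observations with the smallest distances form the initial subset
init_subset <- order(d2)[seq_len(m)]

## With init.fraction > 0 the size is n * init.fraction instead
init.fraction <- 0.25               # hypothetical value
m_frac <- n * init.fraction         # 40 * 0.25 = 10
init_subset_frac <- order(d2)[seq_len(m_frac)]
```

Because the planted outliers have the largest distances, neither initial subset contains rows 1 to 3.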

## Details

For the initial selection mode, `init.sel`, in particular, see its description in the `mvBACON` arguments list.

## Value

`BACON(x,y,..)` (for regression) returns a `list` with components

- `subset`: the observations (in `1:n`) forming a subset of “good”, supposedly outlier-free observations; in the example output this is a logical vector of length n.
- `tis`: the t[i](y[m],X[m]) of eq. (6) in the reference; the clean “basic subset” in the algorithm is defined as the observations i with the smallest |t[i]|, and the t[i] can be regarded as scaled prediction errors.
- `mv.dis`: the (final) discrepancies or distances of `mvBACON()`.
- `mv.subset`: the “good” subset from `mvBACON()`, used to start the regression iterations.

The example output additionally shows a component `steps`, a named integer vector with entries `mv` and `lm`.
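To make “scaled prediction errors” concrete, the following base-R sketch computes such quantities for a given basic subset. It assumes that eq. (6) has the usual form: the residual from the fit on the basic subset, divided by the residual scale times sqrt(1 - h[i]) for observations inside the subset and sqrt(1 + h[i]) for those outside, where h[i] = x[i]' (X[m]' X[m])^(-1) x[i]. The helper `scaled_pred_errors()` is hypothetical and is not the package's `GiveTis()`.

```r
## Hypothetical helper under the stated assumption about eq. (6).
scaled_pred_errors <- function(x, y, subset, intercept = TRUE) {
  X  <- if (intercept) cbind(1, x) else as.matrix(x)
  Xb <- X[subset, , drop = FALSE]            # rows of the basic subset
  XtX_inv <- solve(crossprod(Xb))            # (Xb' Xb)^(-1)
  beta  <- XtX_inv %*% crossprod(Xb, y[subset])
  res   <- drop(y - X %*% beta)              # residuals for ALL observations
  sigma <- sqrt(sum(res[subset]^2) / (nrow(Xb) - ncol(X)))
  h     <- rowSums((X %*% XtX_inv) * X)      # x_i' (Xb' Xb)^(-1) x_i
  res / (sigma * sqrt(ifelse(subset, 1 - h, 1 + h)))
}

## Toy data (hypothetical): a clean line plus one gross outlier
set.seed(42)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.1)
y[20] <- 30
basic <- seq_along(y) <= 10                  # logical basic subset
tis <- scaled_pred_errors(x, y, basic)
which.max(abs(tis))                          # the outlying observation
```

Under this assumption, the values for observations inside the basic subset coincide with the standardized residuals that `rstandard()` returns for `lm(y ~ x, subset = basic)`.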

## Note

“BACON” was also chosen in honor of Francis Bacon:

> Whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways.
>
> Francis Bacon (1620), Novum Organum II 29.

## Author(s)

Ueli Oetliker, Swiss Federal Statistical Office, for S-plus 5.1; 25.05.2001; modified six times till 17.6.2001.

Port to R, testing etc., by Martin Maechler. Daniel Weeks (at pitt.edu) proposed a fix to a long-standing buglet in `GiveTis()`, which computes the t[i]; the fix was further improved by Maechler for robustX version 1.2-3 (Feb. 2019).

## References

Billor, N., Hadi, A. S., and Velleman, P. F. (2000). BACON: Blocked Adaptive Computationally-Efficient Outlier Nominators. Computational Statistics and Data Analysis 34, 279–298. doi: 10.1016/S0167-9473(99)00101-2

## See Also

`mvBACON`, the multivariate version of the BACON algorithm.

## Examples

```
library(robustX)  # provides BACON()
data(starsCYG, package = "robustbase")
## Plot simple data and fitted lines
plot(starsCYG)
lmST <- lm(log.light ~ log.Te, data = starsCYG)
abline(lmST, col = "gray") # least squares line
str(B.ST <- with(starsCYG, BACON(x = log.Te, y = log.light)))
## 'subset': A good set of points (to determine regression):
colB <- adjustcolor(2, 1/2)
points(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset,
       pch = 19, cex = 1.5, col = colB)
## A BACON-derived line:
lmB <- lm(log.light ~ log.Te, data = starsCYG, subset = B.ST$subset)
abline(lmB, col = colB, lwd = 2)

require(robustbase)
(RlmST <- lmrob(log.light ~ log.Te, data = starsCYG))
abline(RlmST, col = "blue")
```

### Example output

```
rank(ordered.x[1:m,] >= p  ==> chosen m =  4
MV-BACON (subset no. 1): 4 of 47 (8.51 %)
MV-BACON (subset no. 2): 5 of 47 (10.64 %)
MV-BACON (subset no. 3): 5 of 47 (10.64 %)
Reg-BACON (init subset no. 0): 8 of 47 (17.02 %)
Reg-BACON (init subset no. 0): 3 of 47 (6.38 %)
Reg-BACON (init subset no. 1): 4 of 47 (8.51 %)
Reg-BACON (init subset no. 2): 5 of 47 (10.64 %)
Reg-BACON (init subset no. 3): 6 of 47 (12.77 %)
Reg-BACON (init subset no. 4): 7 of 47 (14.89 %)
Reg-BACON (init subset no. 5): 8 of 47 (17.02 %)
Reg-BACON (subset no. 1): 8 of 47 (17.02 %)
List of 5
 $ subset   : logi [1:47] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ tis      : num [1:47] 7.2 3.39 8.26 3.39 11.38 ...
 $ mv.subset: logi [1:47] FALSE FALSE FALSE FALSE TRUE FALSE ...
 $ mv.dis   : num [1:47] 17.44 59.93 7.16 59.93 1.79 ...
 $ steps    : Named int [1:2] 3 1
  ..- attr(*, "names")= chr [1:2] "mv" "lm"