# BACON-EEM Algorithm for multivariate outlier detection in incomplete multivariate survey data

### Description

BEM starts from a set of uncontaminated data with possible missing values and applies a version of the EM-algorithm to estimate the center and scatter of the good data. It then adds observations whose Mahalanobis distance is below a threshold to the good data (or deletes those whose distance is above it). This process iterates until the good data remain stable. Observations not among the good data are outliers.
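The iteration described above can be sketched in base R for the simplified case of complete, unweighted data. This is only an illustration of the add/delete loop, not the code of `BEM`: the helper name `bacon_sketch` is hypothetical, the EM step for missing values is left out, and the cutpoint is a plain chi-square quantile without any standardisation.

```
# Minimal BACON-style iteration on complete data (no EM step, no weights).
# bacon_sketch is a hypothetical helper, not part of any package.
bacon_sketch <- function(x, c0 = 3, alpha = 0.01, max_iter = 50) {
  x <- unname(as.matrix(x))
  n <- nrow(x)
  p <- ncol(x)
  # Starting subset: the c0 * p points closest to the componentwise median
  med <- apply(x, 2, median)
  d0 <- rowSums(sweep(x, 2, med)^2)
  good <- seq_len(n) %in% order(d0)[seq_len(min(n, c0 * p))]
  # Assumption: plain chi-square quantile, no correction factor
  cutpoint <- qchisq(1 - alpha, df = p)
  for (iter in seq_len(max_iter)) {
    center <- colMeans(x[good, , drop = FALSE])
    scatter <- cov(x[good, , drop = FALSE])
    md2 <- mahalanobis(x, center, scatter)  # squared distances
    new_good <- md2 < cutpoint
    if (identical(new_good, good)) break    # good subset is stable
    good <- new_good
  }
  list(outind = !good, dist = sqrt(md2), iterations = iter)
}

# Toy data: 100 clean points plus 5 shifted points in rows 101 to 105
set.seed(1)
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(10, mean = 8), ncol = 2))
res <- bacon_sketch(x)
which(res$outind)
```

The stopping rule mirrors the description: the loop ends once the set of good observations no longer changes between two passes.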

### Usage

1 2 |

### Arguments

`data` |
a matrix or data frame. As usual, rows are observations and columns are variables. |

`weights` |
a non-negative and non-zero vector of weights for each observation.
Its length must equal the number of rows of the data. Default is |

`v` |
an integer indicating the distance for the definition of the starting good subset: v=1 uses the Mahalanobis distance based on the weighted mean and covariance, v=2 uses the Euclidean distance from the componentwise median |

`c0` |
the size of initial subset is c0*ncol(data). |

`alpha` |
a small probability indicating the level |

`md.type` |
Type of Mahalanobis distance: "m" marginal, "c" conditional |

`em.steps.start` |
Number of iterations of EM-algorithm for starting good subset |

`em.steps.loop` |
Number of iterations of EM-algorithm for good subset |

`better.estimation` |
If |

`monitor` |
If |
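
The roles of `v` and `c0` in forming the starting good subset can be illustrated with a small base R sketch. The helper `start_subset` is hypothetical and only mimics the description of the arguments; it assumes complete data and, for simplicity, uses an unweighted componentwise median for `v = 2`.

```
# Hypothetical helper illustrating the two starting distances; not package code.
start_subset <- function(x, weights = rep(1, nrow(x)), v = 2, c0 = 3) {
  x <- as.matrix(x)
  if (v == 1) {
    # Mahalanobis distance based on the weighted mean and covariance
    wc <- cov.wt(x, wt = weights / sum(weights))
    d <- mahalanobis(x, wc$center, wc$cov)
  } else {
    # Squared Euclidean distance from the componentwise median
    # (unweighted here, which is a simplifying assumption)
    med <- apply(x, 2, median)
    d <- rowSums(sweep(x, 2, med)^2)
  }
  # Keep the c0 * ncol(x) observations with the smallest distance
  order(d)[seq_len(min(nrow(x), c0 * ncol(x)))]
}

set.seed(2)
x <- matrix(rnorm(60), ncol = 3)
start_v1 <- start_subset(x, v = 1)
start_v2 <- start_subset(x, v = 2)
```

With three variables and `c0 = 3`, both calls return the indices of nine observations, though not necessarily the same nine.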

### Details

The BACON algorithm with `v=1` is not robust but affine equivariant, while `v=2` is robust but not affine equivariant. The threshold for the (squared) Mahalanobis distances, beyond which an observation is an outlier, is a standardised chi-square quantile at `(1-alpha)`. For large data sets it may be better to choose `alpha/n` instead.
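
To see why `alpha/n` may be preferable for large data sets, the two thresholds can be compared directly. The snippet below is a rough illustration that uses the plain chi-square quantile from `qchisq`; any standardisation applied inside `BEM` is not reproduced, and the values of `p`, `n` and `alpha` are made up.

```
p <- 5        # number of variables (illustrative)
n <- 1000     # number of observations (illustrative)
alpha <- 0.01

# Cutpoint for squared Mahalanobis distances at level 1 - alpha
cut_alpha <- qchisq(1 - alpha, df = p)

# Cutpoint when alpha is divided by the number of observations
cut_alpha_n <- qchisq(1 - alpha / n, df = p)

c(alpha = cut_alpha, alpha_over_n = cut_alpha_n)

# Expected number of clean observations beyond cut_alpha
# if the distances really followed a chi-square distribution
n * alpha
```

With a fixed `alpha`, the expected number of clean observations flagged grows with `n`, whereas `alpha/n` keeps the expected number of false flags at about `alpha` for the whole data set.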

The internal function `.EM.normal` is usually called from `BEM`. `.EM.normal` implements the EM-algorithm in such a way that part of the calculations can be saved and reused in the BEM algorithm. `.EM.normal` does not compute the observed sufficient statistics; these are computed in the main program of `BEM` and passed as parameters, together with the statistics on the missingness patterns.
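
As a rough idea of what statistics on missingness patterns and observed values can look like, the following base R sketch tabulates the patterns of a small matrix. It is an assumption about the kind of quantities involved, not a reproduction of what `BEM` passes to `.EM.normal`; all object names are made up for the illustration.

```
x <- matrix(c(1, NA, 3,
              4,  5, NA,
              7, NA, 9,
              1,  2, 3), ncol = 3, byrow = TRUE)

# TRUE where a value is missing
miss <- is.na(x)

# One string per row: "1" marks a missing variable, "0" an observed one
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ""))

# Number of observations sharing each missingness pattern
pattern_counts <- table(pattern)

# Simple observed-data statistics per variable
obs_n <- colSums(!miss)              # number of observed values
obs_sums <- colSums(x, na.rm = TRUE) # sums over the observed values
```

Grouping rows by pattern allows the conditional expectations of an EM step to be computed once per pattern rather than once per observation.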

### Value

`BEM` returns a list whose first component is the sub-list `output` with the following components:

`sample.size` |
Number of observations |

`discarded.observations` |
Number of discarded observations |

`number.of.variables` |
Number of variables |

`significance.level` |
the probability used for the cutpoint, i.e.\ |

`initial.basic.subset.size` |
Size of initial good subset |

`final.basic.subset.size` |
Size of final good subset |

`number.of.iterations` |
Number of iterations of the BACON step |

`computation.time` |
Elapsed computation time |

`center` |
Final estimate of the center |

`scatter` |
Final estimate of the covariance matrix |

`cutpoint` |
The threshold MD-value for the cut-off of outliers |

The further components returned by `BEM` are:

`outind` |
Outlier indicator |

`dist` |
Final Mahalanobis distances |

### Note

BEM uses an adapted version of the EM-algorithm in the function `.EM.normal`.

### Author(s)

Beat Hulliger

### References

Béguin, C. and Hulliger, B. (2008). The BACON-EEM Algorithm for Multivariate Outlier Detection
in Incomplete Survey Data, *Survey Methodology*, Vol. 34, No. 1, pp. 91-103.

Billor, N., Hadi, A.S. and Velleman, P.F. (2000). BACON: Blocked
Adaptive Computationally-efficient Outlier Nominators. *Computational Statistics and Data Analysis*,
34(3), 279-298.

Schafer, J.L. (2000).
*Analysis of Incomplete Multivariate Data*, Monographs on Statistics and Applied Probability 72,
Chapman & Hall.

### Examples

```
# Bushfire data set with 20% MCAR
data(bushfirem, bushfire.weights)
bem.res <- BEM(bushfirem, bushfire.weights, alpha = (1 - 0.01 / nrow(bushfirem)))
print(bem.res$output)
```
