# EAdet: Epidemic Algorithm for detection of multivariate outliers in...

In modi: Multivariate outlier detection and imputation for incomplete survey data

## Description

In `EAdet` an epidemic is started at a center of the data. The epidemic spreads out and infects neighbouring points (probabilistically or deterministically). The last points infected are outliers. After running `EAdet` an imputation with `EAimp` may be run.

## Usage

```r
EAdet(data, weights, reach = "max", transmission.function = "root",
      power = ncol(data), distance.type = "euclidean", maxl = 5,
      plotting = TRUE, monitor = FALSE, prob.quantile = 0.9,
      random.start = FALSE, fix.start, threshold = FALSE,
      deterministic = TRUE, rm.missobs = FALSE, verbose = FALSE)
```

## Arguments

| Argument | Description |
| --- | --- |
| `data` | A data frame or matrix with the data. |
| `weights` | A vector of positive sampling weights. |
| `reach` | If `reach = "max"`, the maximal nearest neighbour distance is used as the basis for the transmission function; otherwise the weighted `(1-(p+1)/n)` quantile of the nearest neighbour distances is used. |
| `transmission.function` | Form of the transmission function of distance `d`: `"step"` is a Heaviside function which jumps to `1` at `d0`; `"linear"` is linear between `0` and `d0`; `"power"` is `(beta*d+1)^(-p)`, with `p = ncol(data)` as default; `"root"` is the function `1-(1-d/d0)^(1/maxl)`. |
| `power` | Sets `p = power`. |
| `distance.type` | Distance type in function `dist()`. |
| `maxl` | Maximum number of steps without infection. |
| `plotting` | If `TRUE`, the cdf of infection times is plotted. |
| `monitor` | If `TRUE`, verbose output on the epidemic is produced. |
| `prob.quantile` | If the MADs fail, this quantile of the absolute deviations is used instead. |
| `random.start` | If `TRUE`, take a starting point at random instead of the spatial median. |
| `fix.start` | Force the epidemic to start at a specific observation. |
| `threshold` | Infect all remaining points with infection probability above the threshold `1-0.5^(1/maxl)`. |
| `deterministic` | If `TRUE`, the number of infections is the expected number and the infected observations are the ones with the largest infection probabilities. |
| `rm.missobs` | Set `rm.missobs = TRUE` if completely missing observations should be discarded. This has to be done actively as a safeguard to avoid mismatches when imputing. |
| `verbose` | More output with `verbose = TRUE`. |
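To get a feel for the shapes involved, the expressions quoted for `transmission.function` and `threshold` can be evaluated directly in base R. This is only a sketch of the formulas as written on this page, not the package's internal code: the helper names (`tf.step`, `tf.power`, `tf.root`, `tf.threshold`) are hypothetical, `beta` is not defined here and is treated as a free parameter, the step is assumed to include `d0` itself, and distances beyond `d0` are assumed to be clamped. The `"linear"` form is left out because its exact expression is not given.

```r
# Expressions as literally written in the argument table (hypothetical helpers).
# d: distance(s), d0: transmission distance.

# Heaviside step; assumption: the jump to 1 happens at d == d0 inclusive.
tf.step <- function(d, d0) as.numeric(d >= d0)

# beta is not defined on this page; it is a free parameter in this sketch.
tf.power <- function(d, beta, p) (beta * d + 1)^(-p)

# Assumption: distances beyond d0 are clamped to d0 so the root stays real.
tf.root <- function(d, d0, maxl = 5) 1 - (1 - pmin(d, d0) / d0)^(1 / maxl)

# Threshold quoted for the `threshold` argument.
tf.threshold <- function(maxl = 5) 1 - 0.5^(1 / maxl)

d <- seq(0, 2, by = 0.5)
tf.values <- data.frame(
  d     = d,
  step  = tf.step(d, d0 = 2),
  power = tf.power(d, beta = 1, p = 2),
  root  = tf.root(d, d0 = 2, maxl = 5)
)
print(tf.values)
print(tf.threshold(5))
```

Note that the threshold `p = 1-0.5^(1/maxl)` is exactly the per-step value for which `1-(1-p)^maxl` equals `0.5`.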

## Details

The form and parameters of the transmission function should be chosen such that the infection times have a range of at least 10. The default cutpoint for declaring outliers is the median infection time plus three times the MAD of the infection times. A better cutpoint may be chosen by visual inspection of the cdf of infection times.
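The default cutpoint rule can be illustrated on a toy vector of infection times. This is a minimal sketch, assuming unweighted base R `median()` and `mad()` (which applies the usual consistency constant 1.4826); the package may compute these quantities differently, for instance with sampling weights. The data are made up for illustration.

```r
# Made-up infection times; the last two observations are infected late.
infection.time <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 15)

# Range of infection times (the text recommends at least 10).
time.range <- diff(range(infection.time))

# Default rule from the text: median plus three times the MAD.
cutpoint <- median(infection.time) + 3 * mad(infection.time)

# Observations infected after the cutpoint are flagged as outliers.
outind <- infection.time > cutpoint
print(which(outind))

# Empirical cdf of infection times; plot(Fn) would show it for visual inspection.
Fn <- ecdf(infection.time)
print(Fn(cutpoint))
```

Calling `plot(Fn)` and looking for a long flat stretch before the last jumps is one way to pick a different cutpoint by eye.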

`EAdet` calls the function `EA.dist`, which passes the counterprobabilities of infection (a vector of size n*(n-1)/2!) and three parameters (the index of the sample spatial median, the maximal distance to the nearest neighbour, and the transmission distance, i.e. the reach) as arguments to `EA.det`. The distance vector may be too large to be passed as an argument; in that case the memory size must be increased. Former versions of the code used a global variable to store the distances in order to save memory.
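The quadratic growth of that vector is easy to underestimate. The sketch below computes the number of pairwise distances and a rough memory footprint for a few sample sizes. It assumes 8 bytes per value (R doubles); the storage type used internally by the package is not stated here, and the helper names `n.pairs` and `size.mb` are hypothetical.

```r
# Number of pairwise distances among n observations.
n.pairs <- function(n) n * (n - 1) / 2

# Approximate size in MB; assumption: 8 bytes per stored value.
size.mb <- function(n, bytes.per.value = 8) {
  n.pairs(n) * bytes.per.value / 1024^2
}

# The lower-triangle vector returned by dist() has exactly this length.
x <- matrix(rnorm(20), nrow = 10)
d <- dist(x, method = "euclidean")
print(length(d) == n.pairs(10))

# Rough footprint for increasing sample sizes.
sizes <- c(1000, 10000, 50000)
print(data.frame(n = sizes, pairs = n.pairs(sizes), MB = round(size.mb(sizes))))
```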

## Value

`EAdet` returns a list whose first component `output` is a sub-list with the following components:

| Component | Description |
| --- | --- |
| `sample.size` | Number of observations |
| `discarded.observations` | Indices of discarded observations |
| `missing.observations` | Indices of completely missing observations |
| `number.of.variables` | Number of variables |
| `n.complete.records` | Number of records without missing values |
| `n.usable.records` | Number of records with less than half of values missing (unusable observations are discarded) |
| `medians` | Component-wise medians |
| `mads` | Component-wise MADs |
| `prob.quantile` | Quantile used if the MADs fail, i.e. if one of the MADs is 0 |
| `quantile.deviations` | Quantile of absolute deviations |
| `start` | Starting observation |
| `transmission.function` | Input parameter |
| `power` | Input parameter |
| `maxl` | Maximum number of steps without infection |
| `min.nn.dist` | Maximal nearest neighbor distance |
| `transmission.distance` | `d0` |
| `threshold` | Input parameter |
| `distance.type` | Input parameter |
| `deterministic` | Input parameter |
| `number.infected` | Number of infected observations |
| `cutpoint` | Cutpoint of infection times for outlier definition |
| `number.outliers` | Number of outliers |
| `outliers` | Indices of outliers |
| `duration` | Duration of epidemic |
| `computation.time` | Elapsed computation time |
| `initialisation.computation.time` | Elapsed computation time for standardisation and calculation of the distance matrix |

The further components returned by `EAdet` are:

| Component | Description |
| --- | --- |
| `infected` | Indicator of infection |
| `infection.time` | Time of infection |
| `outind` | Indicator of outliers |

## Author(s)

Beat Hulliger

## References

Béguin, C., and Hulliger, B. (2004). Multivariate outlier detection in incomplete survey data: The epidemic algorithm and transformed rank correlations. Journal of the Royal Statistical Society, Series A, 167(2), 275-294.

## See Also

`EAimp` for imputation with the Epidemic Algorithm.
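
## Examples

```r
library(modi)  # provides EAdet and the bushfire data sets

# Load the bushfire data with missing values and the sampling weights
data(bushfirem, bushfire.weights)

# Run the epidemic algorithm and print the summary sub-list
det.res <- EAdet(bushfirem, bushfire.weights)
print(det.res$output)
```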