Epidemic Algorithm for detection of multivariate outliers in incomplete survey data.
Description
In EAdet, an epidemic is started at a center of the data. The epidemic spreads out and infects neighbouring points (probabilistically or deterministically). The last points infected are outliers. After running EAdet, an imputation with EAimp may be run.
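The spreading idea can be illustrated with a deliberately simplified toy in base R. This is only a sketch under stated assumptions, not the EAdet implementation: the function name toy.epidemic is hypothetical, the start is taken as the point closest to the coordinate-wise median, and a point is infected deterministically as soon as it lies within a fixed distance reach of any infected point. Weights, missing values and the transmission function are ignored.

```r
# Toy deterministic epidemic (hypothetical helper, not part of any package).
# Returns the step at which each point is infected; NA = never infected.
toy.epidemic <- function(x, reach) {
  x <- as.matrix(x)
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  center <- apply(x, 2, median)                # assumed center: coordinate-wise median
  start <- which.min(rowSums(sweep(x, 2, center)^2))
  infection.time <- rep(NA_integer_, nrow(x))
  infection.time[start] <- 1L
  step <- 1L
  repeat {
    infected <- !is.na(infection.time)
    if (all(infected)) break
    # uninfected points within 'reach' of at least one infected point
    near <- apply(d[!infected, infected, drop = FALSE] <= reach, 1, any)
    if (!any(near)) break                      # epidemic has died out
    step <- step + 1L
    infection.time[which(!infected)[near]] <- step
  }
  infection.time
}

x <- cbind(c(0, 1, 2, 3, 10))                  # last point is far from the rest
res <- toy.epidemic(x, reach = 1)
print(res)
```

In this toy data the isolated point at 10 is never reached, while the remaining points are infected in order of their distance from the start. Late or missing infection is what marks a point as a candidate outlier.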
Usage
Arguments
data 
a data frame or matrix with the data 
weights 
a vector of positive sampling weights 
reach 
if 
transmission.function 
form of the transmission function of distance 
power 
sets 
distance.type 
distance type in function 
maxl 
Maximum number of steps without infection 
plotting 
if 
monitor 
if 
prob.quantile 
If the mads fail (i.e. one of the mads is 0), take this quantile of the absolute deviations instead 
random.start 
If 
fix.start 
Force epidemic to start at a specific observation 
threshold 
Infect all remaining points with infection probability above the threshold 
deterministic 
if 
rm.missobs 
Set 
verbose 
More output with 
Details
The form and parameters of the transmission function should be chosen such that the infection times have a range of at least 10. The default cutpoint for declaring outliers is the median infection time plus three times the mad (median absolute deviation) of the infection times. A better cutpoint may be chosen by visual inspection of the cdf of the infection times.
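The default cutpoint rule can be sketched in base R as follows. The infection times are made-up illustrative values, and the sketch assumes the unweighted stats::mad with its default scaling constant; whether EAdet applies sampling weights or a different scaling at this step is not stated here.

```r
# Hypothetical infection times; in practice these would be taken from
# the infection.time component returned by EAdet.
infection.time <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 14, 15)

# Range of the infection times (the Details recommend at least 10)
time.range <- diff(range(infection.time, na.rm = TRUE))

# Cutpoint: median infection time plus three times the mad
# (assumption: stats::mad with its default constant 1.4826)
cutpoint <- median(infection.time, na.rm = TRUE) +
  3 * mad(infection.time, na.rm = TRUE)

# Observations infected later than the cutpoint
outliers <- which(infection.time > cutpoint)

# Empirical cdf of the infection times; plot(Fn) shows it for visual inspection
Fn <- ecdf(infection.time)
```

A long flat stretch in the plotted cdf followed by a few late jumps suggests where a manual cutpoint could be placed.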
EAdet calls the function EA.dist, which passes the counterprobabilities of infection (a vector of size n*(n-1)/2!) and three parameters (sample spatial median index, maximal distance to nearest neighbor and transmission distance = reach) as arguments to EA.det. The vector of distances may be too large to be passed as an argument. In that case the memory size must be increased. Former versions of the code used a global variable to store the distances in order to save memory.
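To judge in advance whether the pairwise vector is likely to fit in memory, its size can be estimated from the number of observations. The helper below is a hypothetical sketch; it assumes the vector is stored as double precision numbers (8 bytes each) and ignores any additional copies made during the computation.

```r
# Hypothetical helper: number of pairs and approximate storage in GiB
# for a vector of n * (n - 1) / 2 doubles.
pair.vector.size <- function(n) {
  n.pairs <- n * (n - 1) / 2
  c(n.pairs = n.pairs, GiB = 8 * n.pairs / 1024^3)
}

size.10k <- pair.vector.size(10000)
print(size.10k)

# Cross-check on a small example: dist() stores exactly n * (n - 1) / 2 values
small <- matrix(seq_len(10), nrow = 5)
n.small <- length(dist(small))
```

Because the size grows quadratically, doubling the number of observations roughly quadruples the memory needed.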
Value
EAdet returns a list whose first component output is a sublist with the following components:
sample.size 
Number of observations 
discarded.observations 
Indices of discarded observations 
missing.observations 
Indices of completely missing observations 
number.of.variables 
Number of variables 
n.complete.records 
Number of records without missing values 
n.usable.records 
Number of records with less than half of values missing (unusable observations are discarded) 
medians 
Component wise medians 
mads 
Component wise mads 
prob.quantile 
Use this quantile if mads fail, i.e. if one of the mads is 0. 
quantile.deviations 
Quantile of absolute deviations. 
start 
Starting observation 
transmission.function 
Input parameter 
power 
Input parameter 
maxl 
Maximum number of steps without infection 
min.nn.dist 
Maximal nearest neighbor distance 
transmission.distance 

threshold 
Input parameter 
distance.type 
Input parameter 
deterministic 
Input parameter 
number.infected 
Number of infected observations 
cutpoint 
Cutpoint of infection times for outlier definition 
number.outliers 
Number of outliers 
outliers 
Indices of outliers 
duration 
Duration of epidemic 
computation.time 
Elapsed computation time 
initialisation.computation.time 
Elapsed computation time for standardisation and calculation of distance matrix 
The further components returned by EAdet are:
infected 
Indicator of infection 
infection.time 
Time of infection 
outind 
Indicator of outliers 
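The structure described above can be navigated as in the following sketch. The list det.res here is a hand-built mock with made-up values that only mimics the documented component names; it is not output from EAdet, and it is an assumption that outind can be coerced to logical.

```r
# Mock result with the documented layout (values are invented for illustration)
det.res <- list(
  output = list(sample.size = 5, cutpoint = 4,
                number.outliers = 1, outliers = 5),
  infected = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  infection.time = c(1, 2, 2, 3, 9),
  outind = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Summary values live in the 'output' sublist
summary.values <- det.res$output[c("cutpoint", "number.outliers")]

# Per-observation results are top-level components
flagged <- which(as.logical(det.res$outind))
late <- which(det.res$infection.time > det.res$output$cutpoint)
```

With a real result, the same accessors can be used to apply a manually chosen cutpoint to infection.time instead of the default stored in output$cutpoint.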
Author(s)
Beat Hulliger
References
Béguin, C. and Hulliger, B. (2004). Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations. Journal of the Royal Statistical Society, Series A, 167(2), 275-294.
See Also
EAimp for imputation with the Epidemic Algorithm.
Examples
data(bushfirem, bushfire.weights)
det.res <- EAdet(bushfirem, bushfire.weights)
print(det.res$output)