Epidemic Algorithm for detection of multivariate outliers in incomplete survey data.

Description

In EAdet an epidemic is started at a center of the data. The epidemic spreads out and infects neighbouring points (probabilistically or deterministically). The last points infected are outliers. After running EAdet an imputation with EAimp may be run.

Usage

EAdet(data, weights, reach = "max", transmission.function = "root",
      power = ncol(data), distance.type = "euclidean", maxl = 5,
      plotting = TRUE, monitor = FALSE, prob.quantile = 0.9,
      random.start = FALSE, fix.start, threshold = FALSE,
      deterministic = TRUE, rm.missobs = FALSE, verbose = FALSE)

Arguments

data

a data frame or matrix with the data

weights

a vector of positive sampling weights

reach

if reach="max" the maximal nearest neighbour distance is used as the basis for the transmission function, otherwise the weighted (1-(p+1)/n) quantile of the nearest neighbour distances is used.

transmission.function

form of the transmission function of the distance d: "step" is a Heaviside function which jumps to 1 at d0, "linear" is linear between 0 and d0, "power" is (beta*d+1)^(-p) with p=ncol(data) as default, "root" is the function 1-(1-d/d0)^(1/maxl)

power

sets p=power

distance.type

distance type in function dist()

maxl

Maximum number of steps without infection

plotting

if TRUE the cdf of infection times is plotted

monitor

if TRUE verbose output on epidemic

prob.quantile

If the mads fail, i.e. if one of the mads is 0, use this quantile of the absolute deviations instead

random.start

If TRUE take a starting point at random instead of the spatial median

fix.start

Force epidemic to start at a specific observation

threshold

Infect all remaining points with infection probability above the threshold 1-0.5^(1/maxl)

deterministic

if TRUE the number of infections is the expected number and the infected observations are the ones with largest infection probabilities.

rm.missobs

Set rm.missobs=TRUE if completely missing observations should be discarded. This has to be done actively as a safeguard to avoid mismatches when imputing.

verbose

More output with verbose=TRUE.
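The shapes listed under transmission.function, and the cut-off used by threshold, can be sketched in a few lines of base R. This is only an illustration of the documented formulas, not the package code: the helper names tf_step, tf_linear, tf_power and tf_root are hypothetical, beta is not defined in this page and is therefore passed in as a free parameter, and capping "linear" and "root" at d0 is an assumption.

```r
# Hypothetical helpers (not part of the package); d0 is the transmission distance.
tf_step   <- function(d, d0) as.numeric(d >= d0)       # jumps to 1 at d0
tf_linear <- function(d, d0) pmin(d / d0, 1)           # linear on [0, d0], capped at 1 (assumption)
tf_power  <- function(d, beta, p) (beta * d + 1)^(-p)  # beta is a free parameter here
tf_root   <- function(d, d0, maxl) 1 - (1 - pmin(d / d0, 1))^(1 / maxl)

# Evaluate each shape on a small grid of distances
d <- c(0, 0.5, 1, 2)
step_vals   <- tf_step(d, d0 = 1)
linear_vals <- tf_linear(d, d0 = 1)
power_vals  <- tf_power(d, beta = 1, p = 2)
root_vals   <- tf_root(d, d0 = 1, maxl = 5)

# Cut-off documented for threshold = TRUE, using the default maxl = 5
threshold_value <- 1 - 0.5^(1 / 5)
```

With the "root" form, the threshold value 1-0.5^(1/maxl) is the same number the function takes at d = d0/2, which follows directly from the two formulas.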

Details

The form and parameters of the transmission function should be chosen such that the infection times have at least a range of 10. The default cutting point to decide on outliers is the median infection time plus three times the mad of infection times. A better cutpoint may be chosen by visual inspection of the cdf of infection times.
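The default cutpoint rule can be reproduced by hand, which is useful when choosing a different cutpoint after inspecting the cdf. The sketch below uses made-up infection times rather than EAdet output, and it assumes the unweighted stats::mad() with its default scaling constant; the package may use a weighted or differently scaled version, and whether the comparison is strict is also an assumption.

```r
# Made-up infection times for 12 observations (not package output)
infection.time <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 12, 15)

# The text recommends infection times spanning a range of at least 10
time_range <- diff(range(infection.time))

# Default rule from the text: median + 3 * mad.
# stats::mad() multiplies by 1.4826 by default (assumed to be acceptable here).
cutpoint <- median(infection.time) + 3 * mad(infection.time)
outliers <- which(infection.time > cutpoint)

# Visual check: empirical cdf of infection times with the cutpoint marked
plot(ecdf(infection.time), main = "cdf of infection times")
abline(v = cutpoint, lty = 2)
```

If the cdf shows a clear gap somewhere other than the dashed line, that gap is a candidate for a manually chosen cutpoint.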

EAdet calls the function EA.dist, which passes the counterprobabilities of infection (a vector of size n*(n-1)/2!) and three parameters (sample spatial median index, maximal distance to nearest neighbor and transmission distance=reach) as arguments to EA.det. The distance vector may be too large to be passed as an argument; in that case the memory size must be increased. Former versions of the code used a global variable to store the distances in order to save memory.
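To judge in advance whether the n*(n-1)/2 vector will fit in memory, its size can be estimated before calling EAdet. The following is a rough sketch: n_pairs is a hypothetical helper, and 8 bytes per value (a double) is an assumption about storage, not a statement about how the package stores the vector.

```r
# Number of pairwise distances for n observations
n_pairs <- function(n) n * (n - 1) / 2

# dist() stores exactly this many values for a 5-row matrix
x <- matrix(rnorm(5 * 3), nrow = 5)
same_length <- length(dist(x)) == n_pairs(5)

# Rough memory footprint in GiB for n = 20000, assuming 8 bytes per value
n <- 20000
gib <- n_pairs(n) * 8 / 1024^3
```

The quadratic growth means that doubling the number of observations roughly quadruples the memory needed.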

Value

EAdet returns a list whose first component output is a sub-list with the following components:

sample.size

Number of observations

discarded.observations

Indices of discarded observations

missing.observations

Indices of completely missing observations

number.of.variables

Number of variables

n.complete.records

Number of records without missing values

n.usable.records

Number of records with fewer than half of their values missing (unusable observations are discarded)

medians

Component wise medians

mads

Component wise mads

prob.quantile

Use this quantile if mads fail, i.e. if one of the mads is 0.

quantile.deviations

Quantile of absolute deviations.

start

Starting observation

transmission.function

Input parameter

power

Input parameter

maxl

Maximum number of steps without infection

min.nn.dist

maximal nearest neighbor distance

transmission.distance

d0

threshold

Input parameter

distance.type

Input parameter

deterministic

Input parameter

number.infected

Number of infected observations

cutpoint

Cutpoint of infection times for outlier definition

number.outliers

Number of outliers

outliers

Indices of outliers

duration

Duration of epidemic

computation.time

Elapsed computation time

initialisation.computation.time

Elapsed computation time for standardisation and calculation of the distance matrix

The further components returned by EAdet are:

infected

Indicator of infection

infection.time

Time of infection

outind

Indicator of outliers

Author(s)

Beat Hulliger

References

Béguin, C., and Hulliger, B. (2004). Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations. Journal of the Royal Statistical Society, Series A, 167(2), 275-294.

See Also

EAimp for imputation with the Epidemic Algorithm.

Examples
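A minimal usage sketch on synthetic data, following the signature under Usage. It is not the package's own example: the data, the planted outliers and the equal weights are made up, and it assumes EAdet is available from the modi package, so the call is skipped when that package is not installed.

```r
set.seed(1)
n <- 60
x <- matrix(rnorm(n * 3), ncol = 3)
x[1:3, ] <- x[1:3, ] + 8   # three planted outliers (synthetic)
x[10, 2] <- NA             # a few missing values
x[25, 3] <- NA
w <- rep(1, n)             # equal sampling weights (assumption)

# Run only if the package is installed (assumed to be 'modi')
if (requireNamespace("modi", quietly = TRUE)) {
  res <- modi::EAdet(x, w, plotting = FALSE)
  print(which(res$outind))  # indices flagged as outliers
}
```

Setting plotting = FALSE suppresses the cdf plot; leaving it at the default shows the curve that the Details section recommends inspecting.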
