kernelDist: Kernel Density Distance
In NESCent/MINOTAUR: MultIvariate visualisatioN and OuTlier Analysis Using R


Calculates the kernel density of each point relative to all other points in multivariate space, and returns -2*log(density) as a distance measure. The data are subset before distances are calculated (see details).

kernelDist(dfv, column.nums = 1:ncol(dfv), subset = 1:nrow(dfv),
  bandwidth = "default", S = NULL)

`dfv`	a data frame containing observations in rows and statistics in columns.
`column.nums`	indexes the columns of the data frame that will be used to calculate kernel density distances (all other columns are ignored).
`subset`	indexes the rows of the data frame that will be used to calculate the covariance matrix (unless the matrix is specified manually through `S`).
`bandwidth`	standard deviation of the normal kernel in each dimension. Can be a numerical value, or can be set to "default", in which case Silverman's rule is used to select the bandwidth.
`S`	the covariance matrix that the bandwidth is multiplied by. Leave as NULL to use the ordinary covariance matrix calculated using cov(dfv[subset,column.nums]).

Takes a matrix or data frame as input, with observations in rows and statistics in columns. The parameter "column.nums" selects which columns to use in the analysis; all other columns are ignored. The covariance matrix is then calculated on a subset of these data, specified using the parameter "subset" (which defaults to all observations), and the kernel bandwidth is multiplied by this covariance matrix. Alternatively, the matrix can be specified manually through the argument "S". The kernel density deviance of a point is calculated as -2*log(density) of this point from all other points in the chosen subset. The function assumes a multivariate normal kernel with the same user-defined bandwidth in all dimensions (after normalization).
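The calculation described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: the helper name `kernel_deviance_sketch` is hypothetical, the kernel covariance is assumed to be bandwidth^2 * S, the "default" bandwidth is assumed to be the usual multivariate form of Silverman's rule of thumb, and `subset` is assumed to hold integer row indices.

```r
# Hypothetical helper, NOT part of MINOTAUR: a base-R sketch of a
# leave-one-out kernel density deviance, -2 * log(density).
kernel_deviance_sketch <- function(x, subset = seq_len(nrow(x)),
                                   bandwidth = "default", S = NULL) {
  x <- as.matrix(x)
  ref <- x[subset, , drop = FALSE]   # points that contribute kernels
  n <- nrow(ref)
  d <- ncol(x)
  if (is.null(S)) S <- cov(ref)      # ordinary covariance of the subset
  if (identical(bandwidth, "default")) {
    # Assumption: Silverman's rule of thumb for a d-dimensional normal kernel
    bandwidth <- (4 / (d + 2))^(1 / (d + 4)) * n^(-1 / (d + 4))
  }
  H <- bandwidth^2 * S               # assumed kernel covariance
  log_det <- as.numeric(determinant(H, logarithm = TRUE)$modulus)
  log_norm <- -0.5 * (d * log(2 * pi) + log_det)
  vapply(seq_len(nrow(x)), function(i) {
    keep <- subset != i              # leave the point itself out
    # squared Mahalanobis distance from each remaining point to x[i, ]
    m2 <- mahalanobis(ref[keep, , drop = FALSE], center = x[i, ], cov = H)
    log_k <- log_norm - 0.5 * m2     # log kernel values
    mx <- max(log_k)                 # log-sum-exp for numerical stability
    -2 * (mx + log(mean(exp(log_k - mx))))
  }, numeric(1))
}

# Usage: 100 standard normal points plus one point placed far away
set.seed(1)
pts <- data.frame(x = c(rnorm(100), 8), y = c(rnorm(100), 8))
dev <- kernel_deviance_sketch(pts)
which.max(dev)   # index of the point with the lowest estimated density
```

Working on the log scale avoids underflow for points that lie far from every other observation, which are exactly the points of interest when looking for outliers.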

Note that this method cannot handle NA values.
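One way to meet this requirement is to drop incomplete rows before calling the function. The snippet below is a minimal sketch using base R's `complete.cases`; the data frame `stats_df` and its values are made up for illustration.

```r
# Hypothetical data with missing values in the statistic columns
stats_df <- data.frame(x = c(0.1, NA, 1.3, -0.7),
                       y = c(2.0, 0.5, NA, 1.1))

# Keep only the rows in which every column is observed
complete_rows <- complete.cases(stats_df)
clean_df <- stats_df[complete_rows, , drop = FALSE]

# clean_df contains no NA values and could now be passed on,
# e.g. kernelDist(clean_df)
```

Keeping `complete_rows` makes it possible to map the resulting distances back to the original row positions.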

Robert Verity r.verity@imperial.ac.uk

## Not run: 
# load the package and create a data frame of observations
library(MINOTAUR)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# calculate kernel density distances
distances <- kernelDist(df)

# use this distance to look for outliers
Q95 <- quantile(distances, 0.95)
which(distances > Q95)

## End(Not run)