bagdistance: Bagdistance of points relative to a dataset


bagdistance    R Documentation

Bagdistance of points relative to a dataset

Description

Computes the bagdistance of p-dimensional points z relative to a p-dimensional dataset x. To compute the bagdistance of a point z_i, the bag of x is first computed as the depth region containing the 50% of the observations of x with the largest halfspace depth. Next, the ray from the halfspace median \theta through z_i is considered, and c_z is defined as the intersection of this ray with the boundary of the bag. The bagdistance of z_i to x is then the ratio of the Euclidean distance from z_i to the halfspace median to the Euclidean distance from c_z to the halfspace median.
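
The ratio described above can be written compactly as follows. The symbol bd is introduced here only for illustration and is not notation taken from the package; \lVert \cdot \rVert denotes the Euclidean norm.

```latex
\mathrm{bd}(z_i; x) = \frac{\lVert z_i - \theta \rVert}{\lVert c_z - \theta \rVert}
```

Under this definition a point on the boundary of the bag has bagdistance 1, points inside the bag have bagdistance below 1, and points outside the bag have bagdistance above 1.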

Usage

bagdistance(x, z = NULL, options = list())

Arguments

x

An n by p data matrix.

z

An optional m by p matrix containing rowwise the points z_i for which to compute the bagdistance. If z is not specified, it is set equal to x.

options

A list of available options:

  • approx
    In two dimensions one may choose to use an approximate algorithm or the exact algorithm to find the bag.
    Defaults to TRUE.

  • max.iter
    The maximum number of steps in the bisection algorithm to find the intersection point c_z (see Details).
    Defaults to 100.

  • Any option accepted by the hdepth function may also be specified; see hdepth for details. Note that the option approx is set to TRUE by default to save computation time.

Details

The bagdistance was introduced in Hubert et al. (2015) and studied further in Hubert et al. (2017). It does not assume symmetry and is affine invariant. Note that when the halfspace depth is not computed in an affine invariant way, the bagdistance cannot be affine invariant either.

The function first computes the halfspace depth and the halfspace median of x. Additional options may be passed to the hdepth routine by specifying them in the options list argument.

It is first checked whether the data lie in a subspace of dimension smaller than p. If so, a warning is given, as well as the dimension of the subspace and a direction which is orthogonal to it.

Different algorithms are used depending on the dimension p. For p=1 the bagdistance is computed exactly. For p=2 the default setting (options$approx=TRUE) uses an approximate algorithm. Exact computation, based on the exact algorithm for computing the contours of the bag (see the depthContour function), is obtained by setting options$approx to FALSE. Note that this may increase the computation time.

For the approximate algorithm, the intersection point c_z is approximated by searching along each ray for the point whose depth equals the median of the depth values of x. As the halfspace depth is monotone decreasing along the ray, a bisection algorithm is used. Starting limits are obtained by projecting the data on the direction and considering the data point whose univariate depth corresponds to the median of the halfspace depths of x. By definition, the multivariate depth of this point is lower than or equal to its univariate depth. A second limit is given by the deepest location estimate. The maximum number of iterations for bisecting the current search interval can be specified through the max.iter entry of the options argument.
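
The bisection idea can be illustrated with a minimal base R sketch. It is not the package's implementation: toy_depth is a stand-in depth function chosen only because it decreases monotonically along every ray from the center, and bisect_ray, its arguments, and the tolerance tol are hypothetical names used for this illustration.

```r
# Toy depth function (an assumption for illustration, NOT halfspace depth):
# it equals 1 at the center and decreases monotonically along every ray.
toy_depth <- function(y, center) 1 / (1 + sum((y - center)^2))

# Bisection along the ray center + t * direction for the value of t at
# which the depth equals `target`. The point at `lower` must have depth
# >= target and the point at `upper` must have depth <= target.
bisect_ray <- function(depth_fun, center, direction, target,
                       lower = 0, upper = 10, max.iter = 100, tol = 1e-8) {
  converged <- FALSE
  for (iter in seq_len(max.iter)) {
    mid <- (lower + upper) / 2
    if (depth_fun(center + mid * direction, center) > target) {
      lower <- mid  # still deeper than the target: move outward
    } else {
      upper <- mid  # at or below the target depth: move inward
    }
    if (upper - lower < tol) {
      converged <- TRUE
      break
    }
  }
  list(t = (lower + upper) / 2, converged = converged)
}

center <- c(0, 0)
direction <- c(1, 1) / sqrt(2)  # unit vector defining the ray
res <- bisect_ray(toy_depth, center, direction, target = 0.5)

# Distance ratio for a point z on the same ray, mirroring the definition
# of the bagdistance with res$t playing the role of the distance to c_z.
z <- c(3, 3)
ratio <- sqrt(sum((z - center)^2)) / res$t
```

For this toy depth the target 0.5 is reached at distance 1 from the center, so res$t is close to 1. The converged element plays the same role as the converged component described under Value: it reports whether the interval shrank below the tolerance within max.iter steps.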

An observation from z is flagged as an outlier if its bagdistance exceeds a cutoff value. This cutoff is equal to the square root of the 0.99 quantile of the chi-squared distribution with p degrees of freedom.
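
As a small sketch of this rule in base R, the cutoff for p = 2 can be computed with qchisq. The bagdistance values in bd are made up for illustration, and the flag follows the convention described under Value (FALSE for points exceeding the cutoff).

```r
p <- 2                                  # dimension of the data
cutoff <- sqrt(qchisq(0.99, df = p))    # square root of the 0.99 chi-squared quantile

bd <- c(0.4, 1.2, 3.5)                  # hypothetical bagdistance values
flag <- bd <= cutoff                    # FALSE marks a potential outlier
```

Only the last hypothetical value exceeds the cutoff, so it is the only one with a flag of FALSE.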

Value

A list with components:

bagdistance

The bagdistance of the points of z with respect to the data matrix x.

cutoff

Points of z whose bagdistance exceeds this cutoff can be considered as outliers with respect to x.

flag

Points of z whose bagdistance exceeds the cutoff receive a flag equal to FALSE; all other points receive a flag equal to TRUE.

converged

Vector of length m indicating for each point of z whether the bisection algorithm converged within the maximum number of steps specified by max.iter in the options list.

dimension

When the data x are lying in a lower dimensional subspace, the dimension of this subspace.

hyperplane

When the data x are lying in a lower dimensional subspace, a direction orthogonal to this subspace.

Author(s)

P. Segaert.

References

Hubert M., Rousseeuw P.J., Segaert P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24, 177–202.

Hubert M., Rousseeuw P.J., Segaert P. (2017). Multivariate and functional classification using depth and distance. Advances in Data Analysis and Classification, 11, 445–466.

See Also

depthContour, hdepth, bagplot

Examples

# Generate some bivariate data
set.seed(5)
nObs <- 500
XS <- matrix(rnorm(nObs * 2), nrow = nObs, ncol = 2)
A <- matrix(c(1,1,.5,.1), ncol = 2, nrow = 2)
X <- XS %*% A

# In two dimensions we may either use the approximate
# or the exact algorithm to compute the bag.
respons.exact <- bagdistance(x = X, options = list(approx = FALSE))
respons.approx <- bagdistance(x = X, options = list(approx = TRUE))
# Both algorithms yield fairly similar results.
plot(respons.exact$bagdistance, respons.approx$bagdistance)
abline(a = 0, b = 1)

# In Hubert et al. (2015) it was shown that for elliptical
# distributions the squared bagdistance relates to the 
# squared Mahalanobis distances. This may be easily illustrated.
mahDist <- mahalanobis(x = X, colMeans(X), cov(X))
plot(respons.exact$bagdistance^2, mahDist)

# Computation of the bagdistance relies on the computation
# of halfspace depth using the hdepth function. Options for
# the hdepth routine can be passed down using the options
# arguments. Note that the bagdistance is only affine invariant
# if the halfspace depth is computed in an affine invariant way. 
options <- list(type = "Rotation",
                ndir = 375,
                approx = TRUE,
                seed = 78341)
respons.approx.rot <- bagdistance(x = X, options = options)
plot(respons.exact$bagdistance, respons.approx.rot$bagdistance)
abline(a = 0, b = 1)

mrfDepth documentation built on May 29, 2024, 5:04 a.m.