acc_multivariate_outlier: Function to calculate and plot Mahalanobis distances


acc_multivariate_outlier    R Documentation

Function to calculate and plot Mahalanobis distances

Description

A standard tool for detecting multivariate outliers is the Mahalanobis distance. It helps to assess the plausibility of a measurement given the values of the other measurements. In this approach, the Mahalanobis distance is itself treated as a univariate measure, and the same rules are applied to identify outliers as for univariate outliers:

  • the classical approach from Tukey: values more than 1.5 * IQR below the 1st quartile (Q_{25}) or above the 3rd quartile (Q_{75}).

  • the 6*σ approach, i.e. any value of the Mahalanobis distance outside the interval \bar{x} \pm 3*σ is considered an outlier.

  • the approach from Hubert for skewed distributions, which is implemented in the R package robustbase.

  • a completely heuristic approach named σ-gap.

For further details, please see the vignette on univariate outliers.
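As a rough illustration of the first two rules, the following base R sketch applies them to a small vector of distances. The vector d and the object names are hypothetical and only serve the example; this is not the package's internal code. The Hubert and σ-gap rules are left out because their details are not given here.

```r
# Hypothetical distances; in practice these would be Mahalanobis distances.
d <- c(1.1, 0.8, 1.4, 0.9, 1.2, 1.0, 1.3, 0.7, 1.5, 9.0)

# Tukey rule: more than 1.5 * IQR below Q25 or above Q75.
q <- quantile(d, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
tukey <- as.integer(d < q[1] - 1.5 * iqr | d > q[2] + 1.5 * iqr)

# 6*sigma rule: outside the interval mean +/- 3 * standard deviation.
sixsigma <- as.integer(abs(d - mean(d)) > 3 * sd(d))

# One row per value, coded 0 (no outlier) or 1 (outlier) per rule.
flags <- data.frame(d = d, tukey = tukey, sixsigma = sixsigma)
```

With these ten values, the Tukey rule flags the last value while the 6*σ rule does not, which shows that the rules can disagree on the same data.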

Usage

acc_multivariate_outlier(
  resp_vars,
  id_vars = NULL,
  label_col,
  n_rules = 4,
  study_data,
  meta_data
)

Arguments

resp_vars

variable list the names of the continuous measurement variables

id_vars

variable optional, an ID variable of the study data. If not specified, row numbers are used.

label_col

variable attribute the name of the column in the metadata with labels of variables

n_rules

numeric from=1 to=4. the number of rules that must be violated for an observation to be classified as an outlier

study_data

data.frame the data frame that contains the measurements

meta_data

data.frame the data frame that contains metadata attributes of study data

Value

a list with:

  • SummaryTable: data.frame underlying the plot

  • SummaryPlot: ggplot2 outlier plot

  • FlaggedStudyData: data.frame containing the original data frame with the additional columns tukey, sixsigma, hubert, and sigmagap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
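A minimal sketch of how the flag columns might be used to exclude observations is shown below. The data frame flagged is a hypothetical stand-in for FlaggedStudyData, and the names id, n_violated and min_rules are assumptions made for this example only.

```r
# Hypothetical stand-in for the FlaggedStudyData element of the result.
flagged <- data.frame(
  id       = 1:5,
  tukey    = c(0, 1, 0, 1, 0),
  sixsigma = c(0, 1, 0, 0, 0),
  hubert   = c(0, 1, 0, 1, 0),
  sigmagap = c(0, 1, 0, 0, 0)
)

# Count how many of the four rules each observation violates.
rule_cols <- c("tukey", "sixsigma", "hubert", "sigmagap")
flagged$n_violated <- rowSums(flagged[, rule_cols])

# Keep only observations that violate fewer than min_rules rules.
min_rules <- 2
cleaned <- flagged[flagged$n_violated < min_rules, ]
```

Here the observations with id 2 and 4 are dropped, since they violate four and two rules respectively.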

ALGORITHM OF THIS IMPLEMENTATION:

  • Implementation is restricted to variables of type float

  • Remove missing codes from the study data (if defined in the metadata)

  • The covariance matrix is estimated for all resp_vars

  • The Mahalanobis distance of each observation is calculated as MD^2_i = (x_i - μ)^T Σ^{-1} (x_i - μ)

  • The four rules mentioned above are applied to this distance for each observation in the study data

  • An output data frame is generated that flags each outlier

  • A parallel coordinate plot indicates respective outliers
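The distance step can be sketched in base R as follows, assuming that μ is estimated by the column means and Σ by the sample covariance matrix. The data matrix x is simulated and hypothetical; the package may handle estimation and missing values differently.

```r
# Hypothetical data: 50 observations of two continuous measurements.
set.seed(1)
x <- cbind(a = rnorm(50), b = rnorm(50))

# Estimate the center and the covariance matrix from the data.
mu <- colMeans(x)
sigma <- cov(x)

# Squared Mahalanobis distance of every observation (stats::mahalanobis).
md2 <- mahalanobis(x, center = mu, cov = sigma)

# The same quantity for the first observation, written out as
# (x_i - mu)^T Sigma^{-1} (x_i - mu).
diff1 <- x[1, ] - mu
md2_manual <- as.numeric(t(diff1) %*% solve(sigma) %*% diff1)
```

The vector md2 is then the univariate measure to which the outlier rules are applied.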

List function.

See Also

Online Documentation


dataquieR documentation built on Aug. 31, 2022, 5:08 p.m.