acc_multivariate_outlier: Calculate and plot Mahalanobis distances

View source: R/acc_multivariate_outlier.R

acc_multivariate_outlier    R Documentation

Calculate and plot Mahalanobis distances


A standard tool for detecting multivariate outliers is the Mahalanobis distance. It is particularly helpful for judging the plausibility of a measurement given the value of another. Here, the Mahalanobis distance is itself treated as a univariate measure, and the same rules are applied to it as for the identification of univariate outliers:

  • the classical approach from Tukey: values more than 1.5 * IQR below the 1st quartile (Q_{25}) or above the 3rd quartile (Q_{75}).

  • the 6 * \sigma approach, i.e. any value of the Mahalanobis distance outside the interval \bar{x} \pm 3 * \sigma is considered an outlier.

  • the approach from Hubert for skewed distributions, which is embedded in the R package robustbase.

  • a completely heuristic approach named \sigma-gap.

For further details, please see the vignette on univariate outliers.
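The first two rules can be sketched in base R. This is only an illustration of the rules as described above, not dataquieR's internal code; the helper names `tukey_flag` and `sixsigma_flag` and the toy vector `d` are hypothetical.

```r
# Hypothetical helpers for illustration; they are not part of dataquieR.

# Tukey: flag values more than 1.5 * IQR below Q25 or above Q75.
tukey_flag <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  as.integer(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Six sigma: flag values outside mean +/- 3 standard deviations.
sixsigma_flag <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- stats::sd(x, na.rm = TRUE)
  as.integer(abs(x - m) > 3 * s)
}

# Toy distances with one clearly implausible value at the end.
d <- c(1.1, 0.9, 1.3, 1.0, 1.2, 0.8, 1.4, 1.0, 1.1, 0.9, 1.2, 1.0, 9.5)
tukey_flag(d)
sixsigma_flag(d)
```

For this toy vector both rules flag only the last value. The Hubert rule and the \sigma-gap heuristic are not sketched here.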


  variable_group = NULL,
  id_vars = NULL,
  n_rules = 4,
  max_non_outliers_plot = 10000,
  criteria = c("tukey", "sixsigma", "hubert", "sigmagap"),



variable_group: variable list. The names of the continuous measurement variables forming a group for which multivariate outlier detection makes sense.


id_vars: variable. Optional, an ID variable of the study data. If not specified, row numbers are used.


variable attribute. The name of the column in the metadata that holds the variable labels.


n_rules: numeric from=1 to=4. The number of rules that must be violated for an observation to be classified as an outlier.


max_non_outliers_plot: integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample is plotted. Note that the sampling is not deterministic.


criteria: set tukey | sixsigma | hubert | sigmagap. A vector of the methods to be used for detecting outliers.


data.frame. The data frame that contains the measurements.


data.frame. The data frame that contains metadata attributes of the study data.


a list with:

  • SummaryTable: data.frame underlying the plot

  • SummaryPlot: ggplot2 outlier plot

  • FlaggedStudyData: data.frame containing the original data with the additional columns tukey, sixsigma, hubert, and sigmagap. In each of these columns, an observation is coded 0 if the respective rule detected no outlier and 1 if it detected an outlier. This can be used to exclude observations with outliers.
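As a minimal sketch of how the 0/1 columns could be used to exclude observations, the following works on a made-up stand-in for FlaggedStudyData. The data frame `flagged`, its `id` column, and the threshold of 2 are assumptions for illustration only.

```r
# Toy stand-in for FlaggedStudyData; the values are made up.
flagged <- data.frame(
  id       = 1:5,
  tukey    = c(0, 1, 0, 1, 1),
  sixsigma = c(0, 0, 0, 1, 1),
  hubert   = c(0, 1, 0, 1, 0),
  sigmagap = c(0, 0, 0, 1, 0)
)

rule_cols <- c("tukey", "sixsigma", "hubert", "sigmagap")
n_rules <- 2  # assumed threshold for this example

# Count the violated rules per observation, then keep only the rows
# that violate fewer than n_rules rules.
n_violated <- rowSums(flagged[, rule_cols])
cleaned <- flagged[n_violated < n_rules, ]
cleaned$id
```

With these toy values, observations 2, 4, and 5 violate at least two rules, so only observations 1 and 3 remain.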


  • Implementation is restricted to variables of type float

  • Missing codes are removed from the study data (if defined in the metadata)

  • The covariance matrix is estimated for all variables from variable_group

  • The (squared) Mahalanobis distance of each observation is calculated as MD^2_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)

  • The four rules mentioned above are applied to this distance for each observation in the study data

  • An output data frame is generated that flags each outlier

  • A parallel coordinate plot indicates respective outliers
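The distance computation in the steps above can be sketched with base R's `stats::mahalanobis()`. This is an illustration under assumptions, not the package's internal code: the toy matrix `x` and its column names are made up, and the sample mean and sample covariance are used as estimates of \mu and \Sigma.

```r
set.seed(1)  # only to make the toy data reproducible
x <- cbind(
  height = rnorm(200, mean = 170, sd = 10),
  weight = rnorm(200, mean = 70, sd = 12)
)

mu <- colMeans(x)   # estimate of the mean vector
sigma <- cov(x)     # estimate of the covariance matrix

# stats::mahalanobis() returns the squared distance MD^2_i for each row.
md2 <- stats::mahalanobis(x, center = mu, cov = sigma)

# The same quantity written out explicitly for the first observation.
d1 <- x[1, ] - mu
md2_first <- as.numeric(t(d1) %*% solve(sigma) %*% d1)
```

The resulting vector `md2` is univariate, so rules such as Tukey's fences can be applied to it directly.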

List function.

See Also

Online Documentation

dataquieR documentation built on July 26, 2023, 6:10 p.m.