acc_multivariate_outlier | R Documentation |
A standard tool for detecting multivariate outliers is the Mahalanobis distance. This approach helps to judge the plausibility of a measurement given the value of another. Here, the Mahalanobis distance is itself treated as a univariate measure, and the same rules are applied to it as for the identification of univariate outliers:
the classical approach from Tukey: 1.5 * IQR from the 1st (Q_{25}) or 3rd (Q_{75}) quartile.
the 3SD approach, i.e. any measurement of the Mahalanobis distance not in the interval of \bar{x} \pm 3*\sigma is considered an outlier.
the approach from Hubert for skewed distributions which is embedded in the R package robustbase
a completely heuristic approach named \sigma-gap.
For further details, please see the vignette for univariate outliers.
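The first two rules can be sketched in base R as follows. This is only an illustration, not the package's implementation: the vector `d` is made-up data standing in for the Mahalanobis distances, and counting an observation as an outlier when at least `n_rules` rules are violated is an assumed reading of that argument. The Hubert and \sigma-gap rules are left out because they depend on details not given here.

```r
# Hypothetical distances: 20 unremarkable values and one extreme value.
d <- c(rep(c(0.8, 0.9, 1.0, 1.1, 1.2), times = 4), 9.0)

# Tukey rule: flag values more than 1.5 * IQR below Q25 or above Q75.
q <- quantile(d, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
tukey <- as.integer(d < q[1] - 1.5 * iqr | d > q[2] + 1.5 * iqr)

# 3SD rule: flag values outside mean +/- 3 * standard deviation.
three_sd <- as.integer(abs(d - mean(d)) > 3 * sd(d))

# Assumed combination step: count violated rules per observation and
# compare with a threshold playing the role of n_rules.
n_rules <- 2
n_violated <- tukey + three_sd
outlier <- n_violated >= n_rules

which(outlier)
```

Note that the 3SD rule cannot flag anything in very small samples, because a single extreme value inflates the standard deviation it is compared against.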
Indicator
acc_multivariate_outlier(
variable_group = NULL,
id_vars = NULL,
label_col,
n_rules = 4,
max_non_outliers_plot = 10000,
criteria = c("tukey", "3sd", "hubert", "sigmagap"),
study_data,
meta_data
)
variable_group |
variable list the names of the continuous measurement variables that form a group for which multivariate outlier detection makes sense. |
id_vars |
variable optional, an ID variable of the study data. If not specified, row numbers are used. |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
n_rules |
numeric from=1 to=4. the number of rules that must be violated for an observation to be classified as an outlier |
max_non_outliers_plot |
integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample will be plotted. Note that sampling is not deterministic. |
criteria |
set tukey | 3SD | hubert | sigmagap. a vector with methods to be used for detecting outliers. |
study_data |
data.frame the data frame that contains the measurements |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
a list with:
SummaryTable: data.frame underlying the plot
SummaryPlot: ggplot2 outlier plot
FlaggedStudyData: data.frame that contains the original data frame with the additional columns tukey, 3SD, hubert, and sigmagap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
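As a minimal sketch of how the flag columns might be used to exclude observations: the data frame `flagged` below is a hypothetical stand-in for FlaggedStudyData (the variables `id` and `sbp` and all values are invented), and the threshold of two violated rules is an arbitrary choice for illustration.

```r
# Hypothetical stand-in for the FlaggedStudyData element of the result.
# check.names = FALSE keeps the column name "3SD" as it is.
flagged <- data.frame(
  id = 1:5,
  sbp = c(120, 118, 250, 125, 122),
  tukey = c(0, 0, 1, 0, 0),
  `3SD` = c(0, 0, 1, 0, 0),
  hubert = c(0, 0, 1, 1, 0),
  sigmagap = c(0, 0, 0, 0, 0),
  check.names = FALSE
)

# Count how many rules flagged each observation.
flag_cols <- c("tukey", "3SD", "hubert", "sigmagap")
n_flags <- rowSums(flagged[, flag_cols])

# Keep only observations flagged by fewer than two rules.
cleaned <- flagged[n_flags < 2, ]
cleaned$id
```

Because `3SD` is not a syntactic R name, it has to be quoted with backticks when accessed with `$`, for example ``flagged$`3SD` ``.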
Implementation is restricted to variables of type float
Missing codes are removed from the study data (if defined in the metadata)
The covariance matrix is estimated for all variables from variable_group
The Mahalanobis distance of each observation is calculated
MD^2_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)
The four rules mentioned above are applied to this distance for each observation in the study data
An output data frame is generated that flags each outlier
A parallel coordinate plot indicates respective outliers
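The distance step can be sketched in base R. This assumes the classical sample mean and sample covariance as estimates of \mu and \Sigma; whether the function uses exactly these estimators is not stated here. The matrix `x` is made-up data for a hypothetical two-variable group.

```r
# Hypothetical variable group; row 6 combines a small height with a
# large weight, which is implausible given the other rows.
x <- cbind(
  height = c(160, 165, 170, 175, 180, 162),
  weight = c(55, 60, 66, 72, 78, 95)
)

# Estimates of the center and the covariance matrix.
mu <- colMeans(x)
sigma <- cov(x)

# stats::mahalanobis() returns the squared distance MD^2_i for each row.
md2 <- mahalanobis(x, center = mu, cov = sigma)

# The same quantity written out from the formula
# (x_i - mu)^T Sigma^{-1} (x_i - mu), evaluated row by row.
centered <- sweep(x, 2, mu)
md2_manual <- rowSums((centered %*% solve(sigma)) * centered)

# Row with the largest distance.
which.max(md2)
```

The resulting vector `md2` is the univariate measure to which the outlier rules would then be applied.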
List function.