View source: R/mahalanobis_distance.R
mahalanobis_distance | R Documentation |
Pipe-friendly wrapper around the function mahalanobis(), which returns the squared Mahalanobis distance of all rows in x. Compared to the base function, it automatically flags multivariate outliers.
Mahalanobis distance is a common metric used to identify multivariate outliers. The larger the value of Mahalanobis distance, the more unusual the data point (i.e., the more likely it is to be a multivariate outlier).
The distance tells us how far an observation is from the center of the cloud, taking into account the shape (covariance) of the cloud as well.
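As a rough illustration of this definition, the squared distance of a row x from the center mu, given covariance S, is (x - mu)' S^-1 (x - mu). The base-R sketch below computes that quantity by hand for the four numeric columns of the built-in iris data and checks it against stats::mahalanobis(). The variable names (x, mu, s, d2_manual, d2_base) are illustrative and not part of the package.

```r
# Numeric columns of the built-in iris data set
x <- as.matrix(iris[, 1:4])
mu <- colMeans(x)   # center of the cloud
s <- cov(x)         # shape (covariance) of the cloud

# Squared distance by hand: (x - mu)' S^-1 (x - mu) for every row
centered <- sweep(x, 2, mu)
d2_manual <- rowSums((centered %*% solve(s)) * centered)

# Same quantity from the base function
d2_base <- mahalanobis(x, center = mu, cov = s)

# Both approaches should agree up to floating-point tolerance
all.equal(unname(d2_manual), unname(d2_base))
```

Because the covariance matrix enters through its inverse, a point that is far from the center along a direction of low variance gets a larger distance than one equally far along a direction of high variance.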
To detect outliers, the calculated Mahalanobis distance is compared against a chi-square (X^2) distribution with degrees of freedom equal to the number of dependent (outcome) variables and an alpha level of 0.001.
The threshold to declare a multivariate outlier is determined using the function qchisq(0.999, df), where df is the degrees of freedom (i.e., the number of dependent variables used in the computation).
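The thresholding rule can be sketched with base R alone. This is a minimal approximation of the flagging step described here, not the package's actual implementation. It assumes the center and covariance are estimated from the same data, and the names (x, df, threshold, d2, is_outlier) are illustrative.

```r
# Use the four numeric columns of iris as the variables of interest
x <- as.matrix(iris[, 1:4])

# Degrees of freedom = number of variables used in the computation
df <- ncol(x)

# Chi-square cutoff at alpha = 0.001 (about 18.47 for df = 4)
threshold <- qchisq(0.999, df)

# Squared Mahalanobis distance of every row from the column means
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Flag rows whose squared distance exceeds the cutoff
is_outlier <- d2 > threshold
table(is_outlier)
```

Note that the cutoff is compared against the squared distance, which is what mahalanobis() returns.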
mahalanobis_distance(data, ...)
data |
a data frame. Columns are variables. |
... |
One or more unquoted expressions (or variable names). Used to select the
variables of interest. Can also be used to ignore variables that are not
needed for the computation. For example specify |
Returns the input data frame with two additional columns: 1) "mahal.dist": Mahalanobis distance values; and 2) "is.outlier": logical values specifying whether a given observation is a multivariate outlier.
library(rstatix)
library(dplyr)

# Compute Mahalanobis distance and flag outliers, if any
iris %>%
  doo(~mahalanobis_distance(.))

# Compute distance by groups and keep only the flagged outliers
iris %>%
  group_by(Species) %>%
  doo(~mahalanobis_distance(.)) %>%
  filter(is.outlier == TRUE)