View source: R/sociophonetics.R
tidy_mahalanobis | R Documentation |
This is a tidyverse-compatible version of the stats::mahalanobis function. It just makes it easier to include it as part of a dplyr::mutate call.
tidy_mahalanobis(...)
... | Names of columns that should be included in the Mahalanobis distance. For vowel data, this is typically your F1 and F2 columns. |
Typically you'll want to group your data (using dplyr::group_by) by speaker and vowel class so that you get the distance from vowel centroids.
I won't tell you what to do with those distances, but you might consider looking at tokens where the square root of the Mahalanobis distance is greater than around 2. To be clear, though, the exact cutoff will vary depending on the size and variability of your data. You can see how you might isolate these points visually in the example code below.
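To make the cutoff idea concrete, here is a minimal base R sketch for a single group. It calls stats::mahalanobis directly, which returns squared distances from the centroid. The simulated formant values and the cutoff of 2 are assumptions chosen purely for illustration, not recommendations.

```r
# Simulated F1/F2 values for one speaker and one vowel class (made-up numbers)
set.seed(1)
formants <- cbind(F1 = rnorm(50, mean = 500, sd = 40),
                  F2 = rnorm(50, mean = 1500, sd = 120))

# stats::mahalanobis() returns squared distances from the given center
d2 <- stats::mahalanobis(formants,
                         center = colMeans(formants),
                         cov    = stats::cov(formants))

# Flag tokens whose square-rooted distance exceeds an illustrative cutoff
cutoff <- 2
is_candidate <- sqrt(d2) > cutoff

# Count how many tokens fall on each side of the cutoff
table(is_candidate)
```

Raising or lowering `cutoff` changes how many tokens are flagged, so it is worth inspecting the flagged points before discarding anything.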
One small modification that this function makes, and that stats::mahalanobis does not, is that if there are fewer than 5 measurements in a group, tidy_mahalanobis returns them all as having a distance of zero. I found that this prevents some fatal errors from crashing the script when running this function on smaller datasets.
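The small-group behavior described above could be sketched roughly as follows. This is not the package's actual code: `guarded_mahalanobis` and its `min_n` argument are hypothetical names used only to illustrate returning zeros when a group is too small to estimate a covariance matrix.

```r
# Hypothetical helper, NOT the implementation of tidy_mahalanobis
guarded_mahalanobis <- function(x, min_n = 5) {
  x <- as.matrix(x)

  # Too few rows in this group: return a distance of zero for every row
  if (nrow(x) < min_n) {
    return(rep(0, nrow(x)))
  }

  # Otherwise, squared distance of each row from the group centroid
  stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))
}

# A group with only three tokens (made-up values) gets all zeros
small_group <- cbind(F1 = c(510, 495, 502), F2 = c(1480, 1530, 1510))
guarded_mahalanobis(small_group)
```

One consequence of this design is that a distance of zero in a small group does not mean the token sits on the centroid, so it can be worth checking group sizes alongside the distances.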
Note that this function requires the MASS package to be installed to work, but you don't need to load it.
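If you are unsure whether MASS is available, a quick check along these lines may help. It relies only on base R's requireNamespace. The call to MASS::ginv is an arbitrary illustration of a namespace-qualified call that works without library(MASS); which MASS function tidy_mahalanobis relies on is not stated here.

```r
# TRUE if MASS is installed; this does not attach the package
mass_available <- requireNamespace("MASS", quietly = TRUE)

if (mass_available) {
  # Namespace-qualified call, so library(MASS) is not needed
  MASS::ginv(diag(2))
} else {
  message('MASS is not installed; try install.packages("MASS")')
}
```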
A vector that contains the Mahalanobis distances for each observation.
suppressPackageStartupMessages(library(tidyverse))
df <- joeysvowels::midpoints
# Calculate the distances
m_dists <- df %>%
  group_by(vowel) %>%
  mutate(mahal_dist = tidy_mahalanobis(F1, F2))
# Take a peek at the resulting dataset
m_dists %>%
  select(vowel, F1, F2, mahal_dist) %>%
  head()
# Plot potential outliers
ggplot(m_dists, aes(F2, F1, color = sqrt(mahal_dist) > 2)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
# You can include whatever numeric variables you want, like duration
df %>%
  group_by(vowel) %>%
  mutate(dur = end - start) %>%
  mutate(mahal_dist = tidy_mahalanobis(F1, F2, dur)) %>%
  ggplot(aes(F2, F1, color = sqrt(mahal_dist) > 2.5)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
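# Data cannot contain NAs. Remove them before running.
# Introduce an NA in the first row to demonstrate
df[1,]$F1 <- NA
# With the NA still present, this call is not expected to work
df %>%
  group_by(vowel) %>%
  mutate(mahal_dist = tidy_mahalanobis(F1, F2))
# Filter out the rows with missing F1 first, then calculate the distances
df %>%
  group_by(vowel) %>%
  filter(!is.na(F1)) %>%
  mutate(mahal_dist = tidy_mahalanobis(F1, F2)) %>%
  select(vowel_id, vowel, mahal_dist, F1, F2) %>%
  head()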