View source: R/find_outliers.R
| find_outliers | R Documentation |
Identifies groups (e.g., studies) that are most distant from the average group based on energy distance across multiple variables.
find_outliers(formula, data, cutoff = 0.99, R = 500, plot = TRUE)
| formula | A formula specifying the group variable and the variables to compare, e.g., 'study ~ var1 + var2 + ...'. The group variable should be a factor; otherwise it is converted to one. |
| data | A data frame containing the variables specified in the formula. |
| cutoff | Numeric. Quantile, between 0 and 1, used for the permutation-based cutoff (default 0.99, i.e., the 99th percentile). The cutoff is determined by permuting group labels and taking this quantile of the permuted median distances. |
| R | Integer. Number of permutations used to determine the cutoff (default 500). |
| plot | Logical. If TRUE (default), the result also includes visualizations of the outlier analysis. |
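The permutation-based cutoff described for the 'cutoff' and 'R' arguments can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper 'median_dists', the toy data, and the exact form of the energy distance (the plain V-statistic, 2 * E|X - Y| - E|X - X'| - E|Y - Y'|) are all hypothetical choices made for this example.

```r
# Hypothetical helper (not part of the package): median energy distance
# from each group to the other groups, using the plain V-statistic form.
median_dists <- function(x, g) {
  d <- as.matrix(dist(x))                 # Euclidean distances, all pairs
  lev <- levels(g)
  k <- length(lev)
  D <- matrix(NA_real_, k, k, dimnames = list(lev, lev))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      a <- g == lev[i]
      b <- g == lev[j]
      D[i, j] <- D[j, i] <- 2 * mean(d[a, b]) - mean(d[a, a]) - mean(d[b, b])
    }
  }
  apply(D, 1, median, na.rm = TRUE)       # diagonal is NA and is ignored
}

# Toy data (assumption): five groups, the last one shifted on the first variable
set.seed(42)
g <- factor(rep(paste0("S", 1:5), each = 20))
x <- scale(cbind(
  rnorm(100, mean = rep(c(0, 0, 0, 0, 3), each = 20)),
  rnorm(100)
))

observed <- median_dists(x, g)

# Permute the group labels R times and pool the permuted median distances
R <- 200
cutoff <- 0.99
perm <- replicate(R, median_dists(x, sample(g)))
cutoff_value <- unname(quantile(perm, probs = cutoff))

# Groups whose observed median distance exceeds the cutoff are flagged
is_outlier <- observed > cutoff_value
```

A larger 'R' gives a more stable estimate of the upper quantile at the cost of run time, since every permutation recomputes all pairwise group distances.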
Groups with high median distance to other groups are identified as potential outliers. The outlier_score is a z-score that indicates how many standard deviations a group's median distance is from the overall median distance.
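As a rough illustration of the outlier score, the snippet below centers made-up median distances on their overall median and divides by their standard deviation. The values are toy numbers, and the choice of 'sd()' as the scale estimate is an assumption based on the description above; the package may compute the score differently.

```r
# Toy median distances for four groups (made-up values for illustration)
median_distance <- c(A = 0.10, B = 0.12, C = 0.11, D = 0.95)

# Assumed form of the score: distance from the overall median,
# expressed in standard deviations of the median distances
outlier_score <- (median_distance - median(median_distance)) / sd(median_distance)

# Groups ordered from most to least distant
sort(outlier_score, decreasing = TRUE)
```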
Before distance calculation, all covariates are scaled to mean 0 and standard deviation 1.
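The scaling step and the median distance to other groups can be sketched in base R as shown here. The helper 'energy_dist' and the toy data are hypothetical, and the plain V-statistic form of the energy distance is an assumption; the package may use a differently scaled statistic.

```r
# Hypothetical helper: energy distance between two numeric matrices
# (rows = observations), as 2 * E|X - Y| - E|X - X'| - E|Y - Y'|.
energy_dist <- function(x, y) {
  n <- nrow(x)
  d <- as.matrix(dist(rbind(x, y)))       # Euclidean distances, all pairs
  ix <- seq_len(n)
  iy <- n + seq_len(nrow(y))
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

# Toy data (assumption): group C is shifted on var1
set.seed(1)
toy <- data.frame(
  study = factor(rep(c("A", "B", "C"), each = 30)),
  var1  = c(rnorm(30, 0), rnorm(30, 0), rnorm(30, 4)),
  var2  = rnorm(90)
)

# Scale every covariate to mean 0 and standard deviation 1
x <- scale(as.matrix(toy[, c("var1", "var2")]))

# Pairwise energy distances between groups
groups <- levels(toy$study)
D <- matrix(0, length(groups), length(groups), dimnames = list(groups, groups))
for (i in seq_along(groups)) {
  for (j in seq_along(groups)) {
    if (i < j) {
      D[i, j] <- D[j, i] <- energy_dist(
        x[toy$study == groups[i], , drop = FALSE],
        x[toy$study == groups[j], , drop = FALSE]
      )
    }
  }
}

# Median distance from each group to the other groups (diagonal excluded)
diag(D) <- NA
median_distance <- apply(D, 1, median, na.rm = TRUE)
```

Scaling first keeps a variable with a large numeric range from dominating the Euclidean distances that the energy distance is built on.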
If 'plot = TRUE', returns a list with:
'cutoff_value': The permutation-based cutoff value used for outlier detection.
'summary': Data frame with group, median_distance, outlier_score, and is_outlier columns.
'heatmap': A ggplot2 heatmap of pairwise energy distances.
'barplot': A ggplot2 bar plot showing median distance to other groups.
If 'plot = FALSE', returns a list with only the 'cutoff_value' and 'summary' elements (no plots).
# Example 1: 10 studies with real outliers (Study-8, Study-9, Study-10)
set.seed(123)
dat <- data.frame(
  study = factor(rep(paste0("Study-", 1:10), each = 20)),
  var1 = c(rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1),
           rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 15, 1),
           rnorm(20, 10, 1), rnorm(20, 16, 1)),
  var2 = c(rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1),
           rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1),
           rnorm(20, 10, 1), rnorm(20, 5, 1))
)
out <- find_outliers(study ~ var1 + var2, data = dat, R = 200)
out$summary # Study-8, Study-9, Study-10 should be flagged
out$cutoff_value # Permutation-based threshold
# Example 2: 20 studies with NO real outliers (all from same distribution)
set.seed(456)
dat_no_outliers <- data.frame(
  study = factor(rep(paste0("Study-", 1:20), each = 15)),
  var1 = rnorm(300, 10, 2),
  var2 = rnorm(300, 5, 1)
)
out2 <- find_outliers(study ~ var1 + var2, data = dat_no_outliers, R = 200)
out2$summary # Should have few or no outliers flagged
sum(out2$summary$is_outlier) # Count of flagged outliers (expected: 0 or very few)