find_outliers: Find Outlier Groups Based on Energy Distance
In eVCGsampler: VCG Sampling using Energy-Based Covariate Balancing

find_outliers

R Documentation

Description

Identifies groups (e.g., studies) that are most distant from the average group based on energy distance across multiple variables.

Usage

find_outliers(formula, data, cutoff = 0.99, R = 500, plot = TRUE)

Arguments

`formula`	A formula with the group variable on the left-hand side and the covariates on the right-hand side, e.g., 'study ~ var1 + var2 + ...'. The group variable should be a factor; otherwise it is converted to one.
`data`	A data frame containing the variables specified in the formula.
`cutoff`	Numeric. Percentile threshold for the permutation-based cutoff (default 0.99). The cutoff is determined by permuting group labels and calculating the percentile of permuted median distances.
`R`	Integer. Number of permutations for determining the cutoff (default 500).
`plot`	Logical. If TRUE (default), the returned list also includes the heatmap and bar plot that visualize the outlier analysis.
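
The roles of 'cutoff' and 'R' can be illustrated with a minimal base-R sketch of a permutation-based cutoff. This is not the package's implementation: the helper functions energy_dist() and median_dist(), the toy data, and the choice to pool the permuted median distances across all groups before taking the percentile are assumptions made for illustration only.

```r
# Energy distance between two numeric matrices (V-statistic form).
energy_dist <- function(x, y) {
  d  <- as.matrix(dist(rbind(x, y)))            # Euclidean distances, pooled rows
  ix <- seq_len(nrow(x))
  iy <- nrow(x) + seq_len(nrow(y))
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

# Median energy distance from each group to every other group.
median_dist <- function(z, g) {
  lev <- levels(g)
  sapply(lev, function(a) {
    others <- setdiff(lev, a)
    median(sapply(others, function(b)
      energy_dist(z[g == a, , drop = FALSE], z[g == b, , drop = FALSE])))
  })
}

# Toy data: five groups, the last one shifted in var1.
set.seed(42)
g <- factor(rep(paste0("Study-", 1:5), each = 20))
z <- scale(cbind(var1 = c(rnorm(80, 10, 1), rnorm(20, 15, 1)),
                 var2 = rnorm(100, 5, 1)))

observed <- median_dist(z, g)

# Null distribution: shuffle group labels R times and recompute the medians.
R <- 100
permuted <- replicate(R, median_dist(z, sample(g)))

# Assumed rule: the cutoff is the 99th percentile of all permuted medians.
cutoff_value <- quantile(as.vector(permuted), probs = 0.99)

observed > cutoff_value   # TRUE marks groups above the cutoff
```

A larger 'R' gives a more stable estimate of the percentile at the cost of run time, and a lower 'cutoff' flags groups more readily.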

Details

Groups with high median distance to other groups are identified as potential outliers. The outlier_score is a z-score that indicates how many standard deviations a group's median distance is from the overall median distance.
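
As a rough illustration of the score described above, the following sketch derives median distances and a z-score from a small matrix of pairwise distances. The distance values are hypothetical, and the scoring rule (centering at the overall median and scaling by the standard deviation of the median distances) is an assumed reading of the description, not code taken from the package.

```r
# Hypothetical pairwise distances between four groups (symmetric, zero diagonal).
D <- matrix(c(0.0, 0.1, 0.2, 2.0,
              0.1, 0.0, 0.1, 2.1,
              0.2, 0.1, 0.0, 1.9,
              2.0, 2.1, 1.9, 0.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("Study-", 1:4), paste0("Study-", 1:4)))

diag(D) <- NA                                   # ignore a group's distance to itself
median_distance <- apply(D, 1, median, na.rm = TRUE)

# Assumed scoring rule: center at the overall median, scale by the standard deviation.
outlier_score <- (median_distance - median(median_distance)) / sd(median_distance)

data.frame(group = names(median_distance), median_distance, outlier_score)
```

In this toy matrix, Study-4 is far from every other group, so it receives the largest score.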

Before distance calculation, all covariates are scaled to mean 0 and standard deviation 1.
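
The effect of scaling followed by a distance calculation can be sketched in base R. The helper energy_dist() below is a hypothetical function written for this illustration using the standard energy distance formula 2E|X - Y| - E|X - X'| - E|Y - Y'|; the package may compute the distance differently.

```r
# Energy distance between two numeric matrices (V-statistic form).
energy_dist <- function(x, y) {
  x <- as.matrix(x); y <- as.matrix(y)
  d  <- as.matrix(dist(rbind(x, y)))            # Euclidean distances, pooled rows
  ix <- seq_len(nrow(x))
  iy <- nrow(x) + seq_len(nrow(y))
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

set.seed(1)
toy <- data.frame(
  study = factor(rep(c("A", "B", "C"), each = 30)),
  var1  = c(rnorm(30, 10, 1), rnorm(30, 10, 1), rnorm(30, 15, 1)),
  var2  = rnorm(90, 5, 1)
)

# Standardize each covariate to mean 0 and standard deviation 1.
z <- scale(toy[, c("var1", "var2")])

d_AB <- energy_dist(z[toy$study == "A", ], z[toy$study == "B", ])  # similar groups
d_AC <- energy_dist(z[toy$study == "A", ], z[toy$study == "C", ])  # C is shifted in var1
c(A_vs_B = d_AB, A_vs_C = d_AC)
```

Scaling keeps a covariate measured in large units from dominating the Euclidean distances that enter the energy distance.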

Value

If 'plot = TRUE', returns a list with:

'cutoff_value': The permutation-based cutoff value used for outlier detection.
'summary': Data frame with group, median_distance, outlier_score, and is_outlier columns.
'heatmap': A ggplot2 heatmap of pairwise energy distances.
'barplot': A ggplot2 bar plot showing median distance to other groups.

If 'plot = FALSE', returns the same list without the plot elements, i.e., only 'cutoff_value' and 'summary'.

Examples


# Example 1: 10 studies with real outliers (Study-8, Study-9, Study-10)
set.seed(123)
dat <- data.frame(
  study = factor(rep(paste0("Study-", 1:10), each = 20)),
  var1 = c(rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1),
           rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 15, 1),
           rnorm(20, 10, 1), rnorm(20, 16, 1)),
  var2 = c(rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1),
           rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1),
           rnorm(20, 10, 1), rnorm(20, 5, 1))
)
out <- find_outliers(study ~ var1 + var2, data = dat, R = 200)
out$summary      # Study-8, Study-9, Study-10 should be flagged
out$cutoff_value # Permutation-based threshold

# Example 2: 20 studies with NO real outliers (all from same distribution)
set.seed(456)
dat_no_outliers <- data.frame(
  study = factor(rep(paste0("Study-", 1:20), each = 15)),
  var1 = rnorm(300, 10, 2),
  var2 = rnorm(300, 5, 1)
)
out2 <- find_outliers(study ~ var1 + var2, data = dat_no_outliers, R = 200)
out2$summary     # Should have few or no outliers flagged
sum(out2$summary$is_outlier)  # Count of flagged outliers (expected: 0 or very few)

eVCGsampler documentation built on March 10, 2026, 5:07 p.m.