find_outliers: Find Outlier Groups Based on Energy Distance

find_outliers    R Documentation

Description

Identifies groups (e.g., studies) that are most distant from the average group based on energy distance across multiple variables.

Usage

find_outliers(formula, data, cutoff = 0.99, R = 500, plot = TRUE)

Arguments

formula

A formula with the group variable on the left-hand side and the variables to compare on the right-hand side, e.g., 'study ~ var1 + var2 + ...'. The group variable should be a factor; otherwise it is converted to one.

data

A data frame containing the variables specified in the formula.

cutoff

Numeric. Quantile level for the permutation-based cutoff (default 0.99). The cutoff is determined by permuting the group labels 'R' times and taking this quantile of the permuted median distances.

R

Integer. Number of permutations for determining the cutoff (default 500).

plot

Logical. If TRUE (default), the returned list also contains the plots of the outlier analysis (heatmap and bar plot).

Details

Groups with high median distance to other groups are identified as potential outliers. The outlier_score is a z-score that indicates how many standard deviations a group's median distance is from the overall median distance.
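The following base-R sketch illustrates one possible reading of the paragraph above. It is not the package's internal code: 'energy_dist()' and 'group_median_distances()' are hypothetical helper names, the energy distance uses the plain V-statistic form with Euclidean norms, and the score is computed with 'sd()' as an assumed choice of spread.

```r
# Illustrative sketch only; NOT the eVCGsampler implementation.
# energy_dist() and group_median_distances() are hypothetical helpers.

# Energy distance between two numeric matrices (rows = observations):
# 2 * mean|x - y| - mean|x - x'| - mean|y - y'|, Euclidean norms.
energy_dist <- function(x, y) {
  d  <- as.matrix(dist(rbind(x, y)))  # all pairwise Euclidean distances
  ix <- seq_len(nrow(x))
  iy <- nrow(x) + seq_len(nrow(y))
  2 * mean(d[ix, iy]) - mean(d[ix, ix]) - mean(d[iy, iy])
}

# Median energy distance from each group to all other groups.
group_median_distances <- function(X, g) {
  X <- scale(as.matrix(X))            # standardize: mean 0, SD 1 per column
  g <- as.factor(g)
  parts <- lapply(levels(g), function(l) X[g == l, , drop = FALSE])
  k <- length(parts)
  D <- matrix(0, k, k, dimnames = list(levels(g), levels(g)))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      D[i, j] <- D[j, i] <- energy_dist(parts[[i]], parts[[j]])
    }
  }
  med <- vapply(seq_len(k), function(i) median(D[i, -i]), numeric(1))
  names(med) <- levels(g)
  med
}

# Toy data: four groups, with group "D" shifted on the first variable.
set.seed(1)
g <- rep(c("A", "B", "C", "D"), each = 30)
X <- cbind(v1 = rnorm(120, 10, 1), v2 = rnorm(120, 5, 1))
X[g == "D", "v1"] <- X[g == "D", "v1"] + 5

med   <- group_median_distances(X, g)
score <- (med - median(med)) / sd(med)  # assumed z-score-like definition
score
```

In this toy setup the shifted group "D" receives the largest score, because its distance to every other group is large while the remaining groups are each close to at least two others.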

Before distance calculation, all covariates are scaled to mean 0 and standard deviation 1.
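The permutation-based cutoff described under 'cutoff' and 'R' might look roughly like the sketch below. This is an assumption-laden illustration, not the package's code: 'median_dists()' and 'perm_cutoff()' are hypothetical names, and pooling the median distances of all groups across permutations before taking the quantile is only one plausible interpretation of the description.

```r
# Illustrative sketch only; NOT the eVCGsampler implementation.
# median_dists() and perm_cutoff() are hypothetical helpers.

# Median energy distance (V-statistic form) from each group to the others.
median_dists <- function(X, g) {
  d   <- as.matrix(dist(X))                 # pairwise Euclidean distances
  idx <- split(seq_len(nrow(X)), g)         # row indices per group
  k   <- length(idx)
  within <- vapply(idx, function(i) mean(d[i, i]), numeric(1))
  D <- matrix(0, k, k)
  for (a in seq_len(k)) {
    for (b in seq_len(k)) {
      if (a != b) {
        D[a, b] <- 2 * mean(d[idx[[a]], idx[[b]]]) - within[a] - within[b]
      }
    }
  }
  vapply(seq_len(k), function(a) median(D[a, -a]), numeric(1))
}

# Cutoff: quantile of median distances obtained under shuffled group labels.
perm_cutoff <- function(X, g, cutoff = 0.99, R = 500) {
  X <- scale(as.matrix(X))                  # standardize before distances
  perm <- replicate(R, median_dists(X, sample(g)))  # groups x R matrix
  unname(quantile(as.vector(perm), probs = cutoff))
}

# Toy data: six groups drawn from the same distribution.
set.seed(42)
g <- rep(paste0("G", 1:6), each = 20)
X <- cbind(rnorm(120), rnorm(120))

cut_val <- perm_cutoff(X, g, cutoff = 0.99, R = 100)
obs     <- median_dists(scale(X), g)
obs > cut_val   # groups that would be flagged under this sketch
```

Shuffling the labels breaks any real association between group and covariates, so the permuted median distances approximate what is expected when no group is unusual; a group is flagged only if its observed median distance exceeds that reference quantile.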

Value

If 'plot = TRUE', returns a list with:

  • 'cutoff_value': The permutation-based cutoff value used for outlier detection.

  • 'summary': Data frame with group, median_distance, outlier_score, and is_outlier columns.

  • 'heatmap': A ggplot2 heatmap of pairwise energy distances.

  • 'barplot': A ggplot2 bar plot showing median distance to other groups.

If 'plot = FALSE', returns the same list without the plot elements ('heatmap' and 'barplot').

Examples


# Example 1: 10 studies with real outliers (Study-8, Study-9, Study-10)
set.seed(123)
dat <- data.frame(
  study = factor(rep(paste0("Study-", 1:10), each = 20)),
  var1 = c(rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1),
           rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 10, 1), rnorm(20, 15, 1),
           rnorm(20, 10, 1), rnorm(20, 16, 1)),
  var2 = c(rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1),
           rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1), rnorm(20, 5, 1),
           rnorm(20, 10, 1), rnorm(20, 5, 1))
)
out <- find_outliers(study ~ var1 + var2, data = dat, R = 200)
out$summary      # Study-8, Study-9, Study-10 should be flagged
out$cutoff_value # Permutation-based threshold

# Example 2: 20 studies with NO real outliers (all from same distribution)
set.seed(456)
dat_no_outliers <- data.frame(
  study = factor(rep(paste0("Study-", 1:20), each = 15)),
  var1 = rnorm(300, 10, 2),
  var2 = rnorm(300, 5, 1)
)
out2 <- find_outliers(study ~ var1 + var2, data = dat_no_outliers, R = 200)
out2$summary     # Should have few or no outliers flagged
sum(out2$summary$is_outlier)  # Count of flagged outliers (expected: 0 or very few)


eVCGsampler documentation built on March 10, 2026, 5:07 p.m.