get_simpsons_paradox_d: Introduce Simpson's Paradox in Discrete Data

View source: R/get_simpsons_paradox_d.R

get_simpsons_paradox_d    R Documentation

Introduce Simpson's Paradox in Discrete Data

Description

This function uses simulated annealing to modify the contingency tables associated with the levels of a categorical variable so that the data exhibit, or more strongly exhibit, Simpson's Paradox. The paradox occurs when the trend in the aggregated data differs from, or reverses, the trend seen within each subgroup.
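As a minimal illustration of the paradox itself, independent of this package, the sketch below computes the log odds ratio within two strata and for the pooled table. It uses the widely cited kidney-stone treatment counts; the helper name log_or is an assumption introduced here and is not part of covalchemy.

```r
# Kidney-stone treatment outcomes, a standard illustration of Simpson's paradox.
# Rows: treatment (A, B); columns: outcome (success, failure).
small <- matrix(c(81, 6, 234, 36), ncol = 2, byrow = TRUE,
                dimnames = list(treatment = c("A", "B"),
                                outcome = c("success", "failure")))
large <- matrix(c(192, 71, 55, 25), ncol = 2, byrow = TRUE,
                dimnames = dimnames(small))

# Hypothetical helper: log odds ratio of a 2x2 table, log((a * d) / (b * c)).
log_or <- function(tab) log((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))

by_stratum <- c(small = log_or(small), large = log_or(large))
pooled <- log_or(small + large)

by_stratum  # both positive: treatment A does better within each stone size
pooled      # negative: treatment B looks better once the strata are pooled
```

The sign flip between by_stratum and pooled is the pattern that manual_vec and target_overall describe as targets.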

Usage

get_simpsons_paradox_d(
  x,
  y,
  z,
  manual_vec,
  target_overall,
  margin,
  margin_overall,
  max_n = 1000,
  temp = 10,
  log_odds_general = log_odds_dc
)

Arguments

x

A vector of categorical values for the first variable.

y

A vector of categorical values for the second variable.

z

A vector indicating levels of a third variable that segments the data.

manual_vec

A numeric vector specifying target log-odds trends for each level of z.

target_overall

A numeric value representing the target log-odds for the aggregated data.

margin

A numeric value for allowed deviation in log-odds within each subgroup.

margin_overall

A numeric value for allowed deviation in aggregated log-odds.

max_n

An integer specifying the maximum number of iterations for the annealing process.

temp

A numeric value for the initial temperature in the annealing process.

log_odds_general

A function to compute the log-odds for a given contingency table (default: log_odds_dc).
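For illustration only, a function supplied as log_odds_general might look like the sketch below. It is a hypothetical stand-in for 2x2 tables: the name log_odds_2x2, the 0.5 continuity correction, and the assumption that the function receives one table and returns one number are not taken from the package, and the default log_odds_dc may be defined differently.

```r
# Hypothetical helper (not part of covalchemy): log odds ratio of a 2x2 table.
# Adding 0.5 to every cell (Haldane-Anscombe correction) keeps the result
# finite when a cell count is zero.
log_odds_2x2 <- function(tab) {
  tab <- as.matrix(tab)
  stopifnot(identical(dim(tab), c(2L, 2L)))
  tab <- tab + 0.5
  log((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))
}

# First table from the Examples section below
lo_first <- log_odds_2x2(matrix(c(512, 89, 313, 19), ncol = 2, byrow = TRUE))
# A table with an empty cell still gives a finite value
lo_zero <- log_odds_2x2(matrix(c(10, 0, 5, 5), ncol = 2, byrow = TRUE))
```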

Details

This function iteratively modifies the individual contingency tables (one matrix per level of z), keeping the log-odds of each table within margin of its target in manual_vec. At the same time, the log-odds of the aggregated table are steered toward target_overall, within margin_overall. Simulated annealing is used to balance reaching these targets against overfitting.
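The general idea of annealing a contingency table toward a log-odds target can be sketched as follows. This is a toy loop for a single 2x2 table, not the package's implementation: the helper names (log_or, pick_one, anneal_table), the move-one-observation proposal, the geometric cooling schedule, and the best-so-far bookkeeping are all assumptions made for illustration. Only the argument names margin, max_n, and temp echo the usage above.

```r
# Illustrative only: a toy annealing loop for one 2x2 table, not covalchemy's code.
log_or <- function(tab) {
  tab <- tab + 0.5  # continuity correction keeps the log finite
  log((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))
}

# Draw one element of v; unlike sample(v, 1), this is safe when length(v) == 1.
pick_one <- function(v) v[sample.int(length(v), 1)]

anneal_table <- function(tab, target, margin = 0.2, max_n = 1000, temp = 10) {
  cost <- abs(log_or(tab) - target)
  best <- tab
  best_cost <- cost
  for (i in seq_len(max_n)) {
    if (best_cost <= margin) break
    # Proposal: move one observation from a non-empty cell to a different cell,
    # so the total sample size never changes.
    from <- pick_one(which(tab > 0))
    to <- pick_one(setdiff(seq_along(tab), from))
    cand <- tab
    cand[from] <- cand[from] - 1
    cand[to] <- cand[to] + 1
    cand_cost <- abs(log_or(cand) - target)
    # Metropolis rule: always accept moves that do not increase the cost;
    # accept worse moves with a probability that shrinks as the system cools.
    t_i <- temp * 0.99^i
    if (cand_cost <= cost || runif(1) < exp((cost - cand_cost) / t_i)) {
      tab <- cand
      cost <- cand_cost
      if (cost < best_cost) {
        best <- tab
        best_cost <- cost
      }
    }
  }
  best  # closest table seen during the search
}

set.seed(42)
start <- matrix(c(30, 10, 20, 15), ncol = 2, byrow = TRUE)
moved <- anneal_table(start, target = -1)
c(before = log_or(start), after = log_or(moved))
```

Because the sketch returns the best table visited, the result is never further from the target than the starting table, even when the search stops at max_n before reaching the margin.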

Value

A list containing:

  • final_df: A data frame representing the modified dataset.

  • final_table: A list of modified contingency tables.

  • history: A data frame tracking the overall log-odds over iterations.

Examples

# Example with predefined contingency tables, one per level of z
set.seed(42)
matrices <- list(
  ta = matrix(c(512, 89, 313, 19), ncol = 2, byrow = TRUE),
  tb = matrix(c(353, 17, 207, 8), ncol = 2, byrow = TRUE),
  tc = matrix(c(120, 202, 205, 391), ncol = 2, byrow = TRUE)
)
# Convert each table to long format (x, y, Freq) and tag it with its z level
df_list <- lapply(seq_along(matrices), function(i) {
  df <- as.data.frame(as.table(matrices[[i]]))
  colnames(df) <- c("x", "y", "Freq")
  df$z <- names(matrices)[i]
  df
})
final_df <- do.call(rbind, df_list)
# Expand the counts to one row per observation
expanded_df <- final_df[rep(seq_len(nrow(final_df)), final_df$Freq), c("x", "y", "z")]
# Target log-odds of -1 within every level of z, but +1 in the aggregated data
result <- get_simpsons_paradox_d(
  expanded_df$x, expanded_df$y, expanded_df$z,
  manual_vec = c(-1, -1, -1),
  target_overall = +1,
  margin = 0.2, margin_overall = 0.2, max_n = 200
)
# Compare the marginal counts of x before and after the modification
table(expanded_df$x) - table(result$final_df$x)


covalchemy documentation built on April 12, 2025, 2:15 a.m.