View source: R/get_simpsons_paradox_d.R
get_simpsons_paradox_d | R Documentation |
This function modifies contingency tables associated with different levels of a categorical variable to create or highlight Simpson's Paradox using simulated annealing. The paradox occurs when aggregated data trends differ from subgroup trends.
get_simpsons_paradox_d(
x,
y,
z,
manual_vec,
target_overall,
margin,
margin_overall,
max_n = 1000,
temp = 10,
log_odds_general = log_odds_dc
)
x |
A vector of categorical values for the first variable. |
y |
A vector of categorical values for the second variable. |
z |
A vector indicating levels of a third variable that segments the data. |
manual_vec |
A numeric vector specifying target log-odds trends for each level of |
target_overall |
A numeric value representing the target log-odds for the aggregated data. |
margin |
A numeric value for allowed deviation in log-odds within each subgroup. |
margin_overall |
A numeric value for allowed deviation in aggregated log-odds. |
max_n |
An integer specifying the maximum number of iterations for the annealing process. |
temp |
A numeric value for the initial temperature in the annealing process. |
log_odds_general |
A function to compute the log-odds for a given contingency table (default: |
This function works by iteratively modifying individual matrices (contingency tables) corresponding
to levels of z
while respecting log-odds constraints. The overall log-odds of the aggregated table
are also adjusted to achieve the specified target_overall
. Simulated annealing ensures that the
modifications balance between achieving the targets and avoiding overfitting.
A list containing:
final_df
: A data frame representing the modified dataset.
final_table
: A list of modified contingency tables.
history
: A data frame tracking the overall log-odds over iterations.
# Example with predefined contingency tables
set.seed(42)
matrices <- list(
ta = matrix(c(512, 89, 313, 19), ncol = 2, byrow = TRUE),
tb = matrix(c(353, 17, 207, 8), ncol = 2, byrow = TRUE),
tc = matrix(c(120, 202, 205, 391), ncol = 2, byrow = TRUE)
)
df_list <- lapply(seq_along(matrices), function(i) {
mat <- matrices[[i]]
z_level <- names(matrices)[i]
df <- as.data.frame(as.table(mat))
colnames(df) <- c("x", "y", "Freq")
df$z <- z_level
return(df)
})
final_df <- do.call(rbind, df_list)
expanded_df <- final_df[rep(1:nrow(final_df), final_df$Freq), c("x", "y", "z")]
result <- get_simpsons_paradox_d(
expanded_df$x, expanded_df$y, expanded_df$z,
manual_vec = c(-1, -1, -1),
target_overall = +1,
margin = 0.2, margin_overall = 0.2, max_n = 200
)
table(expanded_df$x) - table(result$final_df$x)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.