The fastei
library implements the Expectation-Maximization algorithm to estimate probabilities in non-parametric Ecological Inference for the RxC case. It offers both an exact method and four approximation methods suitable for large datasets. The application demonstrated here is based on an electoral context where, for each ballot box, we know the number of voters in each demographic group and the number of votes received by each candidate.
We use data from the First Round of the Chilean Presidential Election of 2021, where at each ballot box we have the voters of eight age ranges (groups), and candidate votes obtained. The function get_eim_chile()
loads this data at either the country, region, or electoral district level.
library(fastei) eim_apo <- get_eim_chile(elect_district = "APOQUINDO") eim_apo
As shown in the output, it returns an eim object that contains two matrices: the number of votes per candidate and the number of votes per group. The rows of both matrices correspond to the specific ballot-box, and the columns are candidates and groups, respectively. This eim object is used as input to run the EM algorithm that estimates the voting probabilities. In this example, it uses the default method mult
, which is the most efficient in terms of runtime.
Running the algorithm is done by calling run_em()
.
eim_apo <- run_em(eim_apo) round(eim_apo$prob, 4)
Note that each row corresponds to the probability that a demographic group (g
) voted for a candidate (c
). It is worth noting how the estimated probabilities differ substantially across groups.
We can compute the standard deviation of the estimated probabilities using bootstrapping. This can be done with the function bootstrap()
.
eim_apo <- bootstrap(eim_apo, seed = 42, nboot = 30) round(eim_apo$sd, 4)
The standard deviations obtained in the district "Apoquindo" are low in general. One reason for this is the high number of ballot boxes in this district. In contrast, standard deviations of estimated probabilities in districts with fewer ballot boxes, such as "Navidad", are larger.
eim_nav <- get_eim_chile(elect_district = "NAVIDAD") eim_nav <- bootstrap(eim_nav, seed = 42, nboot = 30) round(eim_nav$sd, 4)
It is possible to see the difference in a visual way by plotting the standard deviations of the estimated probabilities across groups.
library(ggplot2) library(reshape2) library(viridis) plot_district <- function(matrix1, district1, matrix2, district2, sd = FALSE) { value <- ifelse(sd == FALSE, "prob", "sd") df1 <- melt(matrix1) df2 <- melt(matrix2) df1$Matrix <- district1 df2$Matrix <- district2 combined_df <- rbind(df1, df2) color <- ifelse(value == "prob", "plasma", "viridis") # Add text to each cell of the matrix combined_df$label <- sprintf("%.3f", combined_df$value) combined_df$text_color <- ifelse(combined_df$value > round(max(combined_df$value) * 0.75 + min(combined_df$value) * 0.25, 2), "black", "white") districts <- sort(c(district1, district2)) # Call the plot ggplot(combined_df, aes(x = Var2, y = Var1, fill = value)) + geom_tile() + geom_text(aes(label = label, color = text_color), size = 2.5) + scale_fill_viridis( name = value, option = color, ) + scale_color_identity() + facet_wrap(~Matrix) + coord_fixed() + theme_bw() + labs( title = ifelse(value == "prob", paste("Estimated probabilities in districts:", districts[1], "and", districts[2]), paste("Standard deviation of estimated probabilities in districts:", districts[1], "and", districts[2]) ), x = "Candidates' votes", y = "Voters' age range", fill = value ) }
plot_district( matrix1 = eim_nav$sd, district1 = "Navidad", matrix2 = eim_apo$sd, district2 = "Apoquindo", sd = TRUE )
Demographic groups can be merged to have probability estimates with lower error. A greedy strategy that maximizes the variability of the distribution of groups across ballot boxes is used, which ensures standard deviations are below a specific threshold. The package provides the following function for the latter:
eim_nav_proxy <- get_agg_proxy(eim_nav, seed = 6, sd_threshold = 0.03, sd_statistic = "maximum") eim_nav_proxy$group_agg
As shown in the output, the heuristic finds a group aggregation that merges the first and last four age ranges, namely, macro-groups with voters aged 18-49, and older than 50. We can evaluate the effectiveness of this grouping by comparing the mean standard deviation to the original formulation:
mean(eim_nav$sd) - mean(eim_nav_proxy$sd)
It is possible to see the aggregated standard deviations visually:
plot_matrix <- function(mat, sd = FALSE, y_labels = NULL) { # Initial configurations if (!sd) mat <- t(mat) df <- reshape2::melt(mat) colnames(df) <- c("Row", "Column", "Value") df$Row <- factor(df$Row, levels = rev(sort(unique(df$Row)))) df$Column <- factor(df$Column, levels = sort(unique(df$Column))) if (!sd) { df$Label <- sprintf("%d", df$Value) title_text <- "Voters distribution" x_lab <- "Ballot Box" y_lab <- "Dem. Group" fill_lab <- "Voters" df$text_color <- ifelse(df$Value > 30, "black", "white") option <- "inferno" start <- 0.5 limits <- NULL } else { df$Label <- sprintf("%.3f", df$Value) title_text <- "Standard deviation of estimated probabilities on district: Navidad" x_lab <- "Candidates' votes" y_lab <- "Voters' age range" fill_lab <- "sd" df$text_color <- ifelse(df$Value > 0.13, "black", "white") option <- "viridis" start <- 0 limits <- c(0, 0.1) } # Plot p <- ggplot(df, aes(x = Column, y = Row, fill = Value)) + geom_tile() + geom_text(aes(label = Label, color = text_color), size = 3) + scale_color_identity() + scale_fill_viridis_c(option = option, begin = start, limits = limits) + coord_fixed() + theme_bw() + theme(axis.text.y = element_text(size = 7), axis.text.x = element_text(size = 7)) + labs( title = title_text, x = x_lab, y = y_lab, fill = fill_lab ) # Add custom y-axis labels if provided if (!is.null(y_labels)) { p <- p + scale_y_discrete(labels = y_labels) } p }
We can visualize the standard deviations of the estimated probabilities.
plot_matrix(eim_nav_proxy$sd, sd = TRUE, y_labels = c("X18.49", "X50."))
An exhaustive algorithm is also included in the package. It explores all combinations of adjacent groups in order to maximize the log-likelihood subject to having standard deviations below a given threshold. It might require substantial computation time; therefore it is recommended to use the default method, "mult".
eim_nav_opt <- get_agg_opt(eim_nav, seed = 0, sd_threshold = 0.03, sd_statistic = "maximum") eim_nav_opt$group_agg
The optimal group aggregation differs slightly from one obtained before. In this case, the group aggregation consists in the three age range: 18-29, 30-49, and older than 70.
We can visualize the standard deviations of the estimated probabilities.
plot_matrix(eim_nav_opt$sd, sd = TRUE, y_labels = c("X18.29", "X30.49", "X50."))
A relevant question is how significantly different are the probability estimates of two sets of data, such as two different districts. For instance, we can compare the estimated probabilities of the district "LO BARNECHEA" and "LA GRANJA", whose voting tendencies are expected to differ.
eim_gra <- get_eim_chile("LA GRANJA") eim_gra <- run_em(eim_gra) eim_bar <- get_eim_chile("LO BARNECHEA") eim_bar <- run_em(eim_bar) plot_district(eim_gra$prob, "La Granja", eim_bar$prob, "Lo Barnechea")
A Wald test can be applied to each component of the two probability matrix estimates:
comparison2 <- waldtest( object1 = eim_gra, object2 = eim_bar, method = "mult", nboot = 30, seed = 42, ) round(comparison2$pvals, 3)
Conversely, one may examine the alternative scenario in which ballot boxes within a given region are partitioned into equally sized groups. In this configuration, both groups are expected to exhibit identical probability distributions.
eimRM <- get_eim_chile(region = "METROPOLITANA DE SANTIAGO") n <- nrow(eimRM$X) n_train <- floor(n * 0.5) train_indices <- sample(seq_len(n), n_train) eimRMhalf1 <- eim(X = eimRM$X[train_indices, ], W = eimRM$W[train_indices, ]) eimRMhalf2 <- eim(X = eimRM$X[-train_indices, ], W = eimRM$W[-train_indices, ]) eimRMhalf1 <- run_em(eimRMhalf1) eimRMhalf2 <- run_em(eimRMhalf2) plot_district(eimRMhalf1$prob, "Half 1", eimRMhalf2$prob, "Half 2")
comparison2 <- waldtest( object1 = eimRMhalf1, object2 = eimRMhalf2, nboot = 10, seed = 42 ) round(comparison2$pvals, 4)
It is possible to simulate an artificial election and get its values as an eim object with the simulated probability. This is useful for testing the package's performance and comparing the different methods available.
eim_sim <- simulate_election(num_ballots = 15, num_groups = 2, num_candidates = 3, seed = 42) eim_sim
The real voting probabilities of each group for each candidate can be obtained as follows:
eim_sim$real_prob
In this case, we can access these values since this is a simulation; however, this is not possible when working with real data.
The estimated voting probabilities are:
eim_sim <- run_em(eim_sim) eim_sim$prob
It is possible to provide a specific voting probability for each candidate for each group.
input_probability <- matrix(c(0.9, 0.05, 0.05, 0.2, 0.3, 0.5), nrow = 2, byrow = TRUE) input_probability
eim_sim2 <- simulate_election( num_ballots = 30, num_groups = 2, num_candidates = 3, seed = 42, prob = input_probability ) eim_sim2 eim_sim2$real_prob
There is a parameter, lambda, that controls for the heterogeneity of voters' groups across ballot boxes. A value of lambda = 0 indicates that voters are all randomly assigned in ballot boxes; in contrast to lambda = 1, in which case voters are assigned in ballot boxes according to their demographic group. This is explained in detail in the documentation in simulate_election()
.
eim_sim3 <- simulate_election( num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42, lambda = 0.1 )
plot_matrix(eim_sim3$W)
Having the following estimated probabilities:
run_em(eim_sim3)$prob
On the other side, it is possible to simulate an heterogeneous sample
eim_sim4 <- simulate_election( num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42, lambda = 0.9 )
plot_matrix(eim_sim4$W)
With the underlying probabilities
run_em(eim_sim4)$prob
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.