Demonstration of the package usage
In fastei: Methods for ''A Fast Alternative for the R x C Ecological Inference Case''

The fastei library implements the Expectation-Maximization algorithm to estimate probabilities in non-parametric Ecological Inference for the RxC case. It offers both an exact method and four approximation methods suitable for large datasets. The application demonstrated here is based on an electoral context where, for each ballot box, we know the number of voters in each demographic group and the number of votes received by each candidate.

Estimate Voting Probabilities

We use data from the First Round of the Chilean Presidential Election of 2021, where at each ballot box we have the voters of eight age ranges (groups), and candidate votes obtained. The function get_eim_chile() loads this data at either the country, region, or electoral district level.

library(fastei)

eim_apo <- get_eim_chile(elect_district = "APOQUINDO")

eim_apo

As shown in the output, it returns an eim object that contains two matrices: the number of votes per candidate and the number of votes per group. The rows of both matrices correspond to the specific ballot-box, and the columns are candidates and groups, respectively. This eim object is used as input to run the EM algorithm that estimates the voting probabilities. In this example, it uses the default method mult, which is the most efficient in terms of runtime. Running the algorithm is done by calling run_em().

eim_apo <- run_em(eim_apo)
round(eim_apo$prob, 4)

Note that each row corresponds to the probability that a demographic group (g) voted for a candidate (c). It is worth noting how the estimated probabilities differ substantially across groups.

Standard deviation estimates

We can compute the standard deviation of the estimated probabilities using bootstrapping. This can be done with the function bootstrap().

eim_apo <- bootstrap(eim_apo, seed = 42, nboot = 30)
round(eim_apo$sd, 4)

The standard deviations obtained in the district "Apoquindo" are low in general. One reason for this is the high number of ballot boxes in this district. In contrast, standard deviations of estimated probabilities in districts with fewer ballot boxes, such as "Navidad", are larger.

eim_nav <- get_eim_chile(elect_district = "NAVIDAD")
eim_nav <- bootstrap(eim_nav, seed = 42, nboot = 30)
round(eim_nav$sd, 4)

It is possible to see the difference in a visual way by plotting the standard deviations of the estimated probabilities across groups.

library(ggplot2)
library(reshape2)
library(viridis)

plot_district <- function(matrix1, district1, matrix2, district2, sd = FALSE) {
    value <- ifelse(sd == FALSE, "prob", "sd")
    df1 <- melt(matrix1)
    df2 <- melt(matrix2)
    df1$Matrix <- district1
    df2$Matrix <- district2
    combined_df <- rbind(df1, df2)
    color <- ifelse(value == "prob", "plasma", "viridis")

    # Add text to each cell of the matrix
    combined_df$label <- sprintf("%.3f", combined_df$value)
    combined_df$text_color <- ifelse(combined_df$value > round(max(combined_df$value) * 0.75 + min(combined_df$value) * 0.25, 2), "black", "white")
    districts <- sort(c(district1, district2))

    # Call the plot
    ggplot(combined_df, aes(x = Var2, y = Var1, fill = value)) +
        geom_tile() +
        geom_text(aes(label = label, color = text_color), size = 2.5) +
        scale_fill_viridis(
            name = value,
            option = color,
        ) +
        scale_color_identity() +
        facet_wrap(~Matrix) +
        coord_fixed() +
        theme_bw() +
        labs(
            title = ifelse(value == "prob",
                paste("Estimated probabilities in districts:", districts[1], "and", districts[2]),
                paste("Standard deviation of estimated probabilities in districts:", districts[1], "and", districts[2])
            ),
            x = "Candidates' votes", y = "Voters' age range", fill = value
        )
}

plot_district(
    matrix1 = eim_nav$sd, district1 = "Navidad",
    matrix2 = eim_apo$sd, district2 = "Apoquindo", sd = TRUE
)

Reduce Estimation Error using Group Aggregation

Demographic groups can be merged to have probability estimates with lower error. A greedy strategy that maximizes the variability of the distribution of groups across ballot boxes is used, which ensures standard deviations are below a specific threshold. The package provides the following function for the latter:

eim_nav_proxy <- get_agg_proxy(eim_nav, seed = 6, sd_threshold = 0.03, sd_statistic = "maximum")
eim_nav_proxy$group_agg

As shown in the output, the heuristic finds a group aggregation that merges the first and last four age ranges, namely, macro-groups with voters aged 18-49, and older than 50. We can evaluate the effectiveness of this grouping by comparing the mean standard deviation to the original formulation:

mean(eim_nav$sd) - mean(eim_nav_proxy$sd)

It is possible to see the aggregated standard deviations visually:

plot_matrix <- function(mat, sd = FALSE, y_labels = NULL) {
    # Initial configurations
    if (!sd) mat <- t(mat)
    df <- reshape2::melt(mat)
    colnames(df) <- c("Row", "Column", "Value")
    df$Row <- factor(df$Row, levels = rev(sort(unique(df$Row))))
    df$Column <- factor(df$Column, levels = sort(unique(df$Column)))
    if (!sd) {
        df$Label <- sprintf("%d", df$Value)
        title_text <- "Voters distribution"
        x_lab <- "Ballot Box"
        y_lab <- "Dem. Group"
        fill_lab <- "Voters"
        df$text_color <- ifelse(df$Value > 30, "black", "white")
        option <- "inferno"
        start <- 0.5
        limits <- NULL
    } else {
        df$Label <- sprintf("%.3f", df$Value)
        title_text <- "Standard deviation of estimated probabilities on district: Navidad"
        x_lab <- "Candidates' votes"
        y_lab <- "Voters' age range"
        fill_lab <- "sd"
        df$text_color <- ifelse(df$Value > 0.13, "black", "white")
        option <- "viridis"
        start <- 0
        limits <- c(0, 0.1)
    }

    # Plot
    p <- ggplot(df, aes(x = Column, y = Row, fill = Value)) +
        geom_tile() +
        geom_text(aes(label = Label, color = text_color), size = 3) +
        scale_color_identity() +
        scale_fill_viridis_c(option = option, begin = start, limits = limits) +
        coord_fixed() +
        theme_bw() +
        theme(axis.text.y = element_text(size = 7), axis.text.x = element_text(size = 7)) +
        labs(
            title = title_text,
            x = x_lab,
            y = y_lab,
            fill = fill_lab
        )
    # Add custom y-axis labels if provided
    if (!is.null(y_labels)) {
        p <- p + scale_y_discrete(labels = y_labels)
    }
    p
}

We can visualize the standard deviations of the estimated probabilities.

plot_matrix(eim_nav_proxy$sd, sd = TRUE, y_labels = c("X18.49", "X50."))

An exhaustive algorithm is also included in the package. It explores all combinations of adjacent groups in order to maximize the log-likelihood subject to having standard deviations below a given threshold. It might require substantial computation time; therefore it is recommended to use the default method, "mult".

eim_nav_opt <- get_agg_opt(eim_nav, seed = 0, sd_threshold = 0.03, sd_statistic = "maximum")
eim_nav_opt$group_agg

The optimal group aggregation differs slightly from one obtained before. In this case, the group aggregation consists in the three age range: 18-29, 30-49, and older than 70.

We can visualize the standard deviations of the estimated probabilities.

plot_matrix(eim_nav_opt$sd, sd = TRUE, y_labels = c("X18.29", "X30.49", "X50."))

Test difference between estimates

A relevant question is how significantly different are the probability estimates of two sets of data, such as two different districts. For instance, we can compare the estimated probabilities of the district "LO BARNECHEA" and "LA GRANJA", whose voting tendencies are expected to differ.

eim_gra <- get_eim_chile("LA GRANJA")
eim_gra <- run_em(eim_gra)
eim_bar <- get_eim_chile("LO BARNECHEA")
eim_bar <- run_em(eim_bar)

plot_district(eim_gra$prob, "La Granja", eim_bar$prob, "Lo Barnechea")

A Wald test can be applied to each component of the two probability matrix estimates:

comparison2 <- waldtest(
    object1 = eim_gra,
    object2 = eim_bar,
    method = "mult",
    nboot = 30,
    seed = 42,
)

round(comparison2$pvals, 3)

Conversely, one may examine the alternative scenario in which ballot boxes within a given region are partitioned into equally sized groups. In this configuration, both groups are expected to exhibit identical probability distributions.

eimRM <- get_eim_chile(region = "METROPOLITANA DE SANTIAGO")

n <- nrow(eimRM$X)
n_train <- floor(n * 0.5)
train_indices <- sample(seq_len(n), n_train)

eimRMhalf1 <- eim(X = eimRM$X[train_indices, ], W = eimRM$W[train_indices, ])
eimRMhalf2 <- eim(X = eimRM$X[-train_indices, ], W = eimRM$W[-train_indices, ])

eimRMhalf1 <- run_em(eimRMhalf1)
eimRMhalf2 <- run_em(eimRMhalf2)

plot_district(eimRMhalf1$prob, "Half 1", eimRMhalf2$prob, "Half 2")

comparison2 <- waldtest(
    object1 = eimRMhalf1,
    object2 = eimRMhalf2,
    nboot = 10,
    seed = 42
)

round(comparison2$pvals, 4)

Simulating Election Results

It is possible to simulate an artificial election and get its values as an eim object with the simulated probability. This is useful for testing the package's performance and comparing the different methods available.

eim_sim <- simulate_election(num_ballots = 15, num_groups = 2, num_candidates = 3, seed = 42)
eim_sim

The real voting probabilities of each group for each candidate can be obtained as follows:

eim_sim$real_prob

In this case, we can access these values since this is a simulation; however, this is not possible when working with real data.

The estimated voting probabilities are:

eim_sim <- run_em(eim_sim)
eim_sim$prob

It is possible to provide a specific voting probability for each candidate for each group.

input_probability <- matrix(c(0.9, 0.05, 0.05, 0.2, 0.3, 0.5), nrow = 2, byrow = TRUE)
input_probability

eim_sim2 <- simulate_election(
    num_ballots = 30, num_groups = 2, num_candidates = 3, seed = 42,
    prob = input_probability
)
eim_sim2
eim_sim2$real_prob

There is a parameter, lambda, that controls for the heterogeneity of voters' groups across ballot boxes. A value of lambda = 0 indicates that voters are all randomly assigned in ballot boxes; in contrast to lambda = 1, in which case voters are assigned in ballot boxes according to their demographic group. This is explained in detail in the documentation in simulate_election().

eim_sim3 <- simulate_election(
    num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42,
    lambda = 0.1
)

plot_matrix(eim_sim3$W)

Having the following estimated probabilities:

run_em(eim_sim3)$prob

On the other side, it is possible to simulate an heterogeneous sample

eim_sim4 <- simulate_election(
    num_ballots = 20, num_groups = 4, num_candidates = 2, seed = 42,
    lambda = 0.9
)

plot_matrix(eim_sim4$W)

With the underlying probabilities

run_em(eim_sim4)$prob

Any scripts or data that you put into this service are public.

fastei documentation built on Sept. 16, 2025, 1:09 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

fastei
Methods for ''A Fast Alternative for the R x C Ecological Inference Case''

Demonstration of the package usage
In fastei: Methods for ''A Fast Alternative for the R x C Ecological Inference Case''

Estimate Voting Probabilities

Standard deviation estimates

Reduce Estimation Error using Group Aggregation

Test difference between estimates

Simulating Election Results

Try the fastei package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

fastei Methods for ''A Fast Alternative for the R x C Ecological Inference Case''

Demonstration of the package usage In fastei: Methods for ''A Fast Alternative for the R x C Ecological Inference Case''

Estimate Voting Probabilities

Standard deviation estimates

Reduce Estimation Error using Group Aggregation

Test difference between estimates

Simulating Election Results

Try the fastei package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

fastei
Methods for ''A Fast Alternative for the R x C Ecological Inference Case''

Demonstration of the package usage
In fastei: Methods for ''A Fast Alternative for the R x C Ecological Inference Case''