knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(tidyverse)
library(scales)
library(brms)
library(synthjoint)
library(ccesMRPrun)
library(ccesMRPprep)
library(ccesMRPviz)
Variables like party ID and religion are not available in the population target. This is a major challenge in MRP (see Leemann and Wasserfallen, 2017).
One option that researchers often use is to include an area-level proportion and simply "control" for that aggregate variable in the regression. This is not the same as poststratification, but we believe that it helps to account for this variable in some way. (An alternative option, which I examine in the next section, is to impute this variable to the population target through a machine learning model trained on individual data. This strategy introduces some modeling error but still allows for post-stratification.)
Let's explore this proposition with a simple test case. Suppose we want to know the proportion of a CD's adults that have a high school degree or less. This quantity is known in the aggregate and in some joint tables through the ACS, so we have a ground truth in this simulation.
First create the dummy variable of interest.
acs_GA <- acs_GA %>%
  mutate(is_HS = as.numeric(educ == "HS or Less"),
         is_BA = as.numeric(educ %in% c("4-Year", "Post-Grad")))
acs_GA is a sample ACS population table. Education is already in the data, but suppose we do not know it.
With this binary variable, let's compute the CD-level quantity of interest.
val <- acs_GA %>%
  count(cd, is_HS, is_BA, wt = count) %>%
  group_by(cd) %>%
  summarize(prop_HS_truth = sum(is_HS * n) / sum(n))

# joins by cd
acs_df <- left_join(acs_GA, val)
Now suppose we have survey data we want to use for MRP.
cces_df <- cces_GA %>%
  # same as ACS manipulation
  mutate(is_HS = as.numeric(educ == "HS or Less")) %>%
  group_by(cd) %>%
  # suppose you know the population aggregate
  left_join(distinct(acs_df, cd, prop_HS_truth), by = "cd") %>%
  ungroup()
We merge prop_HS_truth here like a CD-level election result: we suppose that we do not have the educ variable in the ACS but that we do know the CD-level aggregate, so we merge that in. Notice that this repeats the same CD-level value for every CCES respondent in a given CD.
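As a quick sanity check (a sketch, not part of the original code), counting the distinct cd/prop_HS_truth pairs should show exactly one aggregate value per district, with n equal to the number of respondents in that district:

# each CD should map to a single prop_HS_truth value, repeated across respondents
cces_df %>%
  count(cd, prop_HS_truth)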
Not surprisingly, the CD-level aggregate is predictive of the outcome (coefficient 0.85, t-stat of 7).
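The regression behind that statement is not shown above; a minimal version of the check, assuming a simple linear probability model on the merged survey data, might look like this:

# illustrative check only: regress the individual-level dummy on the merged
# CD-level aggregate (the original specification may differ)
summary(lm(is_HS ~ prop_HS_truth, data = cces_df))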
Also notice that for this example, we are basically giving away the answer (prop_HS_truth is the quantity of interest, but we are going to include it in the regression). Usually you only have a predictive proxy, as when you want to estimate the CD-level voteshare and only have the lagged voteshare. We are giving away the answer here to give the "regression" strategy its best-case scenario.
Now let's try three model specifications:
# formula specs ----
mrp_forms <- c(
  "omitted"   = is_HS ~ (1|cd),
  "mean_only" = is_HS ~ (1|cd) + prop_HS_truth,
  "oracle"    = is_HS ~ (1|cd) + (1|educ)
)
"omitted"
is a version with no relevant variables in the MRP. "prop_HS_truth"
is where we will control for the CD-level predictor but we don't have the individual level education. "oracle"
is the model where we would post-stratify on education -- this is what we would do if we have both education in the survey and population joint distribution.
The ccesMRPrun package will run a quick MRP in one step with the mrp_onestep() function (see its documentation). Let's run MRP with these three specs and bind the results.
# run MRP -- only change formulas at each step
mrp_df <- map_dfr(
  .x = mrp_forms,
  .f = function(x) {
    mrp_onestep(x, cces_df,
                poststrat_tgt = acs_df,
                weight_var = "weight",
                add_on = val,
                verbose = FALSE,
                .cores = 2,
                .backend = "cmdstanr")
  },
  .id = "spec")

mrp_df
Now let's plot the MRP results.
mrp_plot <- mrp_df %>%
  mutate(spec = fct_inorder(spec))

scatter_45(mrp_plot, prop_HS_truth, p_mrp_est,
           lbvar = p_mrp_050,
           ubvar = p_mrp_900,
           by_form = ~spec,
           show_error = TRUE,
           by_labels = c(omitted = "~ (1|cd)",
                         mean_only = "~ (1|cd) + proportion HS",
                         oracle = "~ (1|cd) + (1|education)"),
           xlab = "True Proportion of Adults with High School - only Education",
           ylab = "MRP Estimate") +
  labs(caption = "Note: Estimand is proportion of CD that has a high school education or less. First model does not account for education. Third model poststratifies on education, and is therefore perfectly correct. Second model is only given CD-level proportion of High School-only graduates (the truth).")
Notice that the second model, where we only control for the CD-level aggregate, tightens the estimates around the 45-degree line, but it does not make the estimates more representative. There is still a bias from the survey undersampling HS-only voters, and controlling for an aggregate does not fix that problem.
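One way to quantify this (an illustrative check using the columns returned above; not part of the original output) is to compare the root mean squared error of the CD-level estimates across specifications:

# RMSE of MRP estimates against the known truth, by specification
mrp_plot %>%
  group_by(spec) %>%
  summarize(rmse = sqrt(mean((p_mrp_est - prop_HS_truth)^2)))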
Looking closer, we see that including the aggregate variable actually worsens some estimates in this case. High-education districts get their estimates pulled away from the truth, while low-education districts improve. This again seems to be a function of pooling, in which the errors become more homogeneous.
mrp_plot %>%
  filter(spec %in% c("omitted", "mean_only")) %>%
  transmute(spec, cd, prop_HS_truth, error = p_mrp_est - prop_HS_truth) %>%
  pivot_wider(id_cols = c(cd, prop_HS_truth),
              names_from = spec,
              values_from = error) %>%
  ggplot(aes(omitted, mean_only, color = prop_HS_truth)) +
  geom_point() +
  geom_abline(linetype = "dashed") +
  scale_x_continuous(labels = unit_format(scale = 100, accuracy = 1, suffix = "pp")) +
  scale_y_continuous(labels = unit_format(scale = 100, accuracy = 1, suffix = "pp")) +
  coord_equal() +
  theme_bw() +
  scale_color_binned(type = "viridis", labels = percent_format(accuracy = 1)) +
  annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.1,
           label = "Aggregate-only model\nImproves estimates", size = 2.5) +
  annotate("text", x = Inf, y = -Inf, hjust = 1.1, vjust = -0.1,
           label = "Aggregate-only model\nWorsens estimates", size = 2.5) +
  labs(x = "Error from Omitted Variable Model",
       y = "Error from Aggregate-Only Model",
       color = "Actual Percent\nHigh School Only") +
  theme(axis.text = element_text(color = "black"))
Instead of including the aggregate information, others might opt to model a synthetic joint distribution. The ccesMRPprep package provides several methods to compute such a synthetic table.
acs_syn_mlogit <- synth_mlogit(educ ~ female + age,
                               microdata = cces_GA,
                               # internally will be treated as if we don't know education
                               poptable = acs_df,
                               area_var = "cd")
On average, this produces a table where the margins are smoothed out nationally. In the CCES survey sample, `r percent(mean(cces_GA$educ == "HS or Less"))` of respondents have a "HS or Less" degree, and the synthetic margins appear to match that number. This is not surprising, because there are CD "fixed effects" not captured by gender and age, but it is nonetheless troubling for MRP.
The ccesMRPprep package provides two other approaches to estimating the joint distribution -- these incorporate another source of information, namely the margins that are available in the population data.
edu_margins <- collapse_table(acs_GA,
                              area_var = "cd",
                              X_vars = "educ",
                              count_var = "count",
                              new_name = "count")
edu_margins
Given these data, which are simply the marginal distribution of education in each CD, one option is to take the product of the margins, assuming independence:
acs_syn_prod <- synth_prod(educ ~ female + age,
                           poptable = acs_GA,
                           newtable = edu_margins,
                           area_var = "cd")
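For intuition, the independence product can be written out by hand: within each CD, the joint cell count is approximated by the demographic margin times the education share. The following is a rough manual sketch (hypothetical; synth_prod's internals may differ):

# P(female, age | cd) from the population table, as counts
demo_margin <- acs_GA %>%
  count(cd, female, age, wt = count, name = "n_demo")

# P(educ | cd) from the known education margins
educ_share <- edu_margins %>%
  group_by(cd) %>%
  mutate(p_educ = count / sum(count)) %>%
  ungroup() %>%
  select(cd, educ, p_educ)

# cross the two within each CD: implied joint cell count under independence
manual_prod <- demo_margin %>%
  left_join(educ_share, by = "cd") %>%
  mutate(count = n_demo * p_educ)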
Another option is to first fit the survey-based model and then fix those margins to the known population margins.
acs_syn_fix <- synth_smoothfix(educ ~ female + age,
                               microdata = cces_GA,
                               poptable = acs_GA,
                               fix_to = edu_margins,
                               area_var = "cd")
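Conceptually, the "fix" step rescales the model-based cells so that the education margin in each CD matches the known population margin. A rough one-margin sketch follows (hypothetical; synth_smoothfix may work differently, and this assumes the synthetic and margin tables both have cd, educ, and count columns):

# known educ shares per CD
target <- edu_margins %>%
  group_by(cd) %>%
  mutate(p_target = count / sum(count)) %>%
  ungroup() %>%
  select(cd, educ, p_target)

# rescale each cell so the educ margin within the CD hits the target share
manual_fix <- acs_syn_mlogit %>%
  group_by(cd) %>%
  mutate(cd_total = sum(count)) %>%
  group_by(cd, educ) %>%
  mutate(p_model_educ = sum(count) / cd_total) %>%
  ungroup() %>%
  left_join(target, by = c("cd", "educ")) %>%
  mutate(count = count * p_target / p_model_educ)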
See the ccesMRPprep vignette for validation.
In any case, this allows us to run MRP using these "synthetic" targets. Now let's fix the model specification and vary the poststratification targets.
syn_datasets <- list(
  "survey"    = acs_syn_mlogit,
  "product"   = acs_syn_prod,
  "surveyfix" = acs_syn_fix
)
mrsp_df <- map_dfr(
  .x = syn_datasets,
  .f = function(x) {
    mrp_onestep(is_HS ~ (1|educ) + (1|cd), cces_df,
                poststrat_tgt = mutate(x, count = as.integer(count)),
                weight_var = "weight",
                add_on = val,
                verbose = FALSE,
                .cores = 2,
                .backend = "cmdstanr")
  },
  .id = "synth")
And examine the outcome.
mrsp_df %>%
  mutate(synth = recode_factor(synth,
                               survey = "Survey Model\n(synth_mlogit)",
                               product = "Simple Product\n(synth_prod)",
                               surveyfix = "Survey + Rake\n(synth_smoothfix)")) %>%
  scatter_45(prop_HS_truth, p_mrp_est,
             by_form = ~synth,
             lbvar = p_mrp_050,
             ubvar = p_mrp_900,
             show_error = TRUE,
             xlab = "True Proportion of Adults with High School - only Education",
             ylab = "MRP Estimate")
The first set of estimates looks bad. If the imputation is faulty, it is not a good idea to use it to post-stratify, either. The only reason the second and third sets of estimates are perfect is that we asked MRP for an outcome for which we supplied the joint distribution. If we were estimating a different outcome for which the margins are not known (which in practice is always the case), the values would change, of course.
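The same RMSE comparison as before summarizes this (illustrative; assuming mrsp_df carries prop_HS_truth from the add_on merge, as in the earlier output):

# RMSE of the synthetic-target MRP estimates against the truth, by target
mrsp_df %>%
  group_by(synth) %>%
  summarize(rmse = sqrt(mean((p_mrp_est - prop_HS_truth)^2)))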