knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 2)
library(mrpkit)

This vignette presents the typical workflow of mrpkit, an R package for implementing Multilevel Regression with Post-stratification (MRP).

Introduction

The aim of mrpkit is to perform multilevel regression with post-stratification (MRP; Gelman & Little, 1997; Park, Gelman, & Bafumi, 2004). By using this package, you follow a specific workflow for MRP. Hence, it creates a more reproducible and accountable environment for conducting MRP.

Unlike many R packages, mrpkit uses R6 objects (Chang, 2019). For a user the main implication is that instead of calling a function to perform a task, we call a method embedded in an object using object$method. Typical usage of this package will create four objects: SurveyData, QuestionMap, SurveyMap, and SurveyFit. There are several methods in each of these objects that will be explained later in this vignette.

Workflow

The MRP workflow implemented by mrpkit is:

  1. Prepare the data.
  2. Create SurveyData objects for the sample and post-stratification datasets.
  3. Create a SurveyMap object to match the values between the sample and post-stratification (population) datasets for each question.
  4. Map the questions between the sample and post-stratification datasets.
  5. Fit the model by creating a SurveyFit object.
  6. Predict the probability of the outcome using the post-stratification dataset.
  7. Estimate the probability of the outcome at a level of interest, for example, state/province level.
  8. Create a summary with common comparisons
  9. Visualize the probability of the outcome.

A more detailed example of this workflow follows.

Prepare the data

In this vignette, we use simulated survey sample and post-stratification datasets. These datasets display the preference of voters to the Box Party (BP), a political party in a fake country, Shape World, along with demographic variables. The population is designed using declare_population from DeclareDesign (Blair, Cooper, Coppock, & Humphreys, 2019) to set the features in the population.

The variables in the simulated survey dataset are:

head(shape_survey)

And the summarized simulated population is:

head(approx_voters_popn)

Create a SurveyData object

SurveyData objects are used to represent both the sample and the post-stratification datasets, along with their metadata. The sample metadata consists of the survey's variables, the questions that were asked, and the levels of the responses. It is also optionally possible to include the survey weight and its design formula in this object.

The most important methods for SurveyData objects are SurveyData$new and SurveyData$print. The SurveyData$new method is used to transform a regular data frame into a SurveyData object, and SurveyData$print is used to display the created SurveyData object in the console.

In this example, we will show the steps to create SurveyData objects using the SurveyData$new method. This method takes the metadata mentioned previously for its arguments:

  1. data: The data frame containing the sample or population data. Note that this method will automatically use each of the factor, character, and binary variables to create questions and responses if you do not specify the list of questions and responses manually.
  2. questions: The question asked in the questionnaire for each of those columns.
  3. responses: The possible responses for each question.
  4. weights: This is an optional argument of the survey or post-stratification/population weight. If the weight is specified as a column in the datasets, then this argument should be specified as a string. In this case, it is "wt" for shape_survey and approx_voters_popn.

It is possible to have an already summarized post-stratification dataset. If this is the case, then the weight would be the size of each cell. Alternatively, if the entire individual-level population is given, then this argument should be omitted, and it would be automatically specified as 1.

  1. design: This is an optional argument of the design formula of the survey, specified using the survey (Lumley, 2020) package notation.

Here we create a SurveyData object for the sample dataset.

box_pref <- SurveyData$new(
  data = shape_survey,
  questions = list(
    age = "Please identify your age group",
    gender = "Please select your gender",
    vote_for = "Which party did you vote for in the 2018 election?",
    highest_educ = "Please identify your completed highest education",
    state = "Which State do you live in?",
    y = "If today is election day, will you vote for the Box Party?"
  ),
  responses = list(
    age = levels(shape_survey$age),
    gender = levels(shape_survey$gender),
    vote_for = levels(shape_survey$vote_for),
    highest_educ = levels(shape_survey$highest_educ),
    state = levels(shape_survey$state),
    y = c("no", "yes")
  ),
  weights = "wt",
  design = list(ids =  ~ 1)
)

box_pref$print()

Now we create a SurveyData object for the population dataset. Usually, the post-stratification data is from a large survey, for example, the American Community Survey (ACS) or Demographic and Health Survey (DHS).

popn_obj <- SurveyData$new(
  data = approx_voters_popn,
  questions = c(
    age_group = "Which age group are you?",
    gender = "Which gender are you identified?",
    vote_pref = "Which party do you prefer to vote?",
    education = "What is the highest grade or level of school you have completed",
    state = "Please identify the state where you live in"
  ),
  responses = list(
    age_group = levels(approx_voters_popn$age_group),
    gender = levels(approx_voters_popn$gender),
    vote_pref = levels(approx_voters_popn$vote_pref),
    education = levels(approx_voters_popn$education),
    state = levels(approx_voters_popn$state)
  ),
  weights = "wt",
  design = list(ids =  ~ 1)
)

# print it
popn_obj$print() 

In addition, sometimes, we want to access the properties of the SurveyData object, for example, the weights, the design, the questions and the number of questions, or the responses. The SurveyData object has methods that allow us to do so: SurveyData$weights(), SurveyData$design(), SurveyData$questions(), SurveyData$n_questions(), SurveyData$responses().

Map the values between survey and post-stratification data using QuestionMap objects

From the metadata above, we can see that the survey (box_pref) and population (popn_obj) objects have different column labels and response levels. Specifically, the column label for age in box_pref is age, whereas, in popn_obj, the column label is age_group. This variable also has different levels; box_pref has seven levels of age, whereas popn_obj has only four levels of age.

The column names and levels need to be aligned for MRP and this is accomplished with the QuestionMap$new method.

This method takes three arguments:

  1. name: The name of the underlying construct. This will be used in the modeling stage.
  2. col_names: A character vector of the column label in the survey and post-stratification objects. The column name of the survey should always be the first element, followed by the column name of the post-stratification object. This order is not interchangeable.
  3. values_map: The list of the mapped values between the survey and post-stratification object. If there is a meaningful ordering over the values, they should be sorted over that order, either descending or ascending.

Here is an example of how this method works:

# Create QuestionMap$object for the question related to age
q_age <- QuestionMap$new(
  name = "age",
  col_names = c("age", "age_group"),
  values_map = list(
    "18-25" = "18-35",
    "26-35" = "18-35",
    "36-45" = "36-55",
    "46-55" = "36-55",
    "56-65" = "56-65",
    "66-75" = "66+",
    "76-90" = "66+"
  )
)

# Create QuestionMap$object for the question related to party preference
q_party_pref <- QuestionMap$new(
  name = "party_pref",
  col_names = c("vote_for", "vote_pref"),
  values_map = list(
    "Box Party" = "BP",
    "BP" = "BP",
    "Circle Party" = "CP",
    "CP" = "CP"
  )
)

# Create QuestionMap$object for the question related to gender
q_gender <- QuestionMap$new(
  name = "gender",
  col_names = c("gender", "gender"),
  values_map = data.frame(
    "male" = "m",
    "female" = "f",
    "nonbinary" = "nb"
  )
)

# Create QuestionMap$object for the question related to education
q_educ <- QuestionMap$new(
  name = "highest_education",
  col_names = c("highest_educ", "education"),
  values_map = list(
    "no high school" = "no high school",
    "high school" = "high school",
    "some college" = "some college",
    "associates" = "some college",
    "4-year college" = "4-years college",
    "post-graduate" = "post-grad"
  )
)

# Create QuestionMap$object for the question related to state
q_state <- QuestionMap$new(
  name = "state",
  col_names = c("state", "state"),
  values_map = list(
    "State A" = "A",
    "State B" = "B",
    "State C" = "C",
    "State D" = "D",
    "State E" = "E"
  )
)

Using the SurveyMap object

The SurveyMap object holds the mapping between a set of items in the survey and post-stratification datasets. It takes the SurveyData objects, which in this case are bp_pref and popn_obj, together with labels and values that have been matched and provided in the QuestionMap objects. The mapped object specifies the correspondences in the variables that will be used when fitting the model.

The SurveyMap object has many methods. Here we demonstrate the new, add, delete, replace, mapping, tabulate, and fit methods. Use ?SurveyMap to see the other available methods.

  1. SurveyMap$new

    This method is used to initialize a new SurveyMap object. This takes SurveyData and QuestionMap objects as its argument. You can include all of the questions here, or add it incrementally using add method that is explained in point 2.

# Create a new SurveyMap object
ex_map <-
  SurveyMap$new(sample = box_pref, population = popn_obj, q_age)

# example of mapping all of the questions at once
# ex_map <- SurveyMap$new(sample = box_pref, population = popn_obj,
#                        q_age, q_educ, q_gender, q_party_pref, q_state)
  1. SurveyMap$add

    To get rid of the warning above, we could use this method to add other QuestionMap objects to the map.

# Add questions incrementally
ex_map$add(q_educ)
ex_map$add(q_gender, q_party_pref, q_state)

print(ex_map)
  1. SurveyMap$delete

    Now, we have all of the questions in the post-stratification object mapped. Sometimes, we do not want to include all of the variables in the post-stratification matrix. For example, in this case, we want to exclude party_pref from the matrix using delete method.

# We can also use the label instead of the object name
ex_map$delete("party_pref")
print(ex_map)
  1. SurveyMap$replace

    Suppose that you changed your mind and want to use another level of matching for a certain question. For example, in question regarding education, q4, you want to change the "associates" level to be equal to "4-years college". To do this, you can firstly create a new QuestionMap object corresponding to that new level mapping. Secondly, you can use the replace method to change the question object. This method takes two arguments, namely old question as the first argument and new question as the second argument.

# Create a new QuestionMap corresponding to the new level matching
q_educ_new <- QuestionMap$new(
  name = "highest_educ",
  col_names = c("highest_educ", "education"),
  values_map = list(
    "no high school" = "no high school",
    "high school" = "high school",
    "some college" = "some college",
    "associates" = "4-years college",
    "4-year college" = "4-years college",
    "post-graduate" = "post-grad"
  )
)

# replace the old question with new question
ex_map$replace(q_educ, q_educ_new)
print(ex_map)
  1. SurveyMap$mapping

    Once you are happy with your map object, you can prepare the mapped data for model fitting with the mapping method.

ex_map$mapping()
  1. SurveyMap$tabulate

    The next step is to prepare the post-stratification table using the tabulate method. If you want to only use a certain variable in the post-stratification matrix, then you should put that variable as the argument. For instance, here, we only use age in the post-stratification matrix.

ex_map$tabulate("age")

If you want to include all of the variables in the post-stratification matrix, then the method should not take any arguments.

ex_map$tabulate()
  1. SurveyMap$fit

    Finally, we can fit the model using fit method. Currently, mrpkit natively supports rstanarm::stan_glmer and rstanarm::stan_glm (Goodrich, Gabry, Ali, & Brilleman, 2020), lme4::glmer (Bates, Maechler, Bolker, & Walker, 2015) , brms::brm (Bürkner, 2018). However, you can also specify your own function. In this case, it should give a formula argument and a data argument that accepts a data frame (the requirement for a formula may be relaxed in the future). In this example, we fit three models using fun=rstanarm::stan_glmer, fun=lme4::glmer, and fun=brms::brm with age, gender, and education as the predictors.

# Example of using rstanarm::stan_glmer
fit1 <- ex_map$fit(
  fun = rstanarm::stan_glmer,
  formula = y ~ (1 | age) + (1 | gender) + (1 | highest_educ) + (1 | state),
  family = "binomial",
  refresh = 0,
  cores = 2)

# Example of using lme4::glmer
fit2 <- ex_map$fit(
  fun = lme4::glmer,
  formula = y ~ (1 | age) + (1 | gender) + (1 | highest_educ) + (1 | state),
  family = "binomial")
# Example using brms::brm
fit3 <- ex_map$fit(
  fun = brms::brm,
  formula = y ~ (1 | age) + (1 | gender),
  family = "bernoulli",
  refresh = 100,
  cores = 2)

The resulting object is not the fitted model object created by the modeling function, but rather a SurveyFit object that contains the fitted model and provides useful methods for working with it.

Using the SurveyFit object

At this stage, we already have the fitted model which has stored as a SurveyFit object. Using this object we can generate the predicted probabilities for the post-stratification cells, aggregate the small area estimation as needed (e.g., state or province estimates), and visualize the aggregated estimates.

Get the probability of the outcome from each poststratification cells

After creating the SurveyFit object, we can generate the predicted probability of the outcome of each post-stratification cell using the SurveyFit$population_predict method. It returns a matrix with rows that correspond to the columns of post-stratification data and columns that correspond to the posterior samples.

Furthermore, if the model fitting is done with one of the natively supported functions (rstanarm::stan_glmer, lme4::glmer, brms::brm), then this method does not take any arguments. However, if you used a custom function in the model fitting, then this method takes multiple arguments, including a custom prediction function (see ?SurveyFit for details). In this example, we will show how to use this method for the fit1 object created using rstanarm::stan_glmer.

# predict the probability of voting for the Box Party using the fit1 model
poststrat_est_fit1 <- fit1$population_predict()

dim(poststrat_est_fit1)
nrow(fit1$map()$poststrat_data())

Aggregate the probability of the outcome to a higher level of estimate

The next step is to generate the population estimate or group estimate using the SurveyFit$aggregate method. This method takes two arguments, namely the post-stratification estimate data and the variable whose level to which the estimated value would be aggregated to. In this example, we want to aggregate the estimated value by the level of age and state. If the variable is not specified, then this method will generate a population estimate.

# aggregate the predicted value by age
age_estimation <- fit1$aggregate(poststrat_est_fit1, by = "age")
str(age_estimation)

# aggregate the predicted value by state
state_estimation <- fit1$aggregate(poststrat_est_fit1, by = "state")
str(state_estimation)

# generate the population estimate
popn_estimation <- fit1$aggregate(poststrat_est_fit1)
str(popn_estimation)

The returned object is a data frame with number of rows equal to the number of posterior draws times the number of levels of the by variable. Each of the numbers in the value column represents the estimated probability of the outcome. In this example, age_estimation contains the probability of voting for the Box Party for each age group for each posterior draw.

# get the mean and sd for each age group
library(dplyr)
age_estimation %>%
  group_by(age) %>%
  summarize(mean = mean(value), sd = sd(value))

Summarise and compare

The output of aggregate is a posterior predictive distribution for the population (or sub-populations) of interest. This can be used with the plot function to visualize. However, we also might wish to make a table summary of the posterior mean and standard deviation, along with comparisons with raw estimates and weighted estimates. We can use the summary method to do this. This method takes the aggregated data frame and produces a data frame of mean and standard deviation estimates for each group (if in the aggregated frame) or overall in the population.

fit1$summary(age_estimation)
fit1$summary(state_estimation)
fit1$summary(popn_estimation)

The standard deviation for the weighted estimate (wtd) is created by wrapping the survey package using the original design specified, and the raw standard deviation is $\sqrt\frac{pq}{n}$.\

Visualize it

Once we have the aggregated estimates, we can visualize them using the plot method. This method takes the aggregated data frame and additional_stats option (the default is wtd and raw estimates) as its arguments. It generates a violin plot of the aggregated estimate by the level of a certain variable. When additional statistics are specified (options are raw, wtd and mrp; more than one can be specified as a vector), these are added as points with error bars representing the 95% confidence interval.
In this example, we display the estimated plot for age and state and a density plot for the population estimates.

plot_age <- fit1$plot(age_estimation) 
plot_age

The plot above shows the distribution of people who vote for the Box Party for each age group. The plot implies that people who are aged 36-55 are most likely to vote for the BP.

The next plot displays the probability of voting for the BP by the state without using additional statistics. We learn that the probability of the BP winning the election is high since most people in almost all the states (around 75%) vote for that party. State B is the state with the lowest probability of vote for the BP. The distribution still ranged more than 50%. Note that this violin plot is generated using ggplot2 (Wickham, 2016). Hence, it could be modified the same way ggplot2's plot is modified. For example, here, we can add the title and change the theme of the plot.

library(ggplot2)
plot_state <- fit1$plot(state_estimation, additional_stats = "none") +
  ggtitle("Probability of voting the BP by state") +
  theme_bw()
plot_state

Lastly, we show the plot of the population estimate, with lines representing the mean weighted and raw estimates. About 70% of people in the Shape World will be likely to vote for the Box Party.

fit1$plot(popn_estimation, additional_stats = c("raw","wtd"))

References

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. \doi:10.18637/jss.v067.i01.

Blair, G., Cooper, J., Coppock, A., & Humphreys, M. (2019). Declaring and Diagnosing Research Designs. American Political Science Review 113(3): 838-859. URL https://declaredesign.org/declare.pdf.

Bürkner, P.C. (2018). Advanced Bayesian Multilevel Modeling with the R Package brms. The R Journal, 10(1), 395-411. \doi:10.32614/RJ-2018-017.

Chang, W. (2019). R6: Encapsulated Classes with Reference Semantics. R package version 2.4.1. https://CRAN.R-project.org/package=R6.

Gelman, A., & Little, T. C. (1997). Poststratification into many categories using hierarchical logistic regression. Survey Methodology, 23(2), 127–135.

Goodrich, B., Gabry, J., Ali, I., & Brilleman S. (2020). rstanarm: Bayesian applied regression modeling via Stan. R package version 2.21.1 https://mc-stan.org/rstanarm/.

Lumley, T. (2020) "survey: analysis of complex survey samples". R package version 4.0.

Park, D. K., Gelman, A., & Bafumi, J. (2004). Bayesian multilevel estimation with poststratification: State-level estimates from national polls. Political Analysis, 12(4), 375–385.

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.2. https://CRAN.R-project.org/package=dplyr

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York



lauken13/mrpkit documentation built on Aug. 6, 2023, 3:42 a.m.