knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(digits = 2)
library(mrpkit)
This vignette presents the typical workflow of mrpkit
, an R package for implementing Multilevel Regression with Post-stratification (MRP).
The aim of mrpkit
is to perform multilevel regression with post-stratification (MRP; Gelman & Little, 1997; Park, Gelman, & Bafumi, 2004). By using this package, you follow a specific workflow for MRP. Hence, it creates a more reproducible and accountable environment for conducting MRP.
Unlike many R packages, mrpkit
uses R6 objects (Chang, 2019). For a user the main implication is that instead of calling a function to perform a task, we call a method embedded in an object using object$method
. Typical usage of this package will create four objects: SurveyData
, QuestionMap
, SurveyMap
, and SurveyFit
. There are several methods in each of these objects that will be explained later in this vignette.
The MRP workflow implemented by mrpkit
is:
SurveyData
objects for the sample and post-stratification datasets.SurveyMap
object to match the values between the sample and post-stratification (population) datasets for each question.SurveyFit
object.A more detailed example of this workflow follows.
In this vignette, we use simulated survey sample and post-stratification datasets. These datasets display the preference of voters to the Box Party (BP), a political party in a fake country, Shape World, along with demographic variables. The population is designed using declare_population
from DeclareDesign
(Blair, Cooper, Coppock, & Humphreys, 2019) to set the features in the population.
The variables in the simulated survey dataset are:
head(shape_survey)
And the summarized simulated population is:
head(approx_voters_popn)
SurveyData
objectSurveyData
objects are used to represent both the sample and the post-stratification datasets, along with their metadata. The sample metadata consists of the survey's variables, the questions that were asked, and the levels of the responses. It is also optionally possible to include the survey weight and its design formula in this object.
The most important methods for SurveyData
objects are SurveyData$new
and SurveyData$print
. The SurveyData$new
method is used to transform a regular data frame into a SurveyData
object, and SurveyData$print
is used to display the created SurveyData
object in the console.
In this example, we will show the steps to create SurveyData
objects using the SurveyData$new
method. This method takes the metadata mentioned previously for its arguments:
data
: The data frame containing the sample or population data.
Note that this method will automatically use each of the factor, character, and binary variables to create questions and responses if you do not specify the list of questions and responses manually. questions
: The question asked in the questionnaire for each of those columns.responses
: The possible responses for each question.weights
: This is an optional argument of the survey or post-stratification/population weight. If the weight is specified as a column in the datasets, then this argument should be specified as a string. In this case, it is "wt" for shape_survey
and approx_voters_popn
.It is possible to have an already summarized post-stratification dataset. If this is the case, then the weight would be the size of each cell. Alternatively, if the entire individual-level population is given, then this argument should be omitted, and it would be automatically specified as 1.
design
: This is an optional argument of the design formula of the survey, specified using the survey
(Lumley, 2020) package notation.Here we create a SurveyData
object for the sample dataset.
box_pref <- SurveyData$new( data = shape_survey, questions = list( age = "Please identify your age group", gender = "Please select your gender", vote_for = "Which party did you vote for in the 2018 election?", highest_educ = "Please identify your completed highest education", state = "Which State do you live in?", y = "If today is election day, will you vote for the Box Party?" ), responses = list( age = levels(shape_survey$age), gender = levels(shape_survey$gender), vote_for = levels(shape_survey$vote_for), highest_educ = levels(shape_survey$highest_educ), state = levels(shape_survey$state), y = c("no", "yes") ), weights = "wt", design = list(ids = ~ 1) ) box_pref$print()
Now we create a SurveyData
object for the population dataset. Usually, the post-stratification data is from a large survey, for example, the American Community Survey (ACS) or Demographic and Health Survey (DHS).
popn_obj <- SurveyData$new( data = approx_voters_popn, questions = c( age_group = "Which age group are you?", gender = "Which gender are you identified?", vote_pref = "Which party do you prefer to vote?", education = "What is the highest grade or level of school you have completed", state = "Please identify the state where you live in" ), responses = list( age_group = levels(approx_voters_popn$age_group), gender = levels(approx_voters_popn$gender), vote_pref = levels(approx_voters_popn$vote_pref), education = levels(approx_voters_popn$education), state = levels(approx_voters_popn$state) ), weights = "wt", design = list(ids = ~ 1) ) # print it popn_obj$print()
In addition, sometimes, we want to access the properties of the SurveyData
object, for example, the weights, the design, the questions and the number of questions, or the responses. The SurveyData
object has methods that allow us to do so: SurveyData$weights()
, SurveyData$design()
, SurveyData$questions()
, SurveyData$n_questions()
, SurveyData$responses()
.
QuestionMap
objectsFrom the metadata above, we can see that the survey (box_pref
) and population (popn_obj
) objects have different column labels and response levels. Specifically, the column label for age in box_pref
is age
, whereas, in popn_obj
, the column label is age_group
. This variable also has different levels; box_pref
has seven levels of age, whereas popn_obj
has only four levels of age.
The column names and levels need to be aligned for MRP and this is accomplished with the QuestionMap$new
method.
This method takes three arguments:
name
: The name of the underlying construct. This will be used in the modeling stage.col_names
: A character vector of the column label in the survey and post-stratification objects. The column name of the survey should always be the first element, followed by the column name of the post-stratification object. This order is not interchangeable. values_map
: The list of the mapped values between the survey and post-stratification object. If there is a meaningful ordering over the values, they should be sorted over that order, either descending or ascending.Here is an example of how this method works:
# Create QuestionMap$object for the question related to age q_age <- QuestionMap$new( name = "age", col_names = c("age", "age_group"), values_map = list( "18-25" = "18-35", "26-35" = "18-35", "36-45" = "36-55", "46-55" = "36-55", "56-65" = "56-65", "66-75" = "66+", "76-90" = "66+" ) ) # Create QuestionMap$object for the question related to party preference q_party_pref <- QuestionMap$new( name = "party_pref", col_names = c("vote_for", "vote_pref"), values_map = list( "Box Party" = "BP", "BP" = "BP", "Circle Party" = "CP", "CP" = "CP" ) ) # Create QuestionMap$object for the question related to gender q_gender <- QuestionMap$new( name = "gender", col_names = c("gender", "gender"), values_map = data.frame( "male" = "m", "female" = "f", "nonbinary" = "nb" ) ) # Create QuestionMap$object for the question related to education q_educ <- QuestionMap$new( name = "highest_education", col_names = c("highest_educ", "education"), values_map = list( "no high school" = "no high school", "high school" = "high school", "some college" = "some college", "associates" = "some college", "4-year college" = "4-years college", "post-graduate" = "post-grad" ) ) # Create QuestionMap$object for the question related to state q_state <- QuestionMap$new( name = "state", col_names = c("state", "state"), values_map = list( "State A" = "A", "State B" = "B", "State C" = "C", "State D" = "D", "State E" = "E" ) )
SurveyMap
objectThe SurveyMap
object holds the mapping between a set of items in the survey and post-stratification datasets. It takes the SurveyData
objects, which in this case are bp_pref
and popn_obj
, together with labels and values that have been matched and provided in the QuestionMap
objects. The mapped object specifies the correspondences in the variables that will be used when fitting the model.
The SurveyMap
object has many methods. Here we demonstrate the new
, add
,
delete
, replace
, mapping
, tabulate
, and fit
methods. Use ?SurveyMap
to see the other available methods.
SurveyMap$new
This method is used to initialize a new SurveyMap
object. This takes SurveyData
and QuestionMap
objects as its argument. You can include all of the questions here, or add it incrementally using add
method that is explained in point 2.
# Create a new SurveyMap object ex_map <- SurveyMap$new(sample = box_pref, population = popn_obj, q_age) # example of mapping all of the questions at once # ex_map <- SurveyMap$new(sample = box_pref, population = popn_obj, # q_age, q_educ, q_gender, q_party_pref, q_state)
SurveyMap$add
To get rid of the warning above, we could use this method to add other QuestionMap
objects to the map.
# Add questions incrementally ex_map$add(q_educ) ex_map$add(q_gender, q_party_pref, q_state) print(ex_map)
SurveyMap$delete
Now, we have all of the questions in the post-stratification object mapped. Sometimes, we do not want to include all of the variables in the post-stratification matrix. For example, in this case, we want to exclude party_pref
from the matrix using delete
method.
# We can also use the label instead of the object name ex_map$delete("party_pref") print(ex_map)
SurveyMap$replace
Suppose that you changed your mind and want to use another level of matching for a certain question. For example, in question regarding education, q4, you want to change the "associates" level to be equal to "4-years college". To do this, you can firstly create a new QuestionMap
object corresponding to that new level mapping. Secondly, you can use the replace
method to change the question object. This method takes two arguments, namely old question
as the first argument and new question
as the second argument.
# Create a new QuestionMap corresponding to the new level matching q_educ_new <- QuestionMap$new( name = "highest_educ", col_names = c("highest_educ", "education"), values_map = list( "no high school" = "no high school", "high school" = "high school", "some college" = "some college", "associates" = "4-years college", "4-year college" = "4-years college", "post-graduate" = "post-grad" ) ) # replace the old question with new question ex_map$replace(q_educ, q_educ_new) print(ex_map)
SurveyMap$mapping
Once you are happy with your map object, you can prepare the mapped data for model fitting with the mapping
method.
ex_map$mapping()
SurveyMap$tabulate
The next step is to prepare the post-stratification table using the tabulate
method. If you want to only use a certain variable in the post-stratification matrix, then you should put that variable as the argument. For instance, here, we only use age
in the post-stratification matrix.
ex_map$tabulate("age")
If you want to include all of the variables in the post-stratification matrix, then the method should not take any arguments.
ex_map$tabulate()
SurveyMap$fit
Finally, we can fit the model using fit
method. Currently, mrpkit
natively supports rstanarm::stan_glmer
and rstanarm::stan_glm
(Goodrich, Gabry, Ali, & Brilleman, 2020), lme4::glmer
(Bates, Maechler, Bolker, & Walker, 2015) , brms::brm
(Bürkner, 2018). However, you can also specify your own function. In this case, it should give a formula
argument and a data
argument that accepts a data frame (the requirement for a formula may be relaxed in the future). In this example, we fit three models using fun=rstanarm::stan_glmer
, fun=lme4::glmer
, and fun=brms::brm
with age
, gender
, and education
as the predictors.
# Example of using rstanarm::stan_glmer fit1 <- ex_map$fit( fun = rstanarm::stan_glmer, formula = y ~ (1 | age) + (1 | gender) + (1 | highest_educ) + (1 | state), family = "binomial", refresh = 0, cores = 2) # Example of using lme4::glmer fit2 <- ex_map$fit( fun = lme4::glmer, formula = y ~ (1 | age) + (1 | gender) + (1 | highest_educ) + (1 | state), family = "binomial")
# Example using brms::brm fit3 <- ex_map$fit( fun = brms::brm, formula = y ~ (1 | age) + (1 | gender), family = "bernoulli", refresh = 100, cores = 2)
The resulting object is not the fitted model object created by the modeling function, but rather a SurveyFit
object that contains the fitted model and provides useful methods for working with it.
SurveyFit
objectAt this stage, we already have the fitted model which has stored as a SurveyFit
object. Using this object we can generate the predicted probabilities for the post-stratification cells, aggregate the small area estimation as needed (e.g., state or province estimates), and visualize the aggregated estimates.
After creating the SurveyFit
object, we can generate the predicted probability of the outcome of each post-stratification cell using the SurveyFit$population_predict
method. It returns a matrix with rows that correspond to the columns of post-stratification data and columns that correspond to the posterior samples.
Furthermore, if the model fitting is done with one of the natively supported functions (rstanarm::stan_glmer
, lme4::glmer
, brms::brm
), then this method does not take any arguments. However, if you used a custom function in the model fitting, then this method takes multiple arguments, including a custom prediction function
(see ?SurveyFit
for details). In this example, we will show how to use this method for the fit1
object created using rstanarm::stan_glmer
.
# predict the probability of voting for the Box Party using the fit1 model poststrat_est_fit1 <- fit1$population_predict() dim(poststrat_est_fit1) nrow(fit1$map()$poststrat_data())
The next step is to generate the population estimate or group estimate using the SurveyFit$aggregate
method. This method takes two arguments, namely the post-stratification estimate data and the variable whose level to which the estimated value would be aggregated to. In this example, we want to aggregate the estimated value by the level of age
and state
. If the variable is not specified, then this method will generate a population estimate.
# aggregate the predicted value by age age_estimation <- fit1$aggregate(poststrat_est_fit1, by = "age") str(age_estimation) # aggregate the predicted value by state state_estimation <- fit1$aggregate(poststrat_est_fit1, by = "state") str(state_estimation) # generate the population estimate popn_estimation <- fit1$aggregate(poststrat_est_fit1) str(popn_estimation)
The returned object is a data frame with number of rows equal to the number of
posterior draws times the number of levels of the by
variable. Each of the
numbers in the value
column represents the estimated probability of the
outcome. In this example, age_estimation
contains the probability of voting
for the Box Party for each age group for each posterior draw.
# get the mean and sd for each age group library(dplyr) age_estimation %>% group_by(age) %>% summarize(mean = mean(value), sd = sd(value))
The output of aggregate
is a posterior predictive distribution for the
population (or sub-populations) of interest. This can be used with the plot
function to visualize. However, we also might wish to make a table summary of
the posterior mean and standard deviation, along with comparisons with raw
estimates and weighted estimates. We can use the summary
method to do this.
This method takes the aggregated data frame and produces a data frame of mean and
standard deviation estimates for each group (if in the aggregated frame) or
overall in the population.
fit1$summary(age_estimation) fit1$summary(state_estimation) fit1$summary(popn_estimation)
The standard deviation for the weighted estimate (wtd
) is created by wrapping
the survey
package using the original design specified, and the raw standard
deviation is $\sqrt\frac{pq}{n}$.\
Once we have the aggregated estimates, we can visualize them using the plot
method. This method takes the aggregated data frame and additional_stats option (the default is wtd and raw estimates) as its arguments. It generates a violin plot of the aggregated estimate by the level of a certain variable. When additional statistics are specified (options are raw
, wtd
and mrp
; more than one can be specified as a vector), these are added as points with error bars representing the 95% confidence interval.
In this example, we display the estimated plot for age
and state
and a density plot for the population estimates.
plot_age <- fit1$plot(age_estimation) plot_age
The plot above shows the distribution of people who vote for the Box Party for each age group. The plot implies that people who are aged 36-55 are most likely to vote for the BP.
The next plot displays the probability of voting for the BP by the state without using additional statistics. We learn that the probability of the BP winning the election is high since most people in almost all the states (around 75%) vote for that party. State B is the state with the lowest probability of vote for the BP. The distribution still ranged more than 50%. Note that this violin plot is generated using ggplot2
(Wickham, 2016). Hence, it could be modified the same way ggplot2
's plot is modified. For example, here, we can add the title and change the theme of the plot.
library(ggplot2) plot_state <- fit1$plot(state_estimation, additional_stats = "none") + ggtitle("Probability of voting the BP by state") + theme_bw() plot_state
Lastly, we show the plot of the population estimate, with lines representing the mean weighted and raw estimates. About 70% of people in the Shape World will be likely to vote for the Box Party.
fit1$plot(popn_estimation, additional_stats = c("raw","wtd"))
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. \doi:10.18637/jss.v067.i01.
Blair, G., Cooper, J., Coppock, A., & Humphreys, M. (2019). Declaring and Diagnosing Research Designs. American Political Science Review 113(3): 838-859. URL https://declaredesign.org/declare.pdf.
Bürkner, P.C. (2018). Advanced Bayesian Multilevel Modeling with the R Package brms. The R Journal, 10(1), 395-411. \doi:10.32614/RJ-2018-017.
Chang, W. (2019). R6: Encapsulated Classes with Reference Semantics. R package version 2.4.1. https://CRAN.R-project.org/package=R6.
Gelman, A., & Little, T. C. (1997). Poststratification into many categories using hierarchical logistic regression. Survey Methodology, 23(2), 127–135.
Goodrich, B., Gabry, J., Ali, I., & Brilleman S. (2020). rstanarm: Bayesian applied regression modeling via Stan. R package version 2.21.1 https://mc-stan.org/rstanarm/.
Lumley, T. (2020) "survey: analysis of complex survey samples". R package version 4.0.
Park, D. K., Gelman, A., & Bafumi, J. (2004). Bayesian multilevel estimation with poststratification: State-level estimates from national polls. Political Analysis, 12(4), 375–385.
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Wickham, H., François, R., Henry, L., & Müller, K. (2020). dplyr: A Grammar of Data Manipulation. R package version 1.0.2. https://CRAN.R-project.org/package=dplyr
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.