library(learnr)
library(tutorial.helpers)
library(gt)
library(tidyverse)
library(primer.data)
library(equatiomatic)
library(gtsummary)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local")

# Should "High School Graduate" be "High School"?

x <- ces |> 
  filter(year == 2020) |> 
  select(approval, ideology, education) |> 
  drop_na() |> 
  filter(! ideology %in% "Not Sure") |> 
  mutate(ideology = fct_drop(ideology))

# This model used to include faminc as well. There was nothing wrong with that,
# per se. But this model (the first to be ordinal) is always complex enough
# without that.

# There is no ordinal model which works with tidymodels. So, we don't bother
# with tidymodels. And that is OK!

set.seed(9)
fit_approval <- MASS::polr(approval ~ ideology + education, data = x)

# tidy() takes so long to run that it makes sense to save the basic table so
# that the setup code chunk and the interpretation questions go quicker.

# tidy_approval <- tidy(fit_approval, conf.int = TRUE)
# write_rds(tidy_approval, file = "data/tidy_approval.rds")

tidy_approval <- read_rds("data/tidy_approval.rds")

# tbl_approval <- tbl_regression(fit_approval)
# write_rds(tbl_approval, file = "data/tbl_approval.rds")

tbl_approval <- read_rds("data/tbl_approval.rds")

results <- plot_predictions(fit_approval, 
                            condition = c("ideology"), 
                            draw = FALSE)
WARNING: This tutorial is currently being edited. It is a mess.
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
The power to question is the basis of all human progress. - Indira Gandhi
Load tidyverse.
library(tidyverse)
library(tidyverse)
The data for this tutorial comes from a survey in which people rate how much they approve of the current president.
The Cooperative Election Study (CES) is one of the largest political surveys in the United States.
Load the primer.data package.
library(primer.data)
library(primer.data)
A version of the data from the Cooperative Congressional Election Study is available in the ces
tibble in the primer.data package.
After loading primer.data in your Console, type ?ces
in the Console, and paste in the Description below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
The data has `r scales::comma(nrow(ces))` observations and `r scales::comma(ncol(ces))` variables.
Approval of the president is the broad topic of this tutorial. Given that topic, which variable in ces
should we use as our outcome variable?
question_text(NULL, message = "The `approval` variable shows the approval of the president.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
approval
is an ordered factor variable.
x |> 
  ggplot(aes(approval)) + 
  geom_bar() + 
  labs(title = "2020 Presidential Approval",
       subtitle = "A lot of people strongly disapproved of President Trump",
       x = NULL,
       y = "Number",
       caption = "Source: Cooperative Election Study") + 
  coord_flip()
Let's imagine a brand new variable which does not exist in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome, and if it is "Y," or whatever, it generates another potential outcome.
Describe this imaginary variable and how we might manipulate its value.
question_text(NULL, message = "Consider positive Facebook ads, `ads`, as a treatment. We might expose some people to those ads and not others. Then, we could measure the causal effect of `ads` on presidential approval.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.
Given our (imaginary) treatment variable ads
, how many potential outcomes are there for each person? Explain why.
question_text(NULL, message = "There are 2 potential outcomes because the treatment variable `ads` takes on 2 possible values: 0 (if the person did not see the ads) or 1 (if the person did see the ads).", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. All of this stuff is a conceptual framework we apply to the data. It is never inherent in the data itself.
In a few sentences, specify two different values for the treatment variable, for a single unit, and then guess at the potential outcomes which would result, and then calculate the causal effect for that unit given those guesses.
question_text(NULL,
    message = 'For a given person, assume that the value of the treatment variable might be 1 or 0. If the person gets 1 (meaning they see the Facebook ads), then `approval` would be "Strongly Approve". If the same person gets 0, then `approval` would be "Neither Approve nor Disapprove". The causal effect on the outcome of a treatment of 1 versus 0 is "Strongly Approve" minus "Neither Approve nor Disapprove" --- i.e., the difference between two potential outcomes. However, unlike with numeric outcomes, there is no default metric on which to measure this causal effect, at least without further assumptions.',
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)
Recall the definition of a causal effect: the difference between two potential outcomes. Of course, you can't just say that the causal effect is 10. The exact value depends on which potential outcome comes first in the subtraction and which comes second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.
Any causal claim means exploring the within-row difference between two potential outcomes. We don't need to look at any other rows to have that conversation.
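To make the within-row idea concrete, here is a minimal sketch in base R. The potential outcome values are invented for illustration; real data would only ever reveal one of the two outcome columns for any given row.

```r
# Hypothetical potential outcomes for three units. Real data never
# shows both columns for the same unit (the fundamental problem of
# causal inference).
po <- data.frame(
  unit      = c("A", "B", "C"),
  y_treated = c(10, 7, 9),   # potential outcome under treatment
  y_control = c(8, 7, 6)     # potential outcome under control
)

# The causal effect for each unit is the within-row difference,
# defined as treatment minus control.
po$causal_effect <- po$y_treated - po$y_control
po
```

Note that no other row is consulted when computing a unit's causal effect.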
Let's consider a predictive model. Which variable in ces
do you think might have an important connection to approval
?
question_text(NULL, message = "The ideology of an individual could be connected to their approval of the president. If the president is a Republican, we might expect more conservative individuals to have stronger measured approval.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
The key point is that, with a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be treatment variables. We assume that all covariates are "fixed."
Write a few sentences which specify two different groups of people with different values for ideology
. Explain how the outcome variable might differ between these two groups.
question_text(NULL, message = "Some people might have a value for ideology of Very Liberal. Others might have a value of Very Conservative. Those two groups might, on average, have different values for presidential approval, our outcome variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.
Write a predictive question which connects the outcome variable approval
to covariates of interest.
question_text(NULL, message = "What is the average difference in presidential approval between Very Liberal and Very Conservative people?", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
What is a Quantity of Interest which might help us to explore the answer to our question?
question_text(NULL, message = "One quantity of interest might be the difference in Strong Approval for the President between Very Conservative and Very Liberal people.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.
The doorstep to the temple of wisdom is a knowledge of our own ignorance. - Benjamin Franklin
Our question:
What is the relationship between the approval of the president and an individual's ideology?
In your own words, describe the key components of Wisdom when working on a data science problem.
question_text(NULL, message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The central problem for Wisdom is: Can we use survey data from the Cooperative Election Study to understand presidential approval and ideology for all adults in the US?
Define a Preceptor Table.
question_text(NULL, message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantities of interest.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
question_text(NULL, message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This problem is predictive, so we will want to compare the outcome, which is presidential approval, between two different groups of units, presumably groups which differ in terms of ideology.
Create a Github repo called cumulative
. Make sure to click the "Add a README file" check box.
Connect the cumulative
Github repo to an R project on your computer. Name the R project cumulative
also.
Select File -> New File -> Quarto Document ...
. Provide a title ("Cumulative"
) and an author (you). Render the document and save it as analysis.qmd
.
Edit the .gitignore
by adding *Rproj
. Save and commit this in the Git tab. Push the commit.
In the Console, run:
show_file(".gitignore")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Remove everything below the YAML header from analysis.qmd
and render the file. Command/Ctrl + Shift + K
renders the file; it also saves the file automatically.
What are the units for this problem?
question_text(NULL, message = "Adults in the United States of America.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.
What is/are the outcome/outcomes for this problem?
question_text(NULL, message = "approval has the approval rating of the president.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Again, approval
has five values: "Strongly Approve", "Approve / Somewhat Approve", "Neither Approve Nor Disapprove", "Disapprove / Somewhat Disapprove", and "Strongly Disapprove". These five values give us five numbers. If there were only two values, approve or not approve, then there would only be two numbers we would care about.
Regardless, the central lesson is always the same: You can never look at your data too much.
What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?
question_text(NULL, message = "Sex and race could have a relationship with the approval of the president.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
There are variables outside the data we have that could have a meaningful relationship with approval
. Whether or not the individual comes from an immigrant family could have a relationship with their approval of the president.
What are the treatments, if any, for this problem?
question_text(NULL, message = "There are no treatments.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated, thereby, creating two or more different potential outcomes for each unit.
What moment in time does the Preceptor Table refer to?
question_text(NULL, message = "2020", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Each president will have different approval rates. We will be focusing on only the year 2020.
Define causal effect.
question_text(NULL, message = "A causal effect is the difference between two potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This is a predictive model, but not because of anything intrinsic to the data. We make a predictive model because we are asking a predictive question. We want to compare approval between groups of people who differ in their ideology. We don't want to estimate the causal effect on approval of changes, within a single individual, in ideology.
What is the fundamental problem of causal inference?
question_text(NULL, message = "The fundamental problem of causal inference is that we can only observe one potential outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.
How does the motto "No causal inference without manipulation." apply in this problem?
question_text(NULL, message = "Since the data is only a survey, without any manipulation, we cannot make any causal inferences.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The motto does not apply because this is a predictive, not causal, model.
Describe in words the Preceptor Table for this problem.
question_text(NULL, message = "The Preceptor Table has 3 columns. There is a column for the ID and one for the outcome. There will be a column for ideology. Each row represents one individual.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table for this problem looks something like this:
#| echo: false
tibble(ID = c("1", "2", "...", "10", "11", "...", "N"),
       approval = c("Neither Approve Nor Disapprove", 
                    "Disapprove / Somewhat Disapprove", 
                    "...", 
                    "Strongly Approve", 
                    "Never Heard / Not Sure", 
                    "...", 
                    "Neither Approve Nor Disapprove"),
       ideology = c("Moderate", "Very Liberal", "...", "Moderate", 
                    "Not Sure", "...", "Conservative")) |>
  gt() |>
  tab_header(title = "Preceptor Table") |>
  cols_label(ID = md("ID"),
             approval = md("Approval of the President"),
             ideology = md("Political Ideology")) |>
  tab_style(cell_borders(sides = "right"),
            locations = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Covariates", columns = c(ideology)) |>
  tab_spanner(label = "Outcomes", columns = c(approval))
In analysis.qmd
, delete everything past the YAML heading and load the tidyverse and the primer.data packages in a new code chunk. Label it the setup chunk by adding #| label: setup
. Render the file.
Notice that the file does not look good because it shows both code and messages. To take care of this, add #| message: false
to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all echo from the whole file:
execute: 
  echo: false
In the Console, run:
show_file("analysis.qmd", start = -5)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Render again. Everything looks nice because we have added code to make the file look better and more professional.
Write one sentence describing the data you have to answer your question.
question_text(NULL, message = "The Cooperative Congressional Election Study is one of the largest political surveys in the United States and was started in 2006. The data has how much people approve of the president as well as other useful variables such as ideology.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Of course, when doing the analysis, you don’t know when you start what you will be using at the end. Data analysis is a circular process. We mess with the data. We do some modeling. We mess with the data on the basis of what we learned from the models. With this new data, we do some more modeling. And so on.
Run glimpse()
on ces
.
glimpse(...)
glimpse(ces)
glimpse()
gives us a look at the raw data contained within the ces
data set. At the very top of the output, we can see the number of rows and columns, or observations and variables, respectively. We see that there are 617,455 observations, with each row corresponding to a unique respondent.
Pipe ces
to filter()
with year
set to 2020
inside of it using ==
.
ces |> filter(...)
ces |> filter(year == 2020)
We will be focusing on the year 2020 so that we have fewer rows to work with.
Continue the pipe to select()
and have approval
, education
, faminc
, and ideology
inside.
Copy previous code
... |> select(...)
ces |> filter(year == 2020) |> select(approval, education, faminc, ideology)
We are selecting only the columns we will use in the future. Of course, we could choose any of the other variables, but we will be focusing on only these four for this tutorial.
Finish the pipe with drop_na()
. Add slice_sample(n = 2000)
at the very end to keep a random sample of only 2,000 observations from the data.
Copy previous code
... |> drop_na() |> slice_sample(...)
ces |> 
  filter(year == 2020) |> 
  select(approval, education, faminc, ideology) |> 
  drop_na() |> 
  slice_sample(n = 2000)
Behind the scenes of this tutorial, we have created an object called x
using the previous code.
To see the object that was created in the background, run x
.
x
There are now only 54,839 rows, which is significantly fewer than in the whole ces data. Having more manageable data will make it easier for us to create models.
In analysis.qmd
, add a new code chunk. Copy the code from two exercises ago which prepares the data and paste it in the code chunk. Set the code to an object called x
.
Render the file. In the Console, run:
show_file("analysis.qmd", start = -5)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
The CES is a 50,000+ person national stratified sample survey administered by YouGov. Half of the questionnaire consists of Common Content asked of all 50,000+ people, and half of the questionnaire consists of Team Content designed by each individual participating team and asked of a subset of 1,000 people.
Pipe x
to summary()
.
summary(...)
summary(x)
We can see some information about our data. Most people have a family income between 20k and 30k. Most people are moderate in terms of political ideology. Most people strongly disapprove of the president; the second biggest group strongly approves. Most people have high school as their highest level of education.
Pipe x
to ggplot()
with x
set to ideology
and fill
set to approval
. Then add a layer of geom_bar
.
x |> ggplot(aes(...)) + geom_bar()
x |> ggplot(aes(x = ideology, fill = approval)) + geom_bar()
This graph shows that most liberals strongly disapproved of the president in 2020 and most conservatives strongly approved.
Using the previous code, add labs()
.
Copy previous code
... + labs(...)
x |> 
  ggplot(aes(x = ideology, fill = approval)) + 
  geom_bar() + 
  labs(title = "Relationship Between President Approval and Political Ideology in 2020",
       subtitle = "Most people strongly disapprove.",
       x = "Political Ideology",
       y = "Count",
       fill = "Approval of the President",
       caption = "Data from CES.")
The graph should look like this:
#| echo: false
x |> 
  ggplot(aes(x = ideology, fill = approval)) + 
  geom_bar() + 
  labs(title = "Relationship Between President Approval and Political Ideology",
       subtitle = "Most people strongly disapprove.",
       x = "Political Ideology",
       y = "Count",
       fill = "Approval of the President",
       caption = "Data from CES.")
In analysis.qmd
, add the code from the previous exercise to include the graph in the file.
Render the file. In the Console, run:
show_file("analysis.qmd", start = -5)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This graph is useful but we still need to create an actual model with this data in order to answer our questions.
In your own words, define "validity" as we use the term.
question_text(NULL, message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.
Provide one reason why the assumption of validity might not hold for the outcome variable: approval
.
question_text(NULL, message = "Both the columns in the data we have and the Preceptor Table have a variable called approval, but the Preceptor Table has the true approval of the president whereas the data only has the self-reported approval of the president. People could lie about their approval of the president.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
These two columns have the same name, but they might not be similar enough because a person's self-reported approval of the president might differ from their true approval of the president.
Provide one reason why the assumption of validity might not hold for the covariate: ideology
.
question_text(NULL, message = "There is a column in the Preceptor Table for ideology, and a column in our data for ideology. People might misidentify themselves if they do not know the difference between the ideologies or they have a different definition for the ideologies. This might create a difference between their true identity and their self-claimed identity.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Again, validity is all about columns. We have to make sure that the columns in our data match the columns in the Preceptor Table closely enough. Both columns concern ideology, but the key difference is between a person's true political identity and their self-claimed political identity.
Summarize the state of your work so far in one sentence. Make reference to the data you have and to the specific question you are trying to answer.
question_text(NULL, message = "Using the CES, which is one of the largest political surveys in the United States, we seek to make a predictive model which could help us see a relationship between approval of the president and an individual's political ideology.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Edit your answer as you see fit, but do not copy/paste our answer exactly. Add this summary to analysis.qmd
, Command/Ctrl + Shift + K
, and then commit/push.
The arc of the moral universe is long, but it bends toward justice. - Theodore Parker
In your own words, name the four key components of Justice for working on a data science problem.
question_text(NULL, message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.
In your own words, define a Population Table.
question_text(NULL, message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units since the Preceptor Table and the data are, themselves, often far apart from each other.
In your own words, define the assumption of "stability" when employed in the context of data science.
question_text(NULL, message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability is all about time. Is the relationship among the columns in the Population Table stable over time? In particular, is the relationship --- which is another way of saying "mathematical formula" --- at the time the data was gathered the same as the relationship at the (generally later) time referenced by the Preceptor Table?
Provide one reason why the assumption of stability might not be true in this case.
question_text(NULL, message = "We are concerned about the year 2020 and the data we have has information for 2020, so there is no danger to stability. The time the data was gathered is the same as the time referred to in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If we wanted to make a predictive model for this year, then we might have a major stability problem because the relationship between presidential approval and ideology depends a great deal on the ideology of the president whose approval we are measuring.
We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table
. In your own words, define the assumption of "representativeness" when employed in the context of data science.
question_text(NULL, message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the data and the other rows. The second is between the other rows and the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.
We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "A large portion of the CES respondents are YouGov panelists. This group might not be representative of the whole population because they might be more politically active and care more about politics than other people do.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The reason that representativeness is important is because, when it is violated, the estimates for the model parameters might be biased.
We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "Because the Preceptor Table includes the entire population, there is no problem with representativeness in using the population to draw inferences about the Preceptor Table. They are one and the same.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Since the units in the Preceptor Table and the units in the population are both adults in the United States of America, they are representative.
In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.
question_text(NULL, message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true.
Summarize the state of your work so far in two or three sentences. Make reference to the data you have and to the question you are trying to answer. Feel free to copy from your answer at the end of the Wisdom Section. Mention one specific problem which casts doubt on your approach.
question_text(NULL, message = "Using the CES, which is one of the largest political surveys in the United States, we seek to understand the relationship between presidential approval and political ideology in 2020. One concern is that survey respondents might be systematically different from other Americans.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Edit the summary paragraph in analysis.qmd
as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
, and then commit/push.
Courage is going from failure to failure without losing enthusiasm. - Winston Churchill
Our outcome variable is approval
, an ordered factor with 5 levels. We seek to understand the relationship between it and ideology
, a categorical variable with 6 levels, along with some other variables.
In your own words, describe the components of the virtue of Courage for analyzing data.
question_text(NULL, message = "Courage starts with math, explores models, and then creates the data generating mechanism.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.
Load the tidymodels package.
library(...)
library(tidymodels)
Because approval
is an ordinal variable, we assume that an individual's approval is produced from a Cumulative distribution.
$$ approval_i \sim Cumulative(\rho_{strongly\ positive}, \rho_{positive}, \rho_{neutral}, \rho_{negative}, \rho_{strongly\ negative}) $$

### Exercise 3
Load the gtsummary package.
library(...)
library(gtsummary)
Because the outcome variable has a Cumulative distribution, the link function is logit. That is:
$$\rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots)}}$$

### Exercise 4
Load the equatiomatic package.
library(...)
library(equatiomatic)
This is the basic mathematical structure of our model:
$$
\begin{aligned}
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Strongly\ Disapprove} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Strongly\ Disapprove} ) } \right] &= \alpha_{1} + \beta_{1}(\operatorname{ideology}_{\operatorname{Liberal}}) + \beta_{2}(\operatorname{ideology}_{\operatorname{Moderate}}) + \beta_{3}(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + \beta_{4}(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) + \beta_{5}(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) + \beta_{6}(\operatorname{education}_{\operatorname{Some\ College}}) + \beta_{7}(\operatorname{education}_{\operatorname{2-Year}}) \\
&\quad + \beta_{8}(\operatorname{education}_{\operatorname{4-Year}}) + \beta_{9}(\operatorname{education}_{\operatorname{Post-Grad}}) \\
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Disapprove\ /\ Somewhat\ Disapprove} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Disapprove\ /\ Somewhat\ Disapprove} ) } \right] &= \alpha_{2} + \beta_{1}(\operatorname{ideology}_{\operatorname{Liberal}}) + \beta_{2}(\operatorname{ideology}_{\operatorname{Moderate}}) + \beta_{3}(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + \beta_{4}(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) + \beta_{5}(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) + \beta_{6}(\operatorname{education}_{\operatorname{Some\ College}}) + \beta_{7}(\operatorname{education}_{\operatorname{2-Year}}) \\
&\quad + \beta_{8}(\operatorname{education}_{\operatorname{4-Year}}) + \beta_{9}(\operatorname{education}_{\operatorname{Post-Grad}}) \\
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Neither\ Approve\ nor\ Disapprove} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Neither\ Approve\ nor\ Disapprove} ) } \right] &= \alpha_{3} + \beta_{1}(\operatorname{ideology}_{\operatorname{Liberal}}) + \beta_{2}(\operatorname{ideology}_{\operatorname{Moderate}}) + \beta_{3}(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + \beta_{4}(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) + \beta_{5}(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) + \beta_{6}(\operatorname{education}_{\operatorname{Some\ College}}) + \beta_{7}(\operatorname{education}_{\operatorname{2-Year}}) \\
&\quad + \beta_{8}(\operatorname{education}_{\operatorname{4-Year}}) + \beta_{9}(\operatorname{education}_{\operatorname{Post-Grad}}) \\
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Approve\ /\ Somewhat\ Approve} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Approve\ /\ Somewhat\ Approve} ) } \right] &= \alpha_{4} + \beta_{1}(\operatorname{ideology}_{\operatorname{Liberal}}) + \beta_{2}(\operatorname{ideology}_{\operatorname{Moderate}}) + \beta_{3}(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + \beta_{4}(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) + \beta_{5}(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) + \beta_{6}(\operatorname{education}_{\operatorname{Some\ College}}) + \beta_{7}(\operatorname{education}_{\operatorname{2-Year}}) \\
&\quad + \beta_{8}(\operatorname{education}_{\operatorname{4-Year}}) + \beta_{9}(\operatorname{education}_{\operatorname{Post-Grad}})
\end{aligned}
$$

Of course, in a normal data science project, we would explore a variety of different combinations of independent variables before selecting a final set. Just pretend that we already did that.
Add library(tidymodels)
, library(gtsummary)
, and library(equatiomatic)
to the setup
code chunk in your QMD. Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "tidymodels|gtsummary|equatiomatic")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Of course, our model must make use of the variables we actually have. Consider:
$$\rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 ideology_{Liberal} + \beta_2 ideology_{Moderate} + \dots)}}$$

Recall that a categorical variable (whether character or factor) like sex
is turned into a $0/1$ "dummy" variable which is then re-named something like $sex_{Male}$. After all, we can't have words --- like "Male" or "Female" --- in a mathematical formula, hence the need for dummy variables.
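As a quick illustration of the logistic form above, here is a minimal sketch. The function name inv_logit is our own choice for this sketch, not from any package:

```r
# Inverse logit (logistic) function: maps any real-valued linear
# predictor into a probability strictly between 0 and 1.
inv_logit <- function(x) 1 / (1 + exp(-x))

inv_logit(0)            # a linear predictor of 0 corresponds to probability 0.5
inv_logit(c(-4, 0, 4))  # always strictly between 0 and 1
```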
Create a model using polr()
from the MASS package. We recommend that you do not load the MASS package, either here or ever. The reason is that MASS contains functions filter()
and select()
which are not the same as the two functions, with the same names, from the dplyr package, which is a part of the Tidyverse. This can cause all sorts of difficult-to-diagnose errors. Instead, we use the double colon notation: MASS::polr()
. This allows us to access the polr()
function without loading the MASS library, although the library must be installed on your computer.
Your arguments to MASS::polr()
should be formula = approval ~ ideology + education
and data = x
.
MASS::polr(formula = approval ~ ... + education, ... = x)
The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy $0/1$ variables. The presence of an intercept in most models means that we can't have $N$ categories. The "missing" category is incorporated into the intercept. If race
has seven values --- "White", "Black", "Hispanic", "Asian", "Native American", "Mixed", and "Middle Eastern" --- then the model creates six 0/1 dummy variables, giving them names like $race_{Black}$ and $race_{Hispanic}$. The results for the first category are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied.
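You can see this expansion directly with model.matrix(). The toy data below is made up; with factor levels sorted alphabetically, the first level becomes the reference:

```r
# A factor with four levels produces an intercept plus three 0/1 dummies.
df <- data.frame(race = factor(c("White", "Black", "Hispanic", "Asian")))
mm <- model.matrix(~ race, data = df)
mm  # columns: (Intercept) plus one dummy per non-reference level
```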
Behind the scenes, we have used polr()
to create an object named fit_approval
. It was fit with the same formula and data as the model you created in the previous exercise.
Type fit_approval
and hit "Run Code." This generates the same results as using print(fit_approval)
.
fit_approval
fit_approval
In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.
Create a new code chunk in your QMD. Add two code chunk options: label: model
and cache: true
. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_approval
.
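Assuming you are copying the model code from above, the chunk might look something like this (a sketch; adjust it to your own QMD):

```r
#| label: model
#| cache: true

fit_approval <- MASS::polr(formula = approval ~ ideology + education,
                           data = x)
```

This is a Quarto chunk fragment, so it depends on the `x` object created earlier in your QMD.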
Command/Ctrl + Shift + K
. It may take some time to render your QMD, depending on how complex your model is. But, by including cache: true
you cause Quarto to cache the results of the chunk. The next time you render your QMD, as long as you have not changed the code, Quarto will just load up the saved fitted object.
To confirm, Command/Ctrl + Shift + K
again. It should be quick.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 8)
Add *_cache
to .gitignore
file. Commit and push. Cached objects are often large. They don't belong on GitHub.
Create another code chunk in your QMD. Add the chunk option: label: math
. In that code chunk, add something like the below. You may find it useful to add the coef_digits
argument to show fewer significant digits after the decimal.
extract_eq(fit_approval, intercept = "beta", use_coefs = TRUE)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "extract")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
When you render your document, this formula will appear.
$$
\begin{aligned}
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Strongly\ Disapprove} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Strongly\ Disapprove} ) } \right] &= 2.26 + 0.61(\operatorname{ideology}_{\operatorname{Liberal}}) + 2.4(\operatorname{ideology}_{\operatorname{Moderate}}) + 4.46(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + 5.45(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) - 0.1(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) - 0.31(\operatorname{education}_{\operatorname{Some\ College}}) \\
&\quad - 0.32(\operatorname{education}_{\operatorname{2-Year}}) - 0.52(\operatorname{education}_{\operatorname{4-Year}}) - 0.73(\operatorname{education}_{\operatorname{Post-Grad}}) \\
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Disapprove\ /\ Somewhat\ Disapprove} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Disapprove\ /\ Somewhat\ Disapprove} ) } \right] &= 2.73 + 0.61(\operatorname{ideology}_{\operatorname{Liberal}}) + 2.4(\operatorname{ideology}_{\operatorname{Moderate}}) + 4.46(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + 5.45(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) - 0.1(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) - 0.31(\operatorname{education}_{\operatorname{Some\ College}}) \\
&\quad - 0.32(\operatorname{education}_{\operatorname{2-Year}}) - 0.52(\operatorname{education}_{\operatorname{4-Year}}) - 0.73(\operatorname{education}_{\operatorname{Post-Grad}}) \\
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Neither\ Approve\ nor\ Disapprove} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Neither\ Approve\ nor\ Disapprove} ) } \right] &= 2.82 + 0.61(\operatorname{ideology}_{\operatorname{Liberal}}) + 2.4(\operatorname{ideology}_{\operatorname{Moderate}}) + 4.46(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + 5.45(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) - 0.1(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) - 0.31(\operatorname{education}_{\operatorname{Some\ College}}) \\
&\quad - 0.32(\operatorname{education}_{\operatorname{2-Year}}) - 0.52(\operatorname{education}_{\operatorname{4-Year}}) - 0.73(\operatorname{education}_{\operatorname{Post-Grad}}) \\
\log\left[ \frac{ P( \operatorname{approval} \leq \operatorname{Approve\ /\ Somewhat\ Approve} ) }{ 1 - P( \operatorname{approval} \leq \operatorname{Approve\ /\ Somewhat\ Approve} ) } \right] &= 3.93 + 0.61(\operatorname{ideology}_{\operatorname{Liberal}}) + 2.4(\operatorname{ideology}_{\operatorname{Moderate}}) + 4.46(\operatorname{ideology}_{\operatorname{Conservative}}) \\
&\quad + 5.45(\operatorname{ideology}_{\operatorname{Very\ Conservative}}) - 0.1(\operatorname{education}_{\operatorname{High\ School\ Graduate}}) - 0.31(\operatorname{education}_{\operatorname{Some\ College}}) \\
&\quad - 0.32(\operatorname{education}_{\operatorname{2-Year}}) - 0.52(\operatorname{education}_{\operatorname{4-Year}}) - 0.73(\operatorname{education}_{\operatorname{Post-Grad}})
\end{aligned}
$$
This is our data generating mechanism.
Run tidy()
on fit_approval
with the argument conf.int
set equal to TRUE
. This returns 95% confidence intervals for all the parameters in our model. (This might take 30 seconds or so.)
tidy(..., conf.int = ...)
tidy()
is part of the broom package, used to summarize information from a wide variety of models.
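To see what tidy() returns in a simpler setting, here is a sketch using a basic linear model on the built-in mtcars data (nothing to do with our CES model):

```r
library(broom)

# tidy() turns a fitted model into a tibble: one row per parameter,
# with estimates and, if requested, confidence intervals.
fit <- lm(mpg ~ wt, data = mtcars)
out <- tidy(fit, conf.int = TRUE)
out
```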
tidy_approval |> select(term, estimate, conf.low, conf.high) |> filter(str_detect(term, "^education"))
Write a sentence interpreting the -0.1
estimate for educationHigh School Graduate
.
question_text(NULL, message = "When comparing (only) high school graduates with people who did not graduate from high school, high school graduates have a lower latent approval value, meaning that they are less likely to approve of the president.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Positive coefficients suggest that higher values of the predictor increase the likelihood of higher approval categories.
Negative coefficients suggest that higher values of the predictor decrease the likelihood of higher approval categories.
Larger coefficients (either positive or negative) indicate a stronger relationship between the predictor and the latent approval score.
tidy_approval |> select(term, estimate, conf.low, conf.high) |> filter(str_detect(term, "^education"))
Write a sentence interpreting the -0.43
to -0.19
confidence interval for educationSome College
.
question_text(NULL, message = "We do not know the true value for the coefficient for educationSome College, but we can be 95% confident that it lies somewhere between -0.43 and -0.19.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If the confidence interval includes zero, the effect is less likely to be significant.
The reference level is always the first value of each factor. For example, the reference level for education is No HS, and for ideology it is Very Liberal. These reference levels are absorbed into the intercepts, which is why they do not show up in tidy(fit_approval).
tidy_approval |> select(term, estimate, conf.low, conf.high) |> filter(str_detect(term, "^ideology"))
Write a sentence interpreting the 5.45
estimate for ideologyVery Conservative
.
question_text(NULL, message = "If we compare people who are very conservative with people who are very liberal (the base category which is included in the intercept), the very conservative people approve of the president more. Because of the complexity of the mathematics of the model, 5.45 does not have a natural interpretation in terms of its magnitude.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Strongly Disapprove|Disapprove / Somewhat Disapprove
, Disapprove / Somewhat Disapprove|Neither Approve nor Disapprove
, and so on are intercepts that are used as cut points or thresholds that separate the different levels of the ordinal outcome variable (approval
).
In ordinal regression, these intercepts define the points on the latent variable scale at which the response variable transitions from one category to the next. The probability of falling into a specific category is determined by the relationship between these intercepts and the predictors.
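Here is a sketch of that mechanism with invented numbers. polr() models the logit of P(Y <= k) as a cutpoint minus the linear predictor, so the cutpoints plus one person's linear predictor determine all five category probabilities:

```r
# Invented cutpoints (must be increasing) and an invented linear
# predictor for one hypothetical person.
inv_logit <- function(x) 1 / (1 + exp(-x))
zeta <- c(-1, 0, 1, 2.5)
eta  <- 0.8

# Cumulative probabilities P(approval <= k) at each cutpoint ...
cum <- inv_logit(zeta - eta)

# ... differenced to give the probability of each of the five categories.
probs <- diff(c(0, cum, 1))
probs
sum(probs)  # the five probabilities always sum to 1
```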
For interactive use, tidy()
is very handy. But, for presenting our results, we should use a presentation package like gtsummary, which includes handy functions like tbl_regression()
.
Run tbl_regression(fit_approval)
.
tbl_regression(...)
tbl_regression(fit_approval)
See this tutorial for a variety of options for customizing your table.
Create a new code chunk in your QMD. Add two code chunk options: label: table
and cache: true. Add this code to the code chunk. Remember that the fit_approval object has already been created behind the scenes.
tbl_regression(fit_approval)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "tbl_regression")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add a sentence to your project summary.
Explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."
Recall the beginning of our version of the summary:
Using the CES, which is one of the largest political surveys in the United States, we seek to understand the relationship between presidential approval and political ideology in 2020. One concern is that survey respondents might be systematically different from other Americans.
question_text(NULL, message = "Using the CES, which is one of the largest political surveys in the United States, we seek to understand the relationship between presidential approval and political ideology in 2020. One concern is that survey respondents might be systematically different from other Americans. We are using a cumulative model for ordinal regression.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to the summary paragraph portion of your QMD. Command/Ctrl + Shift + K
, and then commit/push.
Temperance is the greatest of all virtues. It subdues every passion and emotion, and almost creates a Heaven upon Earth. - Joseph Smith Jr.
In your own words, describe the use of Temperance in data science.
question_text(NULL, message = "Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest.
Load the marginaleffects package.
library(...)
library(marginaleffects)
We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.
What is the general topic we are investigating? What is the specific question we are trying to answer?
question_text(NULL, message = "The general question: What is the relationship between the approval of the president and an individual's ideology? The specific question: What is the average difference in presidential approval between Very Liberal and Very Conservative people?", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic broadly.
The plot_predictions()
function in the marginaleffects package draws a picture of what your model predicts under different scenarios. Imagine you are working for the Trump campaign and want to estimate the odds that Trump beats Kamala Harris in swing states. You would want to predict approval for particular kinds of voters, say a conservative with a post-graduate education. That is the use case for plot_predictions().
Run the code that uses plot_predictions()
with condition = c("education") and draw = FALSE. Setting draw = FALSE returns the underlying predictions as a data frame instead of a plot.
plot_predictions(fit_approval, condition = c("education"), draw = FALSE)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE)
Think of this table like a "prediction guide" that uses a model to guess how likely different types of people are to choose certain approval categories. Each row shows:
Who they are: Like a “Moderate” person with a certain education level (e.g., "High School Graduate").
What they pick: A category like “Strongly Disapprove” or “Strongly Approve.”
How likely it is (estimate): The probability that a person with that background would pick that category.
If the estimate is 0.20, that’s like saying: “Out of 100 people with these traits, about 20 might pick this category.” The table just shows a bunch of these predictions for different combinations of background traits. It’s a way to see how changing someone’s education or ideology might change the chances they end up in each approval group.
Let's select the columns we care about: group, education, ideology, and estimate. Enter this code into the exercise code block and hit "Run Code."
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate)
plot_predictions(..., condition = c("..."), draw = ...) |> select(..., ..., ..., ...)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate)
The model predicts how individuals with a Moderate political ideology respond to approval categories based on their education level. Key features include five approval groups: Strongly Disapprove, Disapprove/Somewhat Disapprove, Neither Approve nor Disapprove, Approve/Somewhat Approve, and Strongly Approve. Estimates indicate high probabilities of Strongly Disapproving, especially among those with Post-Grad education (64.3%) and lower disapproval among those with No HS (46.5%). Approval rates are consistently low across all education levels, highlighting education's significant impact on approval tendencies.
Let's take a look at a single row and interpret it. Use the slice()
function to take the first row.
plot_predictions(..., condition = c("..."), draw = ...) |> select(..., ..., ..., ...) |> slice(...)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate) |> slice(1)
This row represents someone who is a Moderate with a 2-Year education level. Their estimated probability of strongly disapproving is 0.546.
Let's try the same thing but with a different row. Let's slice()
for the 5th row in the dataset.
... |> slice(...)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate) |> slice(5)
If you are a Moderate with a post-grad education, the probability that you would strongly disapprove is 0.642. Are you noticing any trends? As we saw with the last row, as education level increases, the probability of strongly disapproving also increases.
Let's look at more rows. Use slice()
for rows 21:25.
plot_predictions(..., condition = c("..."), draw = ...) |> select(..., ..., ..., ...) |> slice(...)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate) |> slice(21:25)
Notice how the probability that an individual is in the "Approve / Somewhat Approve" group increases as education level decreases. Independently, take a look at how the group column influences the probability. Recall that the probability that a Moderate with a 2-Year education strongly disapproves is significantly larger.
plot_predictions(..., condition = c("..."), draw = ...) |> select(..., ..., ..., ...) |> slice(...)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate) |> slice(seq(1, 25, by = 5))
Run this code:
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate) |> group_by(education) |> slice_head(n = 5)
plot_predictions(fit_approval, condition = c("education"), draw = FALSE) |> select(group, education, ideology, estimate) |> group_by(education) |> slice_head(n = 5)
Notice how the probabilities for each education level add up to 1. They must add up to 100% because each person must give exactly one of the five allowed answers.
Now that the dataset
is ready, begin constructing a plot. Set x to ideology
, y to estimate
, and fill to group
in the aes() function. This will lay the foundation for the visualization.
ggplot(temp, aes(x = ..., y = ..., fill = ...))
Now, add a bar plot to the graph using geom_bar()
. Set stat
to identity so the bars represent the actual values in our dataset.
ggplot(temp, aes(x = ideology, y = estimate, fill = group)) + geom_bar(stat = "...")
Continue building the plot. To make the graph more visually appealing, use scale_fill_brewer() to apply a color palette. Choose a palette, such as "RdBu", to represent the different approval levels. This helps the reader differentiate the predictions for the different groups.
ggplot(temp, aes(x = ..., y = ..., fill = ...)) + geom_bar(stat = "...") + scale_fill_brewer(... = "...")
Add the estimated approval ratings to the bars. Use geom_text()
and set label to display the estimate values with two decimal places.
ggplot(temp, aes(x = ideology, y = estimate, fill = group)) + geom_bar(stat = "identity") + geom_text(aes(label = sprintf("%.2f", estimate)), position = position_stack(vjust = 0.5), size = 3)
For better readability of the x-axis labels, flip the coordinates of the plot using coord_flip()
.
ggplot(temp, aes(x = ideology, y = estimate, fill = group)) + geom_bar(stat = "identity") + geom_text(aes(label = sprintf("%.2f", estimate)), position = position_stack(vjust = 0.5), size = 3) + coord_flip()
Next, add a title
, x-axis label
, y-axis label
, and a legend title
. Use the labs()
function to label your plot appropriately.
ggplot(temp, aes(x = ideology, y = estimate, fill = group)) + geom_bar(stat = "identity") + geom_text(aes(label = sprintf("%.2f", estimate)), position = position_stack(vjust = 0.5), size = 3) + coord_flip() + labs(title = "...", x = "...", y = "...", fill = "Approval Level")
Improve the readability of the axis text by rotating the x-axis labels using element_text() inside the theme() function. Set the angle to 45 degrees. Then clean up the visual by applying a minimal theme: theme_minimal() reduces clutter and gives the graph a sleek look.
ggplot(temp, aes(x = ideology, y = estimate, fill = group)) + geom_bar(stat = "identity") + geom_text(aes(label = sprintf("%.2f", estimate)), position = position_stack(vjust = 0.5), size = 3) + coord_flip() + labs(title = "Distribution of Approval Levels by Ideology", x = "Ideology", y = "Proportion", fill = "Approval Level") + theme_minimal() + theme(axis.text.x = element_text(angle = ... , hjust = ...))
You are done! We now have a ggplot graphic that shows the distribution of approval levels by ideology, as predicted by plot_predictions().
Your code should look like this:
temp <- plot_predictions(fit_approval, condition = c("education"), draw = FALSE)

temp$group <- factor(temp$group,
                     levels = c("Strongly Disapprove",
                                "Disapprove / Somewhat Disapprove",
                                "Neither Approve nor Disapprove",
                                "Approve / Somewhat Approve",
                                "Strongly Approve"))

ggplot(temp, aes(x = ideology, y = estimate, fill = group)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.2f", estimate)),
            position = position_stack(vjust = 0.5), size = 3) +
  scale_fill_brewer(palette = "RdBu") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Approval Levels by Ideology",
       x = "Ideology",
       y = "Proportion",
       fill = "Approval Level") +
  coord_flip()
Create a new code chunk in analysis.qmd
. Label it with label: plot
. Copy/paste the code which creates your graphic. Don't forget that, at the top of this chunk, you must include code which creates the temp
object.
Command/Ctrl + Shift + K
to ensure that it all works as intended.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
We could have made many graphs with this data to highlight different things. It is important to keep only information that is useful and related to the outcome variable.
Write a paragraph which summarizes the project in your own words. The first few sentences are the same as what you had at the end of the Courage section. But, since your question may have evolved, you should feel free to change those sentences. Add at least one sentence which describes at least one quantity of interest (QoI) --- presumably one that answers your question -- and which provides a measure of uncertainty about that QoI.
question_text(NULL, message = "Using the CES, which is one of the largest political surveys in the United States, we seek to understand the relationship between presidential approval and political ideology in 2020. One concern is that survey respondents might be systematically different from other Americans. We are using a cumulative model for ordinal regression. People who are very conservative are more likely to approve the president higher by about 5.6 compared to people who are very liberal. We are 95% confident that the true value is between 4.9 and 6.2.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Edit the summary paragraph in analysis.qmd
as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
.
Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.
question_text(NULL, message = "Our estimates and confidence intervals might be wrong because the data is not perfect. We have noticed some potential problems, such as with validity: people could be lying about their approval of the president or misidentifying their ideology. These concerns could make our findings inaccurate. The true estimate and confidence interval might be different; the interval would likely be wider because of our uncertainty about the data.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
There is almost always uncertainty associated with our conclusions. It is important to notice and acknowledge the potential problems with the data.
Rearrange the material in analysis.qmd
so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K
to ensure that everything works.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add rsconnect
to the .gitignore
file. You don't want your personal Rpubs details stored in the clear on Github. Commit/push everything.
Publish analysis.qmd
to Rpubs. Choose a sensible slug. Copy/paste the url below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This tutorial covered topics related to Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.