In PPBDS/primer.tutorials: Tutorials for Preceptor's Primer for Bayesian Data Science

library(learnr)
library(tutorial.helpers)
library(gt)

library(tidyverse)
library(primer.data)
library(tidymodels)
library(gtsummary)
library(equatiomatic)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local") 

nes_92 <- nes |> 
  filter(year == 1992) |> 
  select(sex, pres_vote) |> 
  drop_na() |> 
  mutate(pres_vote = as.factor(case_when(
    pres_vote == "Democrat" ~ "Clinton",
    pres_vote == "Republican" ~ "Bush",
    pres_vote == "Third Party" ~ "Perot",
  ))) 

fit_nes <- multinom_reg(engine = "nnet") |>
  fit(pres_vote ~ sex, data = nes_92)

# Notice that tidy(fit_nes) produces estimates of 4 parameters, which, when
# plugged into standard logistic formulas, would give us the probabilities for
# Clinton and Perot for men and women. The probability for Bush is then
# calculated via subtraction.

# I could not get simple plot_predictions() to work. Can you? I think it is
# because marginaleffects does not deal well with situations like this.
# Regardless, we can still have students make a nice looking plot with this
# code.

tmp_p <- plot_predictions(fit_nes, 
                          by = "sex", 
                          type = "prob", 
                          draw = FALSE) |> 
    ggplot(aes(x = group, y = estimate, color = sex)) +
      geom_point(size = 3, position = position_dodge(width = 0.5)) +
      geom_errorbar(aes(ymin = conf.low, 
                        ymax = conf.high), 
                    width = 0.2, 
                    position = position_dodge(width = 0.5)) +
      labs(title = "Voting Preferences by Candidate and Sex",
           x = NULL,
           y = "Estimated Proportion",
           color = "Sex") +
      theme_minimal()

Introduction

This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

The world confronts us. Make decisions we must.

Imagine you are a campaign strategist for this year's U.S. Presidential election. Your job is to help your candidate reach the voters who are most likely to support him. But here’s the catch: you need to figure out how gender might influence voters’ choices. Do men and women favor different candidates? Could understanding these patterns help your candidate win over crucial voting blocs?

The Question

It is not the answer that enlightens, but the question. - Eugene Ionesco

Which candidate do men vote for most?

Exercise 1

Load the tidyverse package.

library(tidyverse)

library(tidyverse)

We will use data on how Americans voted, taken from the American National Election Studies (ANES) survey. The primer.data package includes a version of the main data set with a selection of variables. The full ANES data is much richer than this relatively simple tibble.

Exercise 2

Load the primer.data package.

library(primer.data)

library(primer.data)

A version of the data from the American National Election Studies is available in the nes tibble. This tibble consists of r scales::comma(nrow(nes)) rows and r scales::comma(ncol(nes)) columns.

Exercise 3

After loading primer.data in your Console, type ?nes in the Console, and paste in the Description below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Note that the survey has been conducted before and after each presidential election since 1948, and some of the questions asked in the survey have changed slightly over time. Further information on this issue can be found in the ANES codebook.

Exercise 4

Elections are the broad topic of this tutorial. Given that topic, which variable in nes should we use as our outcome variable?

question_text(NULL,
    message = "`pres_vote` is our outcome variable.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

pres_vote is the party of the candidate for whom the respondent voted. Such explorations are often restricted to just the two “major party” candidates, the nominees of the Democratic and Republican parties, Bill Clinton and George H. W. Bush. But, in 1992, Ross Perot was a very successful “third party” candidate, winning almost 19% of the vote. We recoded each party as the name of its candidate.

nes_92 |> 
  ggplot(aes(x = pres_vote)) +
    geom_bar(position = "dodge") +
    labs(title = "Survey of 1992 Presidential Election Votes",
         subtitle = "Clinton received the most votes in the 1992 Presidential Election",
         x = NULL,
         y = "Count",
         caption = "Source: American National Election Studies")

Exercise 5

Let's imagine a brand new variable which does not exist in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome and if it is "Y," or whatever, it generates another potential outcome.

Describe this imaginary variable and how we might manipulate its value.

# XX: In your answer, and for the next few questions, always treat this
# imaginary variable as real by putting backticks around the name. For example,
# with nhanes data, we might imagine a variable called `vitamin` for which `1`
# means that the individual ate vitamins growing up and `0` means they did not.
# Using the words "treatment group" and "control group" as part of your answer
# is often helpful since it reinforces the fact that we are using the Rubin
# Causal Model.

question_text(NULL,
    message = "(This is an example answer.) Imagine a variable called `phone_call` which has a value of `1` if the person received a phone call urging them to vote and `0` if they did not receive such a phone call. We, meaning the organization in charge of making such phone calls, can manipulate this variable by deciding, either randomly or otherwise, whether or not we will call a specific individual.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.

Exercise 6

Given our (imaginary) treatment variable phone_call, how many potential outcomes are there for each individual? Explain why.

question_text(NULL,
    message = "There are 2 potential outcomes because the treatment variable `phone_call` takes on 2 possible values: receive phone call versus no phone call.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. All of this stuff is a conceptual framework we apply to the data. It is never inherent in the data itself.

Exercise 7

In a few sentences, specify two different values for the imaginary treatment variable phone_call, for a single unit, and then guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.

# XX: Replace [XX: unit] with a better word below given the actual data set we
# are using. Replace all the XX terms as appropriate.

# XX: For a given individual, assume that the value of the treatment variables
# might be 'exposure to Spanish-speakers' or 'no exposure'. If the individual
# gets 'exposure to Spanish-speakers', then her attitude toward immigration
# would be 10. If the individual gets 'no exposure', then her attitude would be
# 8. The causal effect on the outcome of a treatment of exposure to
# Spanish-speakers versus no exposure is 10 - 8 --- i.e., the difference between
# two potential outcomes --- which equals 2, which is the causal effect.

# XX: If the outcome is a character variable, like Strongly Approve, then there
# is no simple metric on which we can pinpoint the causal effect. That is, the
# causal effect is still defined --- as, in this example, the difference between
# Strongly Approve and Neutral --- but can not be expressed as a number, at
# least without further work.

question_text(NULL,
message = "For a given individual, assume that the value of the treatment variable might be `receive phone call` or `no phone call`. If this individual gets `receive phone call`, then they vote for Bush. If the individual gets `no phone call`, they vote for Clinton. The causal effect of `receive phone call` versus `no phone call` is the difference between voting for Bush and voting for Clinton, a difference which cannot be expressed as a single number without further work.",
answer(NULL, correct = TRUE),
allow_retry = FALSE,
incorrect = NULL,
rows = 6)

Recall the definition of a causal effect: the difference between two potential outcomes. Of course, you can't just say that the causal effect is 10. The exact value depends on which potential outcome comes first in the subtraction and which second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.

Any causal connection means exploring the within-row difference between two potential outcomes. We don't need to look at any other rows to have that conversation.
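The within-row subtraction can be sketched with a few invented units. This is only an illustration: it assumes a numeric outcome (like the attitude scale in the template comment above) rather than the categorical pres_vote, and the column names y_treated and y_control are hypothetical.

```r
# Hypothetical potential outcomes for three units on a numeric scale.
# Every value is made up for illustration; none comes from nes.
potential <- data.frame(
  id        = 1:3,
  y_treated = c(10, 7, 5),  # outcome if the unit receives the phone call
  y_control = c(8, 7, 6)    # outcome if the unit does not
)

# Default convention: treatment minus control, computed within each row.
potential$effect <- potential$y_treated - potential$y_control

# Reversing the order of the subtraction flips the sign of every effect.
potential$effect_reversed <- potential$y_control - potential$y_treated

print(potential)
```

Each effect uses only the two potential outcomes in its own row, which is why no other rows are needed to define it.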

Exercise 8

Let's consider a predictive model. Which variable in nes do you think might have an important connection to pres_vote?

question_text(NULL,
    message = "`sex` is a potential variable that may relate to `pres_vote`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

With a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be a treatment variable. We are assuming that the values of all covariates are "fixed."

Exercise 9

Specify two different groups of individuals which have specific values for sex and which might have different average values for pres_vote.

question_text(NULL,
    message = "Consider two groups, the first with a value for `sex` of `Male`. Others might have a value of `Female`. Those two groups will, on average, have different values for the probability of voting for a specific candidate.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.

You can only use causal language --- like "affect," "influence," "be associated with," "cause," "causal effect," et cetera --- in your question if you are creating a causal model, one with a treatment variable which you might, at least in theory, manipulate and with at least two potential outcomes.

Exercise 10

Write a predictive question which connects the outcome variable pres_vote to sex, the covariate of interest.

# XX: If it is causal, you should use key causal language in the question, like
# "What is the causal effect of the treatment on the outcome?" Example: "What is
# the causal effect of exposure to Spanish-speakers on attitudes toward
# immigration?" If the model is predictive, the question should clearly compare
# two groups of units. "What is the difference in the outcome variable between
# two groups of units?" Example:  "What is the difference in immigration
# attitudes between Democrats and Republicans?" In both cases, the word
# "average" is implicit in the question.

question_text(NULL,
    message = "What was the difference in voting preference of men and women in the 1992 US Presidential election among supporters of the three leading candidates: Clinton, Bush and Perot?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The question can not specify values for more than one covariate, simply because you do not know which covariates will be included in the model until you create it. You must mention the covariate (i.e., the treatment) in a causal model. Also, it is not unreasonable to specify, before you start, a covariate whose connection, if any, to the outcome is of special interest.

Exercise 11

What is a Quantity of Interest which might help us to answer our question?

question_text(NULL,
    message = "One quantity of interest is the probability of voting for Clinton.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.
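To make one such number concrete, here is a minimal sketch which computes the share of each sex voting for Clinton and the gap between the two. The toy data frame is invented for illustration and is not the real nes_92 tibble. A raw sample proportion like this is only a stand-in for the modeled probability which the tutorial builds later.

```r
# Toy data: made-up votes, not the real nes_92 tibble.
toy <- data.frame(
  sex = c("Female", "Female", "Female", "Female",
          "Male", "Male", "Male", "Male"),
  pres_vote = c("Clinton", "Clinton", "Bush", "Clinton",
                "Bush", "Perot", "Clinton", "Bush")
)

# Share of each sex voting for each candidate. margin = 1 makes rows sum to 1.
shares <- prop.table(table(toy$sex, toy$pres_vote), margin = 1)

# One candidate Quantity of Interest, and the between-group difference.
clinton_female <- shares["Female", "Clinton"]
clinton_male   <- shares["Male", "Clinton"]
clinton_gap    <- clinton_female - clinton_male

print(shares)
print(clinton_gap)
```

Choosing a single number such as clinton_female keeps the target specific, even though the full table of shares contains many other numbers we might care about.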

Wisdom

The only true wisdom is in knowing you know nothing. - Socrates

Our question:

What was the difference in voting preference of men and women in the 1992 US Presidential election among supporters of the three leading candidates: Clinton, Bush and Perot?

Exercise 1

In your own words, describe the key components of Wisdom for working on a data science problem.

question_text(NULL,
    message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The central problem for Wisdom is: Can we use data from nes to predict the voting behavior of men and women in the US this year? When was the data collected? Are the survey questions the same across years?

Exercise 2

Define a Preceptor Table.

question_text(NULL,
    message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantities of interest.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.

Exercise 3

Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."

question_text(NULL,
    message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This problem is predictive so there are only covariates. In our problem, one of the covariates is each individual's sex.

Exercise 4

Create a GitHub repo called four-categorical-variables. Make sure to click the "Add a README file" check box.

Connect the GitHub repo to an R project on your computer. Give the R project the same name.

Select File -> New File -> Quarto Document .... Provide a title -- "Four-Categorical-Variables" -- and an author (you). Render the document and save it as analysis.qmd.

Edit the .gitignore by adding *Rproj. Save and commit this in the Git tab. Push the commit.

In the Console, run:

show_file(".gitignore")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 7)

Remove everything below the YAML header from analysis.qmd and render the file. Command/Ctrl + Shift + K first saves the file and then renders it.

Exercise 5

What are the units for this problem?

question_text(NULL,
    message = "Individual US voters, one row per person. The rows of the Preceptor Table are the units, the objects on which the outcome is measured.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

We are looking at who voted for these three candidates: Bush, Clinton and Perot. The question suggests that we are not interested in people who did not vote, although one might explore if men were more or less likely to vote in the first place. As always, the initial question rarely specifies the Preceptor Table precisely.

Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.

Exercise 6

What is the outcome for this problem?

question_text(NULL,
    message = "The outcome for this problem is the presidential voting result of each individual voter. This is not the same thing as the answer to the question we have been asked. But, if we can build a model which explains/understands/predicts voting result for an individual, we can use that model to answer our questions.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

nes_92 |> 
  ggplot(aes(x = pres_vote, fill = sex)) +
    geom_bar(position = "dodge") +
    labs(title = "Survey of 1992 Presidential Election Votes",
         subtitle = "Men were much more likely to support Ross Perot",
         x = NULL,
         y = "Count",
         caption = "Source: American National Election Studies")

Regardless, the central lesson is always the same: You can never look at your data too much.

Exercise 7

What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?

question_text(NULL,
    message = "We will certainly need sex, with two values: “Male” and “Female”. Other variables which might be helpful include party, income, race and past voting history.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables for which we have data. Third, it is the set of variables in the data which we end up using in the model.

## XX: Make a nice looking plot which shows the outcome variable and at least
## one covariate. The subtitle should highlight an interesting/important
## observation about the outcome.

Exercise 8

What are the treatments, if any, for this problem?

question_text(NULL,
    message = "There are no treatment variables.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated, thereby creating two or more different potential outcomes for each unit. Since this represents a predictive model, there are no treatments.

Exercise 9

What moment in time does the Preceptor Table refer to?

question_text(NULL,
    message = "The Presidential Election result in 1992.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This is often implicit in the question itself. One of our key roles as data scientists is to clarify the questions which we are asked. In this case, it seems clear that the question refers to the past, but questions can also refer to the present, or even the future.

Exercise 10

Define a causal effect.

question_text(NULL,
    message = "A causal effect is the difference between two potential outcomes.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The point of the Rubin Causal Model is that the definition of a causal effect is the difference between two potential outcomes. So, there must be two (or more) potential outcomes for any causal model to make sense. This is simplest to discuss when the treatment only has two different values, thereby generating only two potential outcomes. But, if the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.

Exercise 11

What is the fundamental problem of causal inference?

question_text(NULL,
    message = "The fundamental problem of causal inference is that we can only observe one potential outcome.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

A person cannot experience both treatments at the same time. For instance, if you call a person before the election and record their vote, you cannot then rewind to that exact same time, withhold the call, and record their vote again.
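A small sketch can make the missing potential outcome visible. The units, the phone_call assignments, and the column names vote_if_called and vote_if_not_called are all hypothetical. In reality we would never get to see the complete table which the code starts from.

```r
# Hypothetical units with BOTH potential outcomes filled in.
# Every value is invented; such a complete table is never available in practice.
units <- data.frame(
  id = 1:4,
  phone_call = c(1, 0, 1, 0),
  vote_if_called = c("Bush", "Clinton", "Perot", "Clinton"),
  vote_if_not_called = c("Clinton", "Clinton", "Perot", "Bush")
)

# What we could actually record: only the potential outcome which matches the
# treatment each unit received. The other potential outcome is missing.
observed <- units
observed$vote_if_called[observed$phone_call == 0] <- NA
observed$vote_if_not_called[observed$phone_call == 1] <- NA

print(observed)
```

Every row of observed has exactly one NA, so the within-row causal effect can never be read directly from the data.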

Exercise 12

How does the motto "No causal inference without manipulation." apply in this problem?

question_text(NULL,
    message = "The motto does not apply because this is a predictive, not causal, model.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

We have to choose a variable that we can change to be the treatment. If we do not have such a variable, then we have to create a predictive model instead. For example, if we were focused on individuals' votes, one conclusion may be that women are expected to have a higher probability of voting for Clinton than men. Correlation does not mean causation: we cannot assume that sex directly makes people prefer Clinton. In order to find a causal relationship, we would need to manipulate the treatment so that we can measure its effect on the outcome.

Exercise 13

Describe in words the Preceptor Table for this problem.

question_text(NULL,
    message = "The Preceptor Table has one row for each voter, one outcome column for the party of the candidate who received the vote, and one covariate, sex.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table for this problem looks something like this:

#| echo: false
tibble(ID = c("1", "2", "...", "10", "11", "...", "103,754,865"),
       vote = c("Democrat", "Third Party", "...", "Republican", "Democrat", "...", "Republican"),
       sex = c("M", "F", "...", "F", "F", "...", "M")) |>

  gt() |>
  tab_header(title = "Preceptor Table") |> 
  cols_label(ID = md("ID"),
             vote = md("Vote"),
             sex = md("Sex")) |>
  tab_style(cell_borders(sides = "right"),
            locations = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Outcome", columns = c(vote)) |>
  tab_spanner(label = "Covariate", columns = c(sex))

Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem.

Exercise 14

In analysis.qmd, load the tidyverse and the primer.data packages in a new code chunk. Label it as the setup code chunk by adding #| label: setup. Render the file.

Notice that the file does not look good because it has code that is showing and it also has messages. To take care of this, add #| message: false to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all code echos from the whole file:

execute: 
  echo: false

In the Console, run:

show_file("analysis.qmd", start = -5)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 6)

Render again. Everything looks nice because we have added code to make the file look better and more professional.

Exercise 15

Run glimpse() on nes.

glimpse(nes)

glimpse(nes)

The nes data set has 18 variables, including political attitudes and biographical information of the participants. We do not care about most of these variables for our question.

Note that the values of both sex and education are words, but the two variables have different types. Specifically, the former is a character variable and the latter is a factor. This is because we cannot rank Female and Male, but we can rank levels of education.
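The distinction can be sketched with two toy vectors. The education level names below are hypothetical, and the sketch uses an ordered factor to make the ranking explicit. Whether nes stores education as an ordered factor or a plain factor is an assumption you can check with class(nes$education).

```r
# A character vector: plain labels with no ranking.
sex <- c("Female", "Male", "Female")

# An ordered factor: labels plus an explicit ranking of the levels.
# These level names are hypothetical, not the ones used in nes.
education <- factor(
  c("College", "Elementary", "High School"),
  levels = c("Elementary", "High School", "College"),
  ordered = TRUE
)

class(sex)
class(education)

# Comparisons are meaningful for an ordered factor.
education[1] > education[2]  # "College" ranks above "Elementary"
```

Comparing the values of sex with > would only compare the strings alphabetically, which says nothing meaningful about the people involved.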

Exercise 16

Pipe nes to filter() and keep only the rows for which year equals 1992.

nes |> 
  filter(...)

nes |> 
  filter(year == 1992)

This keeps only the data from the year 1992.

Exercise 17

Continue the pipe with select(). Be sure to select the pres_vote and sex columns. Most data sets have some NA values, and we have to get rid of these so that we can use the data. Continue the pipe with drop_na().

... |>
  select(..., ...) |>
  drop_na()

nes |> 
  filter(year == 1992) |>
  select(pres_vote, sex) |> 
  drop_na()

We have to clean the data so we can focus on the specific numbers needed to answer our question. We only keep the data we need, which is the candidate voted for and the sex of the voter.

Exercise 18

Finish the pipe with mutate(). Set the pres_vote argument to as.factor(case_when()). Inside of case_when(), change pres_vote from the name of the political party to the name of the candidate who received the vote.

Good news. We did this for you! Just hit "Run Code".

nes |> 
  filter(year == 1992) |>
  select(pres_vote, sex) |>
  drop_na() |> 
  mutate(pres_vote = as.factor(case_when(
    pres_vote == "Democrat" ~ "Clinton",
    pres_vote == "Republican" ~ "Bush",
    pres_vote == "Third Party" ~ "Perot",
  )))

... |> 
  mutate(pres_vote = as.factor(case_when(
    pres_vote == "Democrat" ~ ...,
    pres_vote == ... ~ "Bush",
    pres_vote == ... ~ ...
  )))

nes |> 
  filter(year == 1992) |>
  select(pres_vote, sex) |>
  drop_na() |> 
  mutate(pres_vote = as.factor(case_when(
    pres_vote == "Democrat" ~ "Clinton",
    pres_vote == "Republican" ~ "Bush",
    pres_vote == "Third Party" ~ "Perot"
  )))

The data uses the name of the political party, but we want the name of the candidate. The type of pres_vote in the data set is character, but the model only accepts factor variables, so we change the type of pres_vote to factor using as.factor().
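Here is a minimal base R sketch of the same two steps, recoding and then converting to a factor, on a toy vector. It swaps case_when() for a named lookup vector, which is a different technique from the one in the exercise. The object names party and lookup are hypothetical.

```r
# Toy party labels standing in for the pres_vote column of nes.
party <- c("Democrat", "Republican", "Third Party", "Democrat")

# Named lookup vector: names are the old values, elements are the new ones.
lookup <- c("Democrat" = "Clinton",
            "Republican" = "Bush",
            "Third Party" = "Perot")

# Step 1: recode. The result is still a character vector.
candidate_chr <- unname(lookup[party])

# Step 2: convert to a factor. The levels are the sorted unique values.
candidate_fct <- as.factor(candidate_chr)

class(candidate_chr)
class(candidate_fct)
levels(candidate_fct)
```

Because as.factor() sorts the levels, Bush comes first here. Modeling functions commonly treat the first level as the baseline category, so the level order is worth checking before fitting.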

nes_92 |> 
  ggplot(aes(x = pres_vote, fill = pres_vote)) +
    geom_bar(position = "dodge") +
    labs(title = "Survey of 1992 Presidential Election Votes",
         subtitle = "Clinton received the most votes in the 1992 Presidential Election",
         x = NULL,
         y = "Count",
         caption = "Source: American National Election Studies", 
         fill = "Candidate")

Exercise 19

In analysis.qmd, add a new code chunk, copy/paste the pipeline above, and assign the result to the new object nes_92:

nes_92 <- nes |> 
  filter(year == 1992) |> 
  select(sex, pres_vote) |> 
  drop_na() |> 
  mutate(pres_vote = as.factor(case_when(
    pres_vote == "Democrat" ~ "Clinton",
    pres_vote == "Republican" ~ "Bush",
    pres_vote == "Third Party" ~ "Perot",
  )))

Command/Ctrl + Shift + K. In the Console, run:

show_file("analysis.qmd", start = -5)

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

You now have an object nes_92, a subset of nes that has been cleaned and narrowed down to meet the requirements of our question.

Exercise 20

In your own words, define "validity" as we use the term.

question_text(NULL,
    message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.

Exercise 21

Provide one reason why the assumption of validity might not hold for the outcome variable: pres_vote or for one of the covariates. Use the words "column" or "columns" in your answer.

question_text(NULL,
    message = "People may claim that they voted for one candidate when they really voted for another. This means that the column `pres_vote` in the data we have does not match up with the `pres_vote` column in the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.

Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.

Exercise 22

Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.

The first sentence is a general statement about the overall topic, mentioning both the general class of the outcome variable and of at least one of the covariates. It is not connected to the initial "Imagine that you are XX" which set the stage for this project. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence is to let the reader know, gently, about our topic.

The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question. The sentence can't be that long. Important aspects of the data include when/where it was gathered, how many observations it includes and the organization (if famous) which collected it.

Type your two sentences below.

question_text(NULL,
    message = "Understanding the voter preference of different genders is essential for a candidate to design the campaign strategy. Using data from the National Election Studies survey of US citizens, we seek to understand the relationship between voter preference and sex in the 1992 Presidential election.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to analysis.qmd, Command/Ctrl + Shift + K, and then commit/push.

Justice

Justice delayed is justice denied. - William E. Gladstone

Exercise 1

In your own words, name the four key components of Justice for working on a data science problem.

question_text(NULL,
    message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.

Exercise 2

In your own words, define a Population Table.

question_text(NULL,
    message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Population Table is almost always much bigger than the combination of the Preceptor Table and the data. If we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units, since the Preceptor Table and the data are, themselves, often far apart from each other.

Here is our Population Table:

#| echo: false
tibble(source = c("PT/Data", "PT/Data", "PT", "PT", "PT", "PT", "...", "PT/Data", "PT/Data", "PT",  "PT",  "...", "PT/Data"),
       ID = c("1", "2", "3", "4", "5", "6", "...", "10", "11", "12", "13", "...", "103,754,865"),
       vote = c("Democrat", "Third Party", "Republican", "Democrat", "Democrat", "Democrat",  "...", "Republican", "Democrat", "Democrat", "Republican", "...", "Republican"),
       sex = c("M", "F", "M", "F", "F", "M", "...", "F", "F", "...", "F", "...", "M")) |>

  gt() |>
  tab_header(title = "Population Table") |> 
  cols_label(source = md("Source"),
             ID = md("ID"),
             vote = md("Vote"),
             sex = md("Sex")) |>
  tab_style(cell_borders(sides = "right"),
            locations = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Outcome", columns = c(vote)) |>
  tab_spanner(label = "Covariate", columns = c(sex))

Exercise 3

In your own words, define the assumption of "stability" when employed in the context of data science.

question_text(NULL,
    message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Stability is all about time. Is the relationship among the columns in the Population Table stable over time? In particular, is the relationship --- which is another way of saying "mathematical formula" --- at the time the data was gathered the same as the relationship at the (generally later) time referenced by the Preceptor Table?

Exercise 4

Provide one reason why the assumption of stability might not be true in this case.

question_text(NULL,
    message = "The time of voting varies across individuals. Some may have voted right before the deadline, while others voted weeks before the election. Their preferences at each time may be different.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Some voters cast their ballots weeks before Election Day. Some NES participants were surveyed right after the election. Some were surveyed later. We sweep all these complications under the mythical moment in time which we assert is the same for both the data and the Preceptor Table.

A change in time or the distribution of the data does not, in and of itself, demonstrate a violation of stability. Stability is about the parameters: $\beta_0$, $\beta_1$ and so on. Stability means these parameters are the same in the data as they are in the population as they are in the Preceptor Table.

Exercise 5

We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table. In your own words, define the assumption of "representativeness" when employed in the context of data science.

question_text(NULL,
    message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the data and the other rows. The second is between the other rows and the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.

Exercise 6

We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.

# XX: In your answer, try not to use the concept of time, even though, in
# theory, it is perfectly reasonable to do so. Instead, focus on why the data
# might not be representative of the population at that moment in time.

question_text(NULL,
    message = "XX",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Representativeness is important because, when it is violated, the estimates for the model parameters might be biased.
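To see how an unrepresentative sample can bias an estimate, consider the small simulation below. It is only a sketch under stated assumptions: the population, the 45% support rate, and the response weights are all hypothetical and have nothing to do with the NES data.

```r
# Hypothetical simulation; none of these numbers come from the NES data.
set.seed(1992)

# A made-up population of 20,000 voters, about 45% of whom support candidate A.
population <- rbinom(20000, size = 1, prob = 0.45)

# Representative data: every voter is equally likely to be surveyed.
random_sample <- sample(population, size = 1000)

# Unrepresentative data: supporters of A are twice as likely to respond.
response_weight <- ifelse(population == 1, 2, 1)
biased_sample <- sample(population, size = 1000, prob = response_weight)

# Compare the true proportion with the two estimates.
round(c(truth  = mean(population),
        random = mean(random_sample),
        biased = mean(biased_sample)), 3)
```

The random sample should land close to the true proportion, while the weighted sample should overstate support for candidate A, even though both have the same number of observations.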

Exercise 7

We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.

question_text(NULL,
    message = "Since voting is a right, but not compulsory for everyone, the people who voted may not be representative of the entire population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Stability looks across time periods. Representativeness looks within time periods, for the most part.

Exercise 8

In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.

question_text(NULL,
    message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.

Exercise 9

Write one sentence which highlights a potential weakness in your model. This will almost always be derived from possible problems with the assumptions discussed above. We will add this sentence to our summary paragraph. So far, our version of the summary paragraph looks like this:

Understanding the voter preference of different genders is essential for a candidate to design a campaign strategy. Using data from the National Election Studies survey of US citizens, we seek to understand the relationship between voter preference and sex in the 1992 Presidential election.

Of course, your version will be somewhat different.

question_text(NULL,
    message = "Understanding the voter preference of different genders is essential for a candidate to design a campaign strategy. Using data from the National Election Studies survey of US citizens, we seek to understand the relationship between voter preference and sex in the 1992 Presidential election. However, since not everyone participates in the survey, the data might not be representative of the entire population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Add a weakness sentence to the summary paragraph in your QMD. You can modify your paragraph as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K, and then commit/push.

Courage

Courage is the commitment to begin without any guarantee of success. - Johann Wolfgang von Goethe

Exercise 1

In your own words, describe the components of the virtue of Courage for analyzing data.

question_text(NULL,
    message = "Courage begins with the exploration and testing of different models. It concludes with the creation of a data generating mechanism.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.

Exercise 2

Load the tidymodels package.

library(...)

library(tidymodels)

The probability family is determined by the outcome variable pres_vote. Because the outcome variable is a categorical variable with three possible values, the probability family is the multinomial distribution.

$$Y \sim \text{Multinomial}(\rho_{bush}, \rho_{clinton}, \rho_{perot})$$

where $$ \rho_{bush} + \rho_{clinton} + \rho_{perot} = 1 $$
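As a rough illustration of what a multinomial outcome looks like, the sketch below draws simulated votes with base R's rmultinom(). The three probabilities are hypothetical values chosen only so that they sum to 1; they are not estimates from nes_92.

```r
set.seed(1)

# Hypothetical probabilities for the three candidates; they must sum to 1.
rho <- c(bush = 0.35, clinton = 0.45, perot = 0.20)
stopifnot(isTRUE(all.equal(sum(rho), 1)))

# One draw from the multinomial: how 1,000 simulated voters split their votes.
draw <- rmultinom(n = 1, size = 1000, prob = rho)
counts <- setNames(as.vector(draw), names(rho))

counts          # simulated vote counts
counts / 1000   # simulated proportions, which should be close to rho
```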

Exercise 3

Load the gtsummary package.

library(...)

library(gtsummary)

The link function, the basic mathematical structure of the model, is (mostly) determined by the type of outcome variable. The link function for a multinomial distribution is the logit link function.

$$ \begin{aligned}
\rho_{clinton} &=& \frac{e^{\beta_{0, clinton} + \beta_{1, clinton} male}}{1 + e^{\beta_{0, clinton} + \beta_{1, clinton} male}}\\
\rho_{perot} &=& \frac{e^{\beta_{0, perot} + \beta_{1, perot} male}}{1 + e^{\beta_{0, perot} + \beta_{1, perot} male}}\\
\rho_{bush}  &=& 1 - \rho_{clinton} - \rho_{perot}
\end{aligned} $$

The model when fitted produces estimates of 4 parameters which, when plugged into the standard logistic formulas, would give us the probabilities for Clinton and Perot for men and women. The probability for Bush is then calculated via subtraction.

Exercise 4

Add library(tidymodels) and library(gtsummary) to the setup code chunk in analysis.qmd. Copy and paste the code below for the mathematical structure of the model into the body of analysis.qmd. Command/Ctrl + Shift + K.

At the Console, run:

$$ Y \sim \text{Multinomial}(\rho_{bush}, \rho_{clinton}, \rho_{perot}) $$

$$ \begin{aligned}
\rho_{clinton} &=& \frac{e^{\beta_{0, clinton} + \beta_{1, clinton} male}}{1 + e^{\beta_{0, clinton} + \beta_{1, clinton} male}}\\
\rho_{perot} &=& \frac{e^{\beta_{0, perot} + \beta_{1, perot} male}}{1 + e^{\beta_{0, perot} + \beta_{1, perot} male}}\\
\rho_{bush}  &=& 1 - \rho_{clinton} - \rho_{perot}
\end{aligned} $$

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Recall that a categorical variable (whether character or factor) like sex is turned into a $0/1$ "dummy" variable, which is then given a name like $male$. In this case, $male$ takes the value 1 when the voter is Male and 0 when the voter is Female. After all, we can't have words --- like "Male" or "Female" --- in a mathematical formula, hence the need for dummy variables.
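The dummy-variable conversion can be seen directly with base R's model.matrix(). This is a minimal sketch on a hypothetical four-row data frame, not on nes_92. R names the resulting column sexMale, which plays the role of $male$ in the formulas.

```r
# Hypothetical toy data with the same two categories as sex in nes_92.
toy_sex <- data.frame(sex = c("Female", "Male", "Male", "Female"))

# model.matrix() builds the design matrix that a modelling function would use.
design_sex <- model.matrix(~ sex, data = toy_sex)

# The sexMale column is 1 for Male rows and 0 for Female rows.
design_sex
```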

Exercise 5

Because our outcome variable is multinomial, start to create the model by using multinom_reg(engine = "nnet").

multinom_reg(engine = ...)

multinom_reg(engine = "nnet")

The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy $0/1$ variables. The presence of an intercept in most models means that we can't have $N$ categories. The "missing" category is incorporated into the intercept. If race has three values --- "black", "hispanic", and "white" --- then the model creates two 0/1 dummy variables, giving them names like $race_{hispanic}$ and $race_{white}$. The results for the first category are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied. However, note that there is no variable like this in our model.
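The race example in the paragraph above can be checked the same way. The data frame here is hypothetical and exists only to show that three categories produce an intercept plus two dummy columns.

```r
# Hypothetical toy data with three race categories.
toy_race <- data.frame(race = c("black", "hispanic", "white", "hispanic"))

design_race <- model.matrix(~ race, data = toy_race)

# "black" is first alphabetically, so it is absorbed into the intercept.
colnames(design_race)
```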

Exercise 6

Continue the pipe to fit(pres_vote ~ sex, data = nes_92).

... |> 
  fit(..., data = ...)

multinom_reg(engine = "nnet") |> 
  fit(pres_vote ~ sex, data = nes_92)

We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:

$$ \begin{aligned}
\hat{\rho}_{clinton} &=& \frac{e^{0.45 - 0.25 male}}{1 + e^{0.45 - 0.25 male}}\\
\hat{\rho}_{perot} &=& \frac{e^{-0.85 + 0.42 male}}{1 + e^{-0.85 + 0.42 male}}\\
\hat{\rho}_{bush}  &=& 1 - \hat{\rho}_{clinton} - \hat{\rho}_{perot}
\end{aligned} $$

There are three main differences between this representation of the model and our previous one. First, we replace the parameters with our best estimate of their values. Second, the error term is gone. Third, the dependent variable now has a "hat," indicating that it is our "fitted" value, our best guess as to the value of the outcome, given the values of the independent variables for any given unit.
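To make the step from coefficients to probabilities concrete, here is a hedged sketch in base R using the rounded estimates quoted in this tutorial. It rests on an assumption that should be stated loudly: multinomial fits from nnet treat the first factor level (here Bush) as the baseline, so all three probabilities share a single denominator containing both exponential terms. The helper predicted_probs() is a hypothetical name introduced only for this illustration.

```r
# Rounded estimates quoted in this tutorial (intercept and male coefficient).
beta <- list(clinton = c(intercept = 0.45,  male = -0.25),
             perot   = c(intercept = -0.85, male =  0.42))

# Assumption: Bush is the baseline, so its linear predictor is fixed at 0 and
# all three probabilities share one denominator (the softmax form).
predicted_probs <- function(male) {
  eta <- c(bush    = 0,
           clinton = beta$clinton[["intercept"]] + beta$clinton[["male"]] * male,
           perot   = beta$perot[["intercept"]]   + beta$perot[["male"]]   * male)
  exp(eta) / sum(exp(eta))
}

# Rows are female (male = 0) and male (male = 1); each row sums to 1.
round(rbind(female = predicted_probs(0), male = predicted_probs(1)), 2)
```

With these rounded inputs, the female probability for Clinton comes out near 0.52, in line with the roughly 53% reported in the summary paragraph at the end of this tutorial.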

Exercise 7

Behind the scenes of this tutorial, an object called fit_nes has been created which is the result of the code above. Type fit_nes and hit "Run Code." This generates the same results as using print(fit_nes).

fit_nes

fit_nes

In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.

Exercise 8

Create a new code chunk in analysis.qmd. Add two code chunk options: label: model and cache: true. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_nes.

Command/Ctrl + Shift + K. It may take some time to render analysis.qmd, depending on how complex your model is. But, by including cache: true you cause Quarto to cache the results of the chunk. The next time you render analysis.qmd, as long as you have not changed the code, Quarto will just load up the saved fitted object.

To confirm, Command/Ctrl + Shift + K again. It should be quick.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)

Add *_cache to the .gitignore file. Commit and push. Cached objects are often large. They don't belong on GitHub.

Exercise 9

Create another code chunk in analysis.qmd. Add the chunk option: label: math. In that code chunk, add something like the below.

$$ \begin{aligned}
\hat{\rho}_{clinton} &=& \frac{e^{0.45 - 0.25 male}}{1 + e^{0.45 - 0.25 male}}\\
\hat{\rho}_{perot} &=& \frac{e^{-0.85 + 0.42 male}}{1 + e^{-0.85 + 0.42 male}}\\
\hat{\rho}_{bush}  &=& 1 - \hat{\rho}_{clinton} - \hat{\rho}_{perot}
\end{aligned} $$

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", pattern = "extract")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

This is our data generating mechanism.

Exercise 10

Run tidy() on fit_nes with the argument conf.int set equal to TRUE. This returns 95% confidence intervals for all the parameters in our model.

tidy(..., conf.int = ...)

tidy(fit_nes, conf.int = TRUE)

tidy() is part of the broom package, used to summarize information from a wide variety of models.

Exercise 11

tidy(fit_nes, conf.int = TRUE) |> 
  select("y.level","term", "estimate") |>
  filter(y.level == "Clinton" & term == "(Intercept)")

Write a sentence interpreting the estimate for the Intercept of Clinton.

question_text(NULL,
    message = "Our (best) guess/estimate for the Clinton intercept is 0.455. This number is on the log-odds scale for female voters. It is not, by itself, the probability of a woman voting for Clinton.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Recall that we cannot have both "Male" and "Female" categories in the model. Instead, the Female category is incorporated into the intercept and becomes the reference case. Thus, the intercept describes female voters. Because the model uses a logit link, the estimate must be transformed before it can be read as a probability.

Exercise 12

tidy(fit_nes, conf.int = TRUE) |> 
  select("y.level","term", "estimate") |>
  filter(y.level == "Clinton" & term == "sexMale")

Write a sentence interpreting the estimate for sexMale.

question_text(NULL,
    message = "When comparing men and women, our guess/estimate is that the log-odds of voting for Clinton are 0.255 lower for men. Because the estimate is on the log-odds scale, it is not a 25.5 percentage point difference in probability.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Note how, whenever we consider non-treatment variables, we must never use terms like "cause," "impact" and so on. We can't make any statement which implies the existence of more than one potential outcome based on changes in non-treatment variables. We can't make any claims about within row effects. Instead, we can only compare across rows. Always use the phrase "when comparing X and Y" or something very similar.

Exercise 13

tidy(fit_nes, conf.int = TRUE) |> 
  select("y.level","term", "conf.low", "conf.high") |>
  filter(y.level == "Clinton" & term == "(Intercept)")

Write a sentence interpreting the confidence interval for Intercept.

question_text(NULL,
    message = "There is a 95% chance that the true value of the Intercept is within the range of 0.309 and 0.602.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

When looking at the confidence interval, we examine whether it excludes zero. If not, then we can't be sure if the relationship is positive or negative.
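The "does the interval exclude zero?" check is simple enough to write as a one-line helper. The function name excludes_zero() is hypothetical. The first interval is the one quoted in the answer above, and the second is a made-up interval that spans zero.

```r
# TRUE when both ends of the interval fall on the same side of zero.
excludes_zero <- function(conf_low, conf_high) {
  conf_low > 0 | conf_high < 0
}

intercept_check <- excludes_zero(0.309, 0.602)  # interval quoted above
spanning_check  <- excludes_zero(-0.10, 0.25)   # hypothetical interval

c(intercept = intercept_check, spanning = spanning_check)
```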

Exercise 14

tidy(fit_nes, conf.int = TRUE) |> 
  select("y.level","term", "conf.low", "conf.high") |>
  filter(y.level == "Clinton" & term == "sexMale")

Write a sentence interpreting the confidence interval for sexMale.

question_text(NULL,
    message = "There is a 95% chance that the true value of the coefficient of `sexMale` is within the range -0.473 and -0.0381.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Dummy variables must always be interpreted in the context of the base value for that variable, which is always included in the intercept. For example, the base value here is "Female." (The base value is the first alphabetically by default for character variables. However, if it is a factor variable, you can change that by setting the order of the levels by hand.)
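Setting the levels by hand might look like the sketch below. It uses a hypothetical toy data frame rather than nes_92, and base R's factor() and model.matrix().

```r
toy_levels <- data.frame(sex = c("Female", "Male", "Male", "Female"))

# Default: levels are sorted alphabetically, so "Female" is the base value.
default_cols <- colnames(model.matrix(~ sex, data = toy_levels))

# Setting the levels by hand makes "Male" the base value instead.
toy_levels$sex <- factor(toy_levels$sex, levels = c("Male", "Female"))
releveled_cols <- colnames(model.matrix(~ sex, data = toy_levels))

default_cols     # the dummy variable is sexMale
releveled_cols   # the dummy variable is now sexFemale
```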

Exercise 15

For interactive use, tidy() is very handy. But, for presenting our results, we should use a presentation package like gtsummary, which includes handy functions like tbl_regression().

Run tbl_regression(fit_nes).

tbl_regression(...)

tbl_regression(fit_nes)

See this tutorial for a variety of options for customizing your table.

Exercise 16

Create a new code chunk in analysis.qmd. Add a code chunk option: label: table. Add this code to the code chunk.

tbl_regression(fit_nes)

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", pattern = "tbl_regression")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Exercise 17

Add a sentence to your project summary.

Explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."

Recall the beginning of our version of the summary:

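Understanding the voter preference of different genders is essential for a candidate to design a campaign strategy. Using data from the National Election Studies survey of US citizens, we seek to understand the relationship between voter preference and sex in the 1992 Presidential election. However, since not everyone participates in the survey, the data might not be representative of the entire population.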

question_text(NULL,
    message = "Understanding the voter preference of different genders is essential for a candidate to design a campaign strategy. Using data from the National Election Studies survey of US citizens, we seek to understand the relationship between voter preference and sex in the 1992 Presidential election. However, since not everyone participates in the survey, the data might not be representative of the entire population. We modeled voting result as a multinomial function of sex.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your sentence, edited or otherwise, to the summary paragraph portion of your QMD. Command/Ctrl + Shift + K, and then commit/push.

Temperance

Temperance is a tree which has for its root very little contentment, and for its fruit calm and peace. - Buddha

Exercise 1

In your own words, describe the use of Temperance in data science.

question_text(NULL,
    message = "Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest.

Exercise 2

Load the marginaleffects package.

library(...)

library(marginaleffects)

We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

Exercise 3

What is the specific question we are trying to answer?

question_text(NULL,
    message = "What was the difference in voting preference of men and women in the 1992 US Presidential election among supporters of the three leading candidates: Clinton, Bush and Perot?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic broadly.

Exercise 4

Enter this code into the exercise code block and hit "Run Code."

plot_predictions(fit_nes,
                 by = "sex", 
                 type = "prob",
                 draw = FALSE)

plot_predictions(fit_nes, 
                 by = ...,
                 type = ...,
                 draw = ...)

plot_predictions(fit_nes,
                 by = "sex", 
                 type = "prob",
                 draw = FALSE)

This code returns the estimated probabilities of voting for each candidate, for men and for women.

Exercise 5

Let's make a nice plot to visualize our results. Continue the pipe with ggplot(), setting x equal to group, y equal to estimate, and color equal to sex in the aes() function. Note that this will return an empty graph since we have not yet added any geom layers.

... |>
  ggplot(aes(x = ..., y = ..., color = ...))

plot_predictions(fit_nes,
                 by = "sex", 
                 type = "prob",
                 draw = FALSE) |> 
  ggplot(aes(x = group, y = estimate, color = sex))

Exercise 6

Add a geom_point() layer to the graph. Set the argument size equal to 3 and position equal to position_dodge(width = 0.5).

... |>
  ggplot(aes(x = ..., y = ..., color = ...)) + 
  geom_point(size = ..., position = ...)

plot_predictions(fit_nes,
                 by = "sex", 
                 type = "prob",
                 draw = FALSE) |> 
  ggplot(aes(x = group, y = estimate, color = sex)) + 
  geom_point(size = 3, position = position_dodge(width = 0.5))

Exercise 7

Add a geom_errorbar() layer to the graph. In the aes() argument, set ymin equal to conf.low and ymax equal to conf.high. Outside aes(), set width equal to 0.2 and position equal to position_dodge(width = 0.5).

... |>
  ggplot(aes(x = ..., y = ..., color = ...)) + 
  geom_point(size = ..., position = ...) +
  geom_errorbar(aes(ymin = ..., ymax = ...), width = ..., position = ...)

plot_predictions(fit_nes,
                 by = "sex", 
                 type = "prob",
                 draw = FALSE) |> 
  ggplot(aes(x = group, y = estimate, color = sex)) + 
  geom_point(size = 3, position = position_dodge(width = 0.5)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2, position = position_dodge(width = 0.5))

Exercise 8

Finally, add a title and labels for the x-axis and y-axis to the graph. Remember that your graph should look like this:

plot_predictions(fit_nes, 
                  by = "sex", 
                  type = "prob", 
                  draw = FALSE) |> 
    ggplot(aes(x = group, y = estimate, color = sex)) +
      geom_point(size = 3, position = position_dodge(width = 0.5)) +
      geom_errorbar(aes(ymin = conf.low, 
                        ymax = conf.high), 
                    width = 0.2, 
                    position = position_dodge(width = 0.5)) +
      labs(title = "Voting Preferences by Candidate and Sex",
           x = NULL,
           y = "Estimated Proportion",
           color = "Sex") +
      theme_minimal()

... |>
    ggplot(aes(x = ..., y = ..., color = ...)) +
      geom_point(size = ..., position = ...) +
      geom_errorbar(aes(ymin = ..., ymax = ...), width = ..., position = ...) +
      labs(title = ..., x = ..., y = ..., color = ...) +
      theme_minimal()

plot_predictions(fit_nes, 
                  by = "sex", 
                  type = "prob", 
                  draw = FALSE) |> 
    ggplot(aes(x = group, y = estimate, color = sex)) +
      geom_point(size = 3, position = position_dodge(width = 0.5)) +
      geom_errorbar(aes(ymin = conf.low, 
                        ymax = conf.high), 
                    width = 0.2, 
                    position = position_dodge(width = 0.5)) +
      labs(title = "Voting Preferences by Candidate and Sex",
           x = NULL,
           y = "Estimated Proportion",
           color = "Sex") +
      theme_minimal()

Exercise 9

Add library(marginaleffects) to the analysis.qmd setup code chunk.

Create a new code chunk. Label it with label: plot. Copy/paste the code which creates your graphic.

Command/Ctrl + Shift + K to ensure that it all works as intended.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Exercise 10

Write the last sentence of your summary paragraph. It describes at least one quantity of interest (QoI) and provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)

question_text(NULL,
    message = "Understanding the voter preference of different genders is essential for a candidate to design a campaign strategy. Using data from the National Election Studies survey of US citizens, we seek to understand the relationship between voter preference and sex in the 1992 Presidential election. However, since not everyone participates in the survey, the data might not be representative of the entire population. We modeled voting result as a multinomial function of sex. Women are most likely to support Clinton. About 53% of women claim to support Clinton, although that number could be as high as 58% or as low as 48%.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Add a final sentence to your summary paragraph in your QMD as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K.

Exercise 11

Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.

question_text(NULL,
    message = "As we would tell our boss, it would not be shocking to find out that the voting preference was less or more than our estimate. This is because a lot of the assumptions we make during the process of building a model, the processes in Wisdom, are subject to error. Perhaps our data did not match the future as well as we had hoped. In such cases, widen the confidence interval, since the assumptions of your model are always false.",
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Exercise 12

Rearrange the material in your QMD so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K to ensure that everything works.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

This is the version of your QMD file which your teacher is most likely to take a close look at.

Exercise 13

Publish your rendered QMD to RPubs. Choose a sensible slug. Copy/paste the resulting url below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Add rsconnect to the .gitignore file. You don't want your personal RPubs details stored in the clear on GitHub. Commit/push everything.

Exercise 14

Copy/paste the url to your GitHub repo.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Summary

This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

PPBDS/primer.tutorials documentation built on April 3, 2025, 3:11 p.m.
