library(learnr)
library(tutorial.helpers)
library(gt)

library(tidyverse)
library(primer.data)
library(tidymodels)
library(equatiomatic)
library(broom)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local") 

x <- shaming |> 
  mutate(civ_engage = primary_00 + primary_02 + primary_04 + 
               general_00 + general_02 + general_04) |> 
  select(primary_06, treatment, sex, age, civ_engage) |> 
  mutate(voter_class = factor(
    case_when(
      civ_engage %in% c(5, 6) ~ "Always Vote",
      civ_engage %in% c(3, 4) ~ "Sometimes Vote",
      civ_engage %in% c(1, 2) ~ "Rarely Vote"),
         levels = c("Rarely Vote", 
                    "Sometimes Vote", 
                    "Always Vote"))) 

fit_vote <- logistic_reg(engine = "glm") |> 
  fit(as.factor(primary_06) ~ 
        age + sex + treatment*voter_class, 
      data = x)

tidy_vote <- tidy(fit_vote, conf.int = TRUE)


Introduction

This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

The world confronts us. Make decisions we must.

Imagine that you are running for Governor of Texas in the next election. You have a campaign budget. Your goal is to win the election. Winning the election involves convincing people to vote for you and getting your supporters to vote. Should you send postcards to registered voters likely to vote for you? What should those postcards say?

The Question

A prudent question is one half of wisdom. - Francis Bacon

What is the causal effect on voting of receiving a postcard which encourages one to vote?

Exercise 1

Load tidyverse.


library(...)
library(tidyverse)

The data come from “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment” by Gerber, Green, and Larimer (2008).

Exercise 2

Load the primer.data package.


library(...)
library(primer.data)

A version of the data is available in the shaming tibble.

Exercise 3

Familiarize yourself with the data by loading primer.data at the Console and then typing ?shaming.

Find the year in which this experiment took place and how many households were in the experiment. Write your answers below.

question_text(NULL,
    message = "The experiment was conducted prior to the August 2006 primary election in Michigan. A total of 180,000 households were part of this experiment.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In their published article, the authors note that “Only registered voters who voted in November 2004 were selected for our sample.” After this, the authors found their voting histories and then sent out the mailings. Thus, anyone who did not vote in the 2004 general election is excluded, by definition.

Exercise 4

Voting is the broad topic of this tutorial. Given that topic, which variable in shaming should we use as our outcome variable?

question_text(NULL,
    message = "The outcome is `primary_06`, which indicates whether the resident voted in the 2006 primary election.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

There were many more people age 18-22 registered to vote in Michigan in 2006 than age 27-31. It is unclear to us why that should be.

x |> 
ggplot(aes(x = age, fill = factor(primary_06))) +
    geom_bar(position = "dodge") +
    scale_fill_manual(values = c("0" = "skyblue", "1" = "coral"),
                      labels = c("0" = "Did Not Vote", 
                                 "1" = "Voted"),
                      name = NULL) +
    scale_y_continuous(labels = label_comma()) +
    labs(title = "Voting Behavior in the 2006 Michigan Primary",
         subtitle = "Old people are much more likely to vote",
         x = "Age",
         y = NULL,
         caption = "Gerber, Green, and Larimer (2008)")

Regardless, the central lesson is always the same: You can never look at your data too much.

Exercise 5

Let's imagine a brand new variable which does not exist in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome and if it is "Y," or whatever, it generates another potential outcome.

Describe this imaginary variable and how we might manipulate its value.

question_text(NULL,
    message = "Imagine a variable called `phone_call` which has a value of `1` if the person received a phone call urging them to vote and `0` if they did not receive such a phone call. We, meaning the organization in charge of making such phone calls, can manipulate this variable by deciding, either randomly or otherwise, whether or not we will call a specific individual.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.

Exercise 6

Given our imaginary treatment variable phone_call, how many potential outcomes are there for each person? Explain why.

question_text(NULL,
    message = "There are 2 potential outcomes because the treatment variable `phone_call` takes on 2 possible values: received a get-out-the-vote phone call or did not.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. All of this stuff is a conceptual framework we apply to the data. It is never inherent in the data itself.

Exercise 7

In a few sentences, specify two different values for the imaginary treatment variable phone_call, for a single unit, and then guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.

question_text(NULL,
    message = "For a given person, assume that the value of the treatment variable might be 'received a phone call' or 'did not receive a phone call'. If the person gets the phone call, then her voting behavior would be that she did vote. If the person gets no phone call, then her voting behavior might be not to vote. The causal effect on the outcome of a treatment of receiving-call versus no-call is voting minus not voting --- i.e., the difference between two potential outcomes --- which does not have a numeric value. That difference is still the causal effect, even if we can't assign a number to it. In many cases, we will just assign arbitrary numbers to the outcomes --- say 1 for voting and 0 for not-voting. Doing so allows us to report a numeric causal effect. But keep in mind that any such numbers are arbitrary.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Recall the definition of a causal effect as the difference between two potential outcomes. Of course, you can't just say that the causal effect is 10. The exact value depends on which potential outcome comes first in the subtraction and which second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.

Any causal connection means exploring the within-row difference between two potential outcomes. We don't need to look at any other rows to have that conversation.
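The within-row subtraction can be sketched in a few lines of base R. The data frame `po` and its columns `vote_if_call` and `vote_if_no_call` are hypothetical, with invented values, and the coding of 1 for voting and 0 for not voting is the arbitrary convention discussed above.

```r
# Hypothetical potential outcomes for three people (1 = votes, 0 = does not).
# These values are invented for illustration; they do not come from `shaming`.
po <- data.frame(
  id              = 1:3,
  vote_if_call    = c(1, 1, 0),
  vote_if_no_call = c(0, 1, 0)
)

# The causal effect for each row is treatment minus control,
# computed using only the values in that row.
po$causal_effect <- po$vote_if_call - po$vote_if_no_call

po
```

Reversing the order of the subtraction would flip the sign of every non-zero effect, which is why the order must be stated.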

Exercise 8

Let's consider a predictive model. Which variable in shaming do you think might have an important connection to primary_06?

question_text(NULL,
    message = "The person's `age` is probably connected to `primary_06`, but so are other variables like `treatment` and past voting behavior.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

With a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be a treatment variable. We are assuming that the values of all covariates are "fixed."

Exercise 9

Write a few sentences which specify two different groups of voters with different values for age. Explain that the outcome variable might differ between these two groups.

question_text(NULL,
    message = "Some people might have a value for `age` younger than 40. Others might have a value older than 40. Those two groups will, on average, have different values for the outcome variable.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.

Exercise 10

Write a causal question which connects the outcome variable primary_06 to a covariate of interest.

question_text(NULL,
    message = "What is the causal effect of postcards on voting in the 2006 Michigan election?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

It is OK if you did not write what is written in the answer! There are lots of interesting questions which one might write.

For this tutorial, our question is:

What is the causal effect of postcards on voting?

Exercise 11

What is the Quantity of Interest which might help us to explore the answer to our question?

question_text(NULL,
    message = "The average causal effect on voting of receiving a 'Neighbors' postcard.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.

Wisdom

All we can know is that we know nothing. And that’s the height of human wisdom. - Leo Tolstoy

Our question:

What is the causal effect of postcards on voting?

Exercise 1

In your own words, describe the key components of Wisdom when working on a data science problem.

question_text(NULL,
    message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Can we use data from a primary election in Michigan in 2006 to predict behavior in a general election in Texas today?

Exercise 2

Define a Preceptor Table.

question_text(NULL,
    message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantities of interest.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The more experience you get as a data scientist, the more that you will come to understand that the Four Cardinal Virtues are not a one-way path. Instead, we circle round and round. Our initial question, instead of being fixed forever, is modified as we learn more about the data, and as we experiment with different modeling approaches.

Exercise 3

Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."

question_text(NULL,
    message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This problem is causal so one of the covariates is a treatment. In our problem, the treatment is the type of postcard that the resident receives.

Exercise 4

Create a GitHub repo called n-parameters. Make sure to click the "Add a README file" check box.

Connect the n-parameters GitHub repo to an R project on your computer. Name the R project n-parameters also.

Select File -> New File -> Quarto Document .... Provide a title ("N Parameters") and an author (you). Render the document and save it as causal_effect.qmd.

Edit the .gitignore by adding *Rproj. Save and commit this in the Git tab. Push the commit.

In the Console, run:

show_file(".gitignore")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Remove everything below the YAML header from causal_effect.qmd and render the file. Command/Ctrl + Shift + K renders the file, which automatically saves it as well.

Exercise 5

What are the units for this problem?

question_text(NULL,
    message = "The units of our Preceptor Table are individual voters in Texas around the time of the next election.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

There is no one correct answer for these types of questions; it all depends on the question we are answering. We could have focused on all voters throughout America. For this tutorial, we will be focusing on just individual voters in Texas.

Exercise 6

What moment in time does the Preceptor Table refer to? It might be helpful to refer to the N Parameters chapter.

question_text(NULL,
    message = "We care about the upcoming Texas election.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

There is no one correct answer, just like many other parts of the Preceptor Table. It all depends on the question we are trying to answer.

We can include the time in our question:

What will be the causal effect of postcards on voting in the 2026 Texas election?

Exercise 7

What is the fundamental problem of causal inference?

question_text(NULL,
    message = "The fundamental problem of causal inference is that we can only observe one potential outcome.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

If we observe that a person who got a postcard ended up voting, we can never know for sure that the postcard caused that person to vote because we will never know what the outcome would have been if that person didn't get a postcard.
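As a minimal sketch of why this matters, the base R code below starts from a hypothetical table in which both potential outcomes are known, then blanks out the one we could never observe for each person. The data frame and its column names (`vote_if_postcard`, `vote_if_no_postcard`, `treated`) are invented for illustration.

```r
# Hypothetical "all-knowing" table: both potential outcomes for four people.
po <- data.frame(
  id                  = 1:4,
  treated             = c(TRUE, FALSE, TRUE, FALSE),
  vote_if_postcard    = c(1, 1, 0, 1),
  vote_if_no_postcard = c(0, 1, 0, 0)
)

# What we actually get to see: only the outcome under the treatment received.
observed <- po
observed$vote_if_postcard[!po$treated]   <- NA
observed$vote_if_no_postcard[po$treated] <- NA

# The individual causal effect needs both columns, so it is missing for everyone.
observed$causal_effect <-
  observed$vote_if_postcard - observed$vote_if_no_postcard

observed
```

Every entry of `causal_effect` is `NA`, which is the fundamental problem in miniature: no row ever contains both potential outcomes.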

Exercise 8

How does the motto "No causal inference without manipulation." apply in this problem?

question_text(NULL,
    message = "If we did not have any manipulation, so we did not send out any postcards and instead did an observational study, then we would never be able to create a causal inference. We need to have treatments so we can compare the outcome between the different groups of treatments.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

We have to choose a variable that we can change to be the treatment. If we do not have such a variable that we can manipulate, then we would have to create a predictive model instead. For example, if we were focused on household income, one conclusion may be: Rich people are more likely to vote than poor people. Correlation does not mean causation, so we cannot assume that wealth directly causes people to have more civic engagement. In order to find a causal relationship, we would need to manipulate the treatment so that we can measure its effect on the outcome.

Exercise 9

Describe in words the Preceptor Table for this problem.

question_text(NULL,
    message = "The Preceptor Table has 5 columns. There is a column for the ID and two for the outcomes: Voting After Control and Voting After Treatment. There are two covariates: Treatment and Engagement. Each row represents one individual.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table for this problem looks something like this:

#| echo: false
tibble(ID = c("1", "2", "...", "10", "11", "...", "N"),
       voting_after_treated = c("1", "1", "...", "1", "0", "...", "1"),
       voting_after_control = c("1", "0", "...", "1", "1", "...", "0"),
       treatment = c("Yes", "No", "...", "Yes", "Yes", "...", "No"),
       engagement = c("1", "3", "...", "6", "2", "...", "2")) |>

  gt() |>
  tab_header(title = "Preceptor Table") |> 
  cols_label(ID = md("ID"),
             voting_after_treated = md("Voting After Treatment"),
             voting_after_control = md("Voting After Control"),
             treatment = md("Treatment"),
             engagement = md("Engagement")) |>
  tab_style(cell_borders(sides = "right"),
            locations = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Covariates", columns = c(treatment, engagement)) |>
  tab_spanner(label = "Outcomes", columns = c(voting_after_control, voting_after_treated))

Exercise 10

In causal_effect.qmd, load the tidyverse and the primer.data packages in a new code chunk. Label it as the setup chunk by adding #| label: setup. Render the file.

Notice that the file does not look good because it shows code and also has messages. To take care of this, add #| message: false to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all echo from the whole file:

execute: 
  echo: false

In the Console, run:

show_file("causal_effect.qmd", start = -5)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Render again. Everything looks nice because we have added code to make the file look better and more professional.

Exercise 11

Run glimpse() on shaming.


glimpse(...)
glimpse(shaming)

glimpse() gives us a look at the raw data contained within the shaming data set. The variables which begin with primary or general indicate whether or not the individual voted in those elections.

Exercise 12

Pipe shaming to count() with treatment inside of it.


shaming |> 
  count(...)
shaming |> 
  count(treatment)

Four types of treatments were used in the experiment, with voters receiving one of the four types of mailing. All of the mailing treatments carried the message, “DO YOUR CIVIC DUTY - VOTE!”.

Exercise 13

Pipe shaming to the following:

mutate(
  civ_engage = primary_00 + primary_02 + primary_04 + 
               general_00 + general_02 + general_04) |> 
select(primary_06, treatment, sex, age, civ_engage)

shaming |> 
  mutate(...)
shaming |> 
  mutate(
    civ_engage = primary_00 + primary_02 + primary_04 + 
                 general_00 + general_02 + general_04) |> 
  select(primary_06, treatment, sex, age, civ_engage)

We want to create a variable which captures the amount of civic engagement for each person. Since we don't have data about any other political activities, we define civ_engage as the number of elections participated in over the prior 6 elections.

Exercise 14

Continue the pipe with the following:

mutate(voter_class = factor(
  case_when(
    civ_engage %in% c(5, 6) ~ "Always Vote",
    civ_engage %in% c(3, 4) ~ "Sometimes Vote",
    civ_engage %in% c(1, 2) ~ "Rarely Vote"),
       levels = c("Rarely Vote", 
                  "Sometimes Vote", 
                  "Always Vote"))) 

shaming |> 
  mutate(
    civ_engage = primary_00 + primary_02 + primary_04 + 
                 general_00 + general_02 + general_04) |> 
  select(primary_06, treatment, sex, age, civ_engage) |> 
  mutate(voter_class = factor(
  case_when(
    civ_engage %in% c(5, 6) ~ "Always Vote",
    civ_engage %in% c(3, 4) ~ "Sometimes Vote",
    civ_engage %in% c(1, 2) ~ "Rarely Vote"),
       levels = c("Rarely Vote", 
                  "Sometimes Vote", 
                  "Always Vote"))) 

Classifying voters into three large groups, based on their voting history, will allow us to examine any heterogeneity in the causal effects. That is, perhaps receiving a postcard has a different average causal effect on people who always vote than it does on people who rarely vote.

x |> 
  sample_frac(0.5) |> 
  ggplot(aes(x = civ_engage, y = primary_06)) +
    geom_jitter(alpha = 0.03, height = 0.1) +
    scale_x_continuous(breaks = 1:6) + 
    scale_y_continuous(breaks = c(0, 1), labels = c("No", "Yes")) +
    labs(title = "Civic Engagement and Voting Behavior in Michigan",
         subtitle = "Past voting predicts future voting.",
         x = "Civic Engagement",
         y = "Voted in 2006 Primary Election",
         caption = "Random sample of 50% of the data from Gerber, Green, and Larimer (2008)")

It is certainly the case that people who had voted more in the past were more likely to vote in the 2006 primary.

Exercise 15

We have assigned the result of this pipe to an object named x. Type x and hit "Run Code."


x
x

Of course, when doing the analysis, you don’t know when you start what you will be using at the end. Data analysis is a circular process. We mess with the data. We do some modeling. We mess with the data on the basis of what we learned from the models. With this new data, we do some more modeling. And so on.

Exercise 16

Add the code from the pipe above and set it to an object named x in a new code chunk in causal_effect.qmd.

x <- shaming |> 
  mutate(civ_engage = primary_00 + primary_02 + primary_04 + 
               general_00 + general_02 + general_04) |> 
  select(primary_06, treatment, sex, age, civ_engage) |> 
  mutate(voter_class = factor(
    case_when(
      civ_engage %in% c(5, 6) ~ "Always Vote",
      civ_engage %in% c(3, 4) ~ "Sometimes Vote",
      civ_engage %in% c(1, 2) ~ "Rarely Vote"),
         levels = c("Rarely Vote", 
                    "Sometimes Vote", 
                    "Always Vote"))) 

Render the file.

In the Console, run:

show_file("causal_effect.qmd", start = -5)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

In the .gitignore, add *_files so that we do not commit those junk files. Commit and push.

Exercise 17

In your own words, define "validity" as we use the term.

question_text(NULL,
    message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In order to consider the two data sets to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other.

Exercise 18

Provide one reason why the assumption of validity might not hold for the variable: primary_06. Use the words "column" or "columns" in your answer.

question_text(NULL,
    message = "The voting column in our data is for a primary election in 2006 in Michigan. The voting column in the Preceptor Table is for a general election in Texas in 2026. Those are not the same things.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Fortunately, at least for our continued use of this example, we will assume that validity holds. The outcome variable in our data and in our Preceptor Table are close enough — even though one is for a primary election while the other is for a general election — that we can just stack them.

Exercise 19

Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.

The first sentence is a general statement about the overall topic, mentioning both the general class of the outcome variable and of at least one of the covariates. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence is to let the reader know, gently, about our topic.

The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question.

Type your two sentences below.

question_text(NULL,
    message = "Efforts to get your supporters to vote have always been a part of US political campaigns. Using the data from a 2006 experiment in Michigan, we seek to forecast the causal effect on voter participation of sending postcards in the Texas gubernatorial general election of 2026.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to causal_effect.qmd, Command/Ctrl + Shift + K, and then commit/push.

Justice

It is in justice that the ordering of society is centered. - Aristotle

Exercise 1

In your own words, name the four key components of Justice for working on a data science problem.

question_text(NULL,
    message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.

Exercise 2

In your own words, define a Population Table.

question_text(NULL,
    message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units since the Preceptor Table and the data are, themselves, often far apart from each other.

Exercise 3

In your own words, define the assumption of "stability" when employed in the context of data science.

question_text(NULL,
    message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Stability is all about time. Is the relationship among the columns in the Population Table stable over time? In particular, is the relationship --- which is another way of saying "mathematical formula" --- at the time the data was gathered the same as the relationship at the (generally later) time referenced by the Preceptor Table?

Exercise 4

Provide one reason why the assumption of stability might not be true in this case.

question_text(NULL,
    message = "It is possible, for instance, that a postcard informing neighbors of voting status has a bigger effect in a world with more social media.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

A change in time or the distribution of the data does not, in and of itself, demonstrate a violation of stability. Stability is about the parameters: $\beta_0$, $\beta_1$ and so on. Stability means these parameters are the same in the data as they are in the population as they are in the Preceptor Table.

Exercise 5

We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table. In your own words, define the assumption of "representativeness" when employed in the context of data science.

question_text(NULL,
    message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the data and the other rows. The second is between the other rows and the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.

Exercise 6

We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.

question_text(NULL,
    message = "All the data is from Michigan, which is, by definition, not necessarily representative of all the other states in the country.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Representativeness is important because, when it is violated, the estimates for the model parameters might be biased.

Exercise 7

We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.

question_text(NULL,
    message = "Even if Michigan were representative of the larger population, Texas certainly is not. That is, the rows in the Preceptor Table are not a random draw from the larger population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Stability looks across time periods. Representativeness looks within time periods, for the most part.

Exercise 8

In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.

question_text(NULL,
    message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.

Exercise 9

Provide one reason why the assumption of unconfoundedness might not be true (or relevant) in this case.

question_text(NULL,
    message = "The experiment should be randomized, but there is a possibility that the people who ran the experiment did not actually make it fully randomized. It is easy to lie and say that there was randomization, but we cannot know for sure if this was truly random assignment.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The great advantage of randomized assignment of treatment is that it guarantees unconfoundedness. There is no way for treatment assignment to be correlated with anything, including potential outcomes, if treatment assignment is random.
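A small simulation can make this concrete. The sketch below uses only base R and invented quantities: a hypothetical covariate `engagement`, hypothetical potential outcomes, and two hypothetical assignment rules. None of it comes from the shaming data. Under random assignment, the simple difference in turnout between treated and control groups should land close to the true average causal effect. When assignment depends on `engagement`, which also drives the potential outcomes, the same comparison should be off.

```r
set.seed(2006)
n <- 100000

# Hypothetical pre-treatment covariate and potential outcomes (all invented).
engagement       <- runif(n)
vote_if_control  <- rbinom(n, 1, 0.2 + 0.3 * engagement)
vote_if_postcard <- rbinom(n, 1, 0.3 + 0.3 * engagement)
true_ate <- mean(vote_if_postcard - vote_if_control)

# Difference in observed turnout between treated and control units.
diff_in_means <- function(treated) {
  observed <- ifelse(treated == 1, vote_if_postcard, vote_if_control)
  mean(observed[treated == 1]) - mean(observed[treated == 0])
}

# Random assignment: a coin flip that ignores everything about the person.
random_assign <- rbinom(n, 1, 0.5)

# Confounded assignment: more engaged people are more likely to be treated.
confounded_assign <- rbinom(n, 1, engagement)

est_random     <- diff_in_means(random_assign)
est_confounded <- diff_in_means(confounded_assign)

round(c(truth = true_ate, random = est_random, confounded = est_confounded), 3)
```

In this sketch the confounded comparison mixes the effect of the postcard with the effect of being a more engaged person, so it overstates the true average effect.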

Exercise 10

Summarize the state of your work so far in two or three sentences. Make reference to the data you have and to the question you are trying to answer. Feel free to copy from your answer at the end of the Wisdom Section. Mention one specific problem which casts doubt on your approach.

question_text(NULL,
    message = "Using the data from an experiment to find out whether and to what extent people are motivated to vote by social pressure, we seek to forecast the causal effect on voter participation of sending postcards in the Texas gubernatorial general election of 2026. Stability might not hold because the way people view politics has changed since 2006, in part because of social media.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Edit the summary paragraph in causal_effect.qmd as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K, and then commit/push.

Courage

Courage is being scared to death, but saddling up anyway. - John Wayne

Exercise 1

In your own words, describe the components of the virtue of Courage for analyzing data.

question_text(NULL,
    message = "Courage starts with math, explores models, and then creates the data generating mechanism.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.

Exercise 2

Load the tidymodels package.


library(...)
library(tidymodels)

Because primary_06 is a binary variable, we assume that the outcome of voting (or not) is produced from a Bernoulli distribution.

$$ primary\_06_i \sim Bernoulli(\rho) $$

Exercise 3

Load the broom package.


library(...)
library(broom)

Because the outcome variable has a Bernoulli distribution, the link function is logit. That is:

extract_eq(fit_vote$fit, intercept = "beta")
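
The logit link maps a probability $\rho$ to its log-odds, and its inverse maps log-odds back to a probability. As a minimal sketch, base R offers `qlogis()` and `plogis()` for these two directions. The probabilities below are arbitrary values chosen for illustration.

```r
# Arbitrary probabilities, for illustration only.
rho <- c(0.1, 0.3, 0.5, 0.9)

# Logit: log(rho / (1 - rho)).
log_odds <- qlogis(rho)

# Inverse logit: 1 / (1 + exp(-log_odds)).
back <- plogis(log_odds)

data.frame(rho, log_odds, back)
```

The right-hand side of the model equation lives on the log-odds scale, so it can take any real value even though $\rho$ must stay between 0 and 1.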

Exercise 4

Load the equatiomatic package.


library(...)
library(equatiomatic)

Recall that a categorical variable (whether character or factor) like treatment is turned into a $0/1$ "dummy" variable which is then re-named something like $treatmentNeighbors$. After all, we can't have words --- like "Neighbors" or "Civic Duty" --- in a mathematical formula, hence the need for dummy variables.

Exercise 5

Add library(tidymodels), library(broom), and library(equatiomatic) to the setup code chunk in causal_effect.qmd. Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("causal_effect.qmd", pattern = "tidymodels|broom|equatiomatic")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

The presence of an intercept in most models means that we can't have a separate dummy variable for each of the $N$ categories. We can have only $N - 1$. The "missing" category is incorporated into the intercept. Since treatment has five values --- "No Postcard", "Civic Duty", "Hawthorne", "Self", "Neighbors" --- the model creates four 0/1 dummy variables, giving them names like $treatmentCivic Duty$, $treatmentHawthorne$ and so on. The results for the first category, "No Postcard", are included in the intercept, which becomes the reference case relative to which the other coefficients are interpreted.

It is convenient for the control condition to be included in the intercept.
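
To see the dummy variables directly, here is a sketch using base R's `model.matrix()` on a toy data frame with one row per treatment value. The data frame `toy` is hypothetical. We assume, as described above, that "No Postcard" is the first factor level.

```r
# Toy data: one row for each treatment value, "No Postcard" first.
postcards <- c("No Postcard", "Civic Duty", "Hawthorne", "Self", "Neighbors")
toy <- data.frame(treatment = factor(postcards, levels = postcards))

# The design matrix R builds for a formula with an intercept.
dummies <- model.matrix(~ treatment, data = toy)

# An intercept column plus four 0/1 columns; none for "No Postcard".
colnames(dummies)
dummies
```

The "No Postcard" row has zeros in all four dummy columns, which is why its results are carried by the intercept alone.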

Exercise 6

Because our outcome variable is binary, start to create the model by using logistic_reg(engine = "glm").


logistic_reg(engine = "glm")

In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.

Exercise 7

Continue the pipe to fit(as.factor(primary_06) ~ age + sex + treatment + voter_class + treatment*voter_class, data = x). The "glm" engine requires that the outcome variable be a factor, so we need to transform primary_06 during the fitting process.


logistic_reg(engine = "glm") |> 
  fit(as.factor(primary_06) ~ age + sex + treatment + 
        voter_class + treatment*voter_class, 
      data = x)
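
A note on the formula: in R, `a*b` is shorthand for `a + b + a:b`, so listing treatment and voter_class separately alongside `treatment*voter_class` is harmless but redundant. The sketch below checks this with base R's `terms()`. The outcome name `y` is a placeholder, and no data is needed.

```r
# Term labels for the formula as written in the exercise (outcome renamed y).
with_mains <- attr(
  terms(y ~ treatment + voter_class + treatment*voter_class),
  "term.labels"
)

# Term labels when only the * shorthand is used.
star_only <- attr(terms(y ~ treatment*voter_class), "term.labels")

with_mains
star_only
identical(with_mains, star_only)
```

Both formulas expand to the same three terms: the two main effects and their interaction.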

We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:

extract_eq(fit_vote$fit, 
           intercept = "beta", 
           use_coefs = TRUE,
           wrap = TRUE)

$$
\begin{aligned}
\log\left[ \frac { \widehat{P( \operatorname{primary\_06} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{primary\_06} = \operatorname{1} )} } \right] &= -2.43 + 0.01(\operatorname{age}) + 0.04(\operatorname{sex}_{\operatorname{Male}}) + 0.09(\operatorname{treatment}_{\operatorname{Civic\ Duty}})\ + \\
&\quad 0.07(\operatorname{treatment}_{\operatorname{Hawthorne}}) + 0.2(\operatorname{treatment}_{\operatorname{Self}}) + 0.36(\operatorname{treatment}_{\operatorname{Neighbors}}) + 0.82(\operatorname{voter\_class}_{\operatorname{Sometimes\ Vote}})\ + \\
&\quad 1.61(\operatorname{voter\_class}_{\operatorname{Always\ Vote}}) + 0.03(\operatorname{treatment}_{\operatorname{Civic\ Duty}} \times \operatorname{voter\_class}_{\operatorname{Sometimes\ Vote}}) + 0.06(\operatorname{treatment}_{\operatorname{Hawthorne}} \times \operatorname{voter\_class}_{\operatorname{Sometimes\ Vote}}) + 0.05(\operatorname{treatment}_{\operatorname{Self}} \times \operatorname{voter\_class}_{\operatorname{Sometimes\ Vote}})\ + \\
&\quad 0.04(\operatorname{treatment}_{\operatorname{Neighbors}} \times \operatorname{voter\_class}_{\operatorname{Sometimes\ Vote}}) - 0.05(\operatorname{treatment}_{\operatorname{Civic\ Duty}} \times \operatorname{voter\_class}_{\operatorname{Always\ Vote}}) + 0.06(\operatorname{treatment}_{\operatorname{Hawthorne}} \times \operatorname{voter\_class}_{\operatorname{Always\ Vote}}) - 0.01(\operatorname{treatment}_{\operatorname{Self}} \times \operatorname{voter\_class}_{\operatorname{Always\ Vote}})\ + \\
&\quad 0.01(\operatorname{treatment}_{\operatorname{Neighbors}} \times \operatorname{voter\_class}_{\operatorname{Always\ Vote}})
\end{aligned}
$$
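
To get a feel for what the equation implies, we can plug in values by hand. The sketch below is illustrative only. It uses the rounded coefficients printed above, so the results are approximate, and the age of 50 is a hypothetical choice. The person is female and in the Rarely Vote class, so the sex, voter_class, and interaction dummies are all zero.

```r
# Rounded coefficients copied from the fitted equation above.
b_intercept <- -2.43
b_age       <- 0.01
b_neighbors <- 0.36

# Hypothetical person: 50 years old, female, Rarely Vote.
age <- 50

# Log-odds of voting with no postcard and with the Neighbors postcard.
log_odds_control   <- b_intercept + b_age * age
log_odds_neighbors <- log_odds_control + b_neighbors

# Convert log-odds to probabilities with the inverse logit.
p_control   <- plogis(log_odds_control)
p_neighbors <- plogis(log_odds_neighbors)

c(control = p_control,
  neighbors = p_neighbors,
  difference = p_neighbors - p_control)
```

The difference between the two probabilities is this person's estimated treatment effect on the probability scale. Because the age coefficient is rounded to 0.01, treat the numbers as rough.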

Exercise 8

Behind the scenes of this tutorial, an object called fit_vote has been created which is the result of the code above. Type fit_vote and hit "Run Code." This generates the same results as using print(fit_vote).


fit_vote

The code formula includes sex, a character variable with two possible values: "Male" and "Female".

The math formula includes sexMale, a 0/1 dummy variable.

Exercise 9

Create a new code chunk in causal_effect.qmd. Add two code chunk options: label: model and cache: true. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_vote.

Command/Ctrl + Shift + K. It may take some time to render causal_effect.qmd, depending on how complex your model is. But, by including cache: true you cause Quarto to cache the results of the chunk. The next time you render causal_effect.qmd, as long as you have not changed the code, Quarto will just load up the saved fitted object.

To confirm, Command/Ctrl + Shift + K again. It should be quick.

At the Console, run:

tutorial.helpers::show_file("causal_effect.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)

Exercise 10

Create another code chunk in causal_effect.qmd. Add the chunk option: label: math. In that code chunk, add something like the below. You may find it useful to add the coef_digits argument to show fewer digits after the decimal.

extract_eq(fit_vote$fit, 
           intercept = "beta", 
           use_coefs = TRUE,
           wrap = TRUE)

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("causal_effect.qmd", pattern = "extract")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

When you render your document, this formula will appear.

extract_eq(fit_vote$fit, 
           intercept = "beta", 
           use_coefs = TRUE,
           wrap = TRUE)

This is our data generating mechanism.

Exercise 11

Add two sentences to your project summary.

First, mention a weakness in your model, derived from the questions above about the key assumptions of a data science problem.

Second, explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."

Recall the beginning of our version of the summary:

Efforts to get your supporters to vote have always been a part of US political campaigns. Using the data from a 2006 experiment in Michigan, we seek to forecast the causal effect on voter participation of sending postcards in the Texas gubernatorial general election of 2026.

question_text(NULL,
    message = "The rise of social media use may have changed the efficacy of postcards in changing individual behavior. We model voting as a logistic function of treatment --- meaning type of postcard received, if any --- along with prior voting behavior, sex and age.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to causal_effect.qmd, Command/Ctrl + Shift + K, and then commit/push.

Temperance

Temperance is simply a disposition of the mind which binds the passion. - Thomas Aquinas

Exercise 1

In your own words, describe the use of Temperance in data science.

question_text(NULL,
    message = "Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began.

Exercise 2

Load the marginaleffects package.


library(...)
library(marginaleffects)

We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

Exercise 3

What is the general topic we are investigating? What is the specific question we are trying to answer?

question_text(NULL,
    message = "What is the causal effect, on the probability of voting, of different postcards on voters of different levels of political engagement?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic broadly.

Exercise 4

Enter this code into the exercise code block and hit "Run Code."

plot_comparisons(fit_vote$fit,
                 variables = "treatment",
                 by = "voter_class",
                 newdata = "balanced",
                 type = "response")

We are interested in the average treatment effect of postcards. There are 4 different postcards, each of which can be compared to what would have happened if the voter did not receive any postcard.

Exercise 5

Add library(marginaleffects) to the causal_effect.qmd setup code chunk. Create a new code chunk. Label it with label: plot. Copy/paste the code which creates your graphic.

Command/Ctrl + Shift + K to ensure that it all works as intended.

At the Console, run:

tutorial.helpers::show_file("causal_effect.qmd", pattern = "marginaleffects")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

These four treatment effects, however, are heterogeneous. They vary depending on an individual’s voting history, which we organize into three categories: Rarely Vote, Sometimes Vote and Always Vote. So, we have 12 different average treatment effects, one for each possible combination of postcard and voting history.
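
One way to write these quantities is shown below. The notation is ours: $Y_i(t)$ is the potential outcome (voting or not) for person $i$ under postcard $t$, and $v$ is a voting history category.

$$ \tau_{t,v} = E\left[\, Y_i(t) - Y_i(\text{No Postcard}) \mid voter\_class_i = v \,\right] $$

With $t$ ranging over the 4 postcards and $v$ over the 3 voting histories, there are $4 \times 3 = 12$ values of $\tau_{t,v}$.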

Exercise 6

Add the final one to two sentences to your project summary. These sentences must provide an estimate for at least one Quantity of Interest and its associated confidence interval. Your Quantity of Interest can be different from the one you started with.

Recall our current version of the summary. (It is OK if your version is different.)

Efforts to get your supporters to vote have always been a part of US political campaigns. Using the data from a 2006 experiment in Michigan, we seek to forecast the causal effect on voter participation of sending postcards in the Texas gubernatorial general election of 2026. The rise of social media use may have changed the efficacy of postcards in changing individual behavior. We model voting as a logistic function of treatment --- meaning type of postcard received, if any --- along with prior voting behavior, sex and age.

question_text(NULL,
    message = "The causal effect of receiving a postcard varies based both on its message and the level of political engagement of the person receiving it. ",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Edit your summary paragraph in causal_effect.qmd as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K.

Exercise 7

Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.

question_text(NULL,
    message = "Perhaps our data did not match the future as well as we had hoped. We estimate the treatment effect of the Neighbors postcard to be between 8% and 10%, but the true effect might lie outside this interval.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The world is always more uncertain than our models would have us believe.

Ultimately, we try to account for our uncertainty in our estimates. Even with this safeguard, we aren’t surprised if we are a bit off.

Exercise 8

Rearrange the material in causal_effect.qmd so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K to ensure that everything works.

At the Console, run:

tutorial.helpers::show_file("causal_effect.qmd")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 20)

This is the version of your QMD file which your teacher is most likely to take a close look at.

Exercise 9

Publish causal_effect.qmd to RPubs. Choose a sensible slug. Copy/paste the URL below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Add rsconnect to the .gitignore file. You don't want your personal RPubs details stored in the clear on GitHub. Commit/push everything.

Summary

This tutorial covered Chapter 10: N Parameters of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.




PPBDS/primer.tutorials documentation built on April 3, 2025, 3:11 p.m.