library(learnr)
library(tutorial.helpers)
library(gt)

library(tidyverse)
library(primer.data)
library(tidymodels)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local") 

df1 <- governors |> 
  filter(year > 1945) |> 
  select(last_name, year, state, sex, lived_after, election_age)


 #fit_all <- lm(lived_after ~ election_age*sex, data = df1)

 #write_rds(fit_all, "data/fit_all.rds")

fit_all <- read_rds("data/fit_all.rds")

governor <- data.frame(sex = c("Male", "Female"), election_age = 52)

coefficients <- coef(fit_all)

# Extract individual coefficients by name
intercept <- coefficients["(Intercept)"]
election_age <- coefficients["election_age"]
female_intercept <- intercept
female_slope <- election_age
interaction_term <- coefficients["election_age:sexMale"]
sex_male <- coefficients["sexMale"]

male_intercept <- intercept + sex_male
male_slope <- election_age + interaction_term


Introduction

This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

The world confronts us. Make decisions we must.

Imagine that you are a 40-year-old woman who is considering running for governor. One thing you might be interested in is how long candidates for governor live.

The Question

A Prudent Question is one-half of wisdom. - Francis Bacon

Exercise 1

Load tidyverse.


library(...)
library(tidyverse)

We will be working with information about candidates for governor in the United States. The data was collected for “Longevity Returns to Political Office” by Sebastian Barfort, Robert Klemmensen, and Erik Gahner Larsen (pdf).

Exercise 2

Load the primer.data package.


library(...)
library(primer.data)

A version of the data from Barfort et al (2020) is available in the governors tibble, which is included in the primer.data package.

Exercise 3

After loading primer.data in your Console, type ?governors, and paste in the Description below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

The abstract of the paper that the tibble is based on:

Does political office cause worse or better longevity prospects? Two perspectives in the literature offer contradicting answers. First, increased income, social status, and political connections obtained through holding office can increase longevity. Second, increased stress and working hours associated with holding office can have detrimental effects on longevity. To provide causal evidence, we exploit a regression discontinuity design with unique data on the longevity of candidates for US gubernatorial office. The results show that politicians winning a close election live 5–10 years longer than candidates who lose.

Exercise 4

Longevity, i.e., how long people live, is the broad topic of this tutorial. Given this topic, which variable in governors should we use as our outcome variable?

Hint: It might be useful to preview the governors tibble in the Console to get an idea of the columns.

question_text(NULL,
    message = "It would be most logical to use `lived_after`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

We will use lived_after as our outcome variable.

ggplot(df1, aes(x = lived_after)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +  # Adjusting the colors and bin size
  labs(
    title = "Distribution of Years Lived After Election",
    x = "Years Lived After Election",
    y = "Count"
  ) +
  theme_minimal(base_size = 15) +  # Increasing the base text size for readability
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 18),  # Centering and styling the title
    axis.title.x = element_text(size = 14),  # Adjusting x-axis label size
    axis.title.y = element_text(size = 14),  # Adjusting y-axis label size
    axis.text = element_text(size = 12)  # Adjusting axis text size
  ) +
  scale_x_continuous(breaks = seq(0, max(df1$lived_after, na.rm = TRUE), by = 5))  # Custom breaks for x-axis

Exercise 5

Let's imagine a brand new variable which does not exist in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome and if it is "Y," or whatever, it generates another potential outcome.

Describe this imaginary variable and how we might manipulate its value.

question_text(NULL,
    message = "We will use `exercise` as a treatment variable. If given a value of 1, they exercise regularly. If 0, they don't exercise at all, or very little. We can separate these two by having a control group of governors who did not exercise regularly while they were in office, versus a group that did.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.

Exercise 6

Given our (imaginary) treatment variable exercise, how many potential outcomes are there for each governor (the unit)? Explain why.

question_text(NULL,
    message = "There are two potential outcomes because the treatment variable `exercise` takes on two possible values: exercising regularly or not exercising regularly.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. This is a conceptual framework we apply to the data. It is never inherent in the data itself.

Exercise 7

In a few sentences, specify the two different values for the imaginary treatment variable exercise, for a single unit, guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.

question_text(NULL,
    message = "For a given governor, assume that the value of the treatment variable might be 1 (exercising regularly) or 0 (not exercising regularly). If the governor gets 1, then `lived_after` would be 75. If the governor gets 0, then `lived_after` would be 70. The causal effect of exercising versus not exercising is 75 - 70 --- i.e., the difference between the two potential outcomes --- which equals 5 years.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The definition of a causal effect is the difference between two potential outcomes. Of course, you can't just say that the causal effect is 5. The exact value depends on which potential outcome comes first in the subtraction and which comes second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.

Any causal connection means exploring the within row difference between two potential outcomes. We don't need to look at any other rows to have that conversation.
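
The within-row arithmetic can be sketched directly. This is a toy example; the potential-outcome numbers are made up, as in the exercise answer above:

```r
# Two (imaginary) potential outcomes for a single governor.
y_if_exercise    <- 75  # lived_after if exercise = 1 (treatment)
y_if_no_exercise <- 70  # lived_after if exercise = 0 (control)

# The causal effect for this unit: treatment minus control.
y_if_exercise - y_if_no_exercise
#> [1] 5
```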

Exercise 8

Let's consider a predictive model. Which variable in governors do you think might have an important connection to lived_after?

question_text(NULL,
    message = "The covariate that is the most connected to `lived_after` is election_age. If this were a predictive model, we would seek to predict how long a governor would live based on what age they were when the election happened.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

With a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be a treatment variable. We are assuming that the values of all covariates are "fixed."

Exercise 9

Specify two different groups of governors which have specific values for election_age and which might have different average values for lived_after.

question_text(NULL,
    message = "For governors, the covariate age can be used to define two distinct groups. Group 1 consists of Younger Governors aged 40-50. This group is expected to have a higher average lived_after, potentially around 80 years, due to having more years ahead to live. Group 2 includes Older Governors aged 60-70. This group is expected to have a lower average lived_after, approximately 70 years, due to their advanced age. Comparing these groups helps understand how age impacts the longevity of governors.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.

Exercise 10

Write a predictive question which connects the outcome variable lived_after to election_age, the covariate of interest.

question_text(NULL,
    message = "How does `lived_after` differ among governors with different values of `election_age`?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

We call the answer to this question your "Quantity of Interest."

Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.

Wisdom

The only true wisdom is in knowing you know nothing. - Socrates

Our question:

How long do political candidates live after the election?

Exercise 1

In your own words, describe the key components of Wisdom for working on a data science problem.

question_text(NULL,
    message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The central problem for Wisdom is: Can we use data from governors to understand the variables/relationships in the Preceptor Table?

Exercise 2

Define a Preceptor Table.

question_text(NULL,
    message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, it is easy to calculate the quantities of interest.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.

Exercise 3

Describe the key components of Preceptor Tables in general, without worrying about this specific problem.

question_text(NULL,
    message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This problem is predictive, so none of the covariates is a treatment. If the problem were causal, one of the covariates would be a treatment, and there would be a potential outcome for each possible value of that treatment.

Exercise 4

Create a Github repo called analysis. Make sure to click the "Add a README file" check box.

Connect the Github repo to an R project on your computer. Give the R project the same name.

Select File -> New File -> Quarto Document .... Provide a title -- "5 Parameters" -- and an author (you). Render the document and save it as analysis.qmd.

Edit the .gitignore by adding *.Rproj. Save and commit this in the Git tab. Push the commit.

In the Console, run:

show_file(".gitignore")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 6)

Remove everything below the YAML header from your QMD and render the file. Command/Ctrl + Shift + K first saves the file and then renders it.

Exercise 5

What are the units for this problem?

HINT: Ask yourself "Who's being affected in this scenario?"

question_text(NULL,
    message = "Our units for this problem would be individual candidates because the questions are about the life expectancy of those candidates.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.

Exercise 6

What is the outcome for this problem?

HINT: Ask yourself "What are we trying to get out of this problem?"

NOTE: The outcome is what WE want to get out of the data, based on our QUESTION. It does NOT concern the POSSIBLE values that we CAN extract.

question_text(NULL,
    message = "The number of years a candidate lives after the election.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Regardless, the central lesson is always the same: You can never look at your data too much.

Exercise 7

What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?

question_text(NULL,
    message = "Any variable that can be connected to the outcome variable is acceptable. We will use `sex`, `election_age`, and `year`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables which we have data for. Third, it is the set of covariates which we end up using in the model.

Exercise 8

What are the treatments, if any, for this problem?

question_text(NULL,
    message = "Since this is a predictive model there is no treatment per se. Any variable which one might consider a treatment is just another covariate in this context.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated, thereby, creating two or more different potential outcomes for each unit.

Exercise 9

What moment in time does the Preceptor Table refer to?

question_text(NULL,
    message = "At this point in time, we don't know what moment in time it should cover because the question itself doesn't state any clear time frame. For all we know, it could include entries from the past, present, and even future!",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

ggplot(df1, aes(x = election_age, y = lived_after, color = sex)) +
  geom_point() + 
  labs(title = "Years Lived After Election vs. Election Age",
       x = "Election Age",
       y = "Years Lived After Election") +
  theme_minimal()

Exercise 10

Define a causal effect. (Note that the model in this tutorial is predictive, not causal. We just want to make sure you understand what a causal model is.)

question_text(NULL,
    message = "A causal effect is the difference between two potential outcomes.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

According to the Rubin Causal Model, there must be two (or more) potential outcomes for any discussion of causation to make sense. This is simplest to discuss when the treatment only has two different values, thereby generating only two potential outcomes.

Exercise 11

What is the fundamental problem of causal inference?

question_text(NULL,
    message = "The fundamental problem of causal inference is that we can only observe one potential outcome.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

If the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.

Exercise 12

How does the motto "No causal inference without manipulation." apply in this problem?

question_text(NULL,
    message = "The motto does not apply because this is a predictive, not causal, model.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 13

Describe in words the Preceptor Table for this problem.

question_text(NULL,
    message = "Our Preceptor Table would contain a list of candidates on the right-hand side of the table, and on the left-hand side, would contain variables concerning their sex, the year that they got elected, the age at which they got elected, and the number of years that they lived after getting elected.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table for this problem looks something like this:

#| echo: false

tibble(ID = c("Candidate 1", "Candidate 2", "...", "Candidate 10", "Candidate 11", "...", "Candidate N"),
       lived_after = c("12", "7", "...", "10", "11", "...", "6"),
       year_elected = c("2000", "2012", "...", "2012", "2024", "...", "2050"),
       election_age = c("63", "47", "...", "52", "75", "...", "68"),
       sex = c("Female", "Male", "...", "Female", "Female", "...", "Male")) |>
  gt() |>
  tab_header(title = "Preceptor Table") |>
  cols_label(ID = md("ID"),
             lived_after = md("Years Lived After"),
             year_elected = md("Year of Election"),
             election_age = md("Age at Election"),
             sex = md("Sex")) |>
  tab_style(cell_borders(sides = "right"),
            location = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Outcome", columns = c(lived_after)) |>
  tab_spanner(label = "Covariates", columns = c(sex, year_elected, election_age))

Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem.

Exercise 14

In your QMD, load the tidyverse and the primer.data packages in a new code chunk. Label it as the setup code chunk by adding #| label: setup. Render the file.

Notice that the file does not look good because the code is visible and there are annoying messages. To take care of this, add #| message: false to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all code echos from the whole file:

execute: 
  echo: false

In the Console, run:

show_file("analysis.qmd", start = -5)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 6)

Render again. Everything looks nice because we have added code to make the file look better and more professional.

Exercise 15

In your own words, define "validity" as we use the term.

question_text(NULL,
    message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.

Exercise 16

Provide one reason why the assumption of validity might not hold for the outcome variable lived_after or for one of the covariates. Use the words "column" or "columns" in your answer.

question_text(NULL,
    message = "In our data, the information about birth and death dates for *losing* candidates isn't nearly as well-documented as it is for *winning* candidates, so the information was gathered from untrustworthy sources.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.

Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.

Exercise 17

Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.

The first sentence is a general statement about the overall topic, mentioning both the general class of the outcome variable and at least one of the covariates. It is not connected to the initial "Imagine that you are someone considering running for governor" which set the stage for this project. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence is to let the reader know, gently, about our topic.

The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question. The sentence can't be that long. Important aspects of the data include when/where it was gathered, how many observations it includes, and the organization (if famous) which collected it.

Type your two sentences below.

question_text(NULL,
    message = "Using data from candidates running for governor, we analyze variables such as candidate age, sex, and projected longevity in the United States between 2000 and 2050. Our goal is to forecast the candidate's longevity post-election, accounting for health trends, socioeconomic factors, and political tenure. This tibble contains 3,587 observations, providing extensive data for the analysis of post-election life expectancy.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to your QMD, Command/Ctrl + Shift + K, and then commit/push.

Justice

It is in justice that the ordering of society is centered. - Aristotle

Exercise 1

In your own words, name the four key components of Justice for working on a data science problem.

question_text(NULL,
    message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 2

In your own words, define a Population Table.

question_text(NULL,
    message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

We have not specified the rows for either the Preceptor Table or the Population Table concerning what sort of elections we are interested in. How about Senate races? Those are similar to gubernatorial elections in that they are state-wide and attract serious candidates.

But are there rows for other elections? What about mayoral elections? What about the town library committee? We need to make some choices here and, given that we only have data for gubernatorial elections, we need to be careful. Update:

Using data from all deceased gubernatorial candidates in the United States between 1945 and 2012, we seek to forecast candidate longevity in state-wide US races post-election.

This is a change in our Preceptor Table. We are no longer interested in candidates outside the US, or those running outside of gubernatorial or Senate contests.

Exercise 3

In your own words, define the assumption of "stability" when employed in the context of data science.

question_text(NULL,
    message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Lifespan changes over time. Therefore, our estimates for the future may need some adjustment — that is, to add years to our predicted life expectancy to account for a global change in lifespan over time.

Exercise 4

Provide one reason why the assumption of stability might not be true in this case.

question_text(NULL,
    message = "The average lifespan in 1950 is different than the average in 2020, so our lifespan in the future must be tweaked in regard to this discrepancy.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In fact, between 1960 and 2015, life expectancy for the total population in the United States increased by almost 10 years.

When we are confronted with this uncertainty, we can consider making our timeframe smaller.

The longer the time period covered by the Preceptor Table (and the data), the more suspect the assumption of stability becomes.

Exercise 5

In your own words, define the assumption of "representativeness" when employed in the context of data science.

question_text(NULL,
    message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The more expansive your Preceptor Table, the more important the assumption of representativeness becomes.

This data is highly representative of gubernatorial candidates (state officials), as it includes every candidate from 1945 to 2012.

Exercise 6

Provide one reason why the assumption of representativeness might not be true in this case.

question_text(NULL,
    message = "Only the two candidates with the most votes are included in the data set. This is unfortunate, as we would ideally look at all gubernatorial candidates (regardless of votes).",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Given this aspect of the data, we can either:

1) change the question we are answering (restricting it to just major party candidates) or 2) assume that major party candidates are not systematically different from all candidates.

In this tutorial, we will be assuming Option 2.

Exercise 7

In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.

question_text(NULL,
    message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true.

Exercise 8

Summarize the state of your work so far in two or three sentences. Make reference to the data you have and to the question you are trying to answer. Feel free to copy from your answer at the end of the Wisdom Section. Mention at least one specific problem which casts doubt on your approach.

question_text(NULL,
    message = "Using data from all deceased gubernatorial candidates in the United States from elections held between 1945 and 2012, we seek to forecast candidate longevity in state-wide US races post-election. There is concern that longevity for gubernatorial candidates will differ significantly from that for candidates in Senate and other state-wide elections.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Note: Your paragraph doesn't necessarily have to be as long as ours is :)

Edit the summary paragraph in analysis.qmd as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K, and then commit/push.

Courage

Courage is going from failure to failure without losing enthusiasm. - Winston Churchill

Exercise 1

In your own words, describe the components of the virtue of Courage for analyzing data.

question_text(NULL,
    message = "Courage starts with math, explores models, and then creates the data generating mechanism.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.

Exercise 2

Load the tidymodels package.


library(...)
library(tidymodels)

The probability family is determined by the outcome variable lived_after.

If lived_after is a continuous variable, the probability family is Normal, also known as Gaussian.
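
In symbols, a sketch of the Normal family for unit $i$ (the formula for $\mu_i$ comes from the link function, discussed next):

$$lived\_after_i \sim N(\mu_i, \sigma^2)$$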

Exercise 3

Load the gtsummary package.


library(...)
library(gtsummary)

The link function, the basic mathematical structure of the model, is (mostly) determined by the type of outcome variable.

For a continuous outcome variable, we use a linear model for the link function:

$$\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots$$

Exercise 4

Load the marginaleffects package.


library(...)
library(marginaleffects)

We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

Exercise 5

Load the equatiomatic package.


library(...)
library(equatiomatic)

This is the basic mathematical structure of our model:

extract_eq(fit_all, 
           intercept = "beta", 
           wrap = TRUE, 
           show_distribution = TRUE)

Of course, in a normal data science project, we would explore a variety of different combinations of independent variables before selecting a final set. Just pretend that we already did that.

Exercise 6

Add library(tidymodels), library(gtsummary), library(equatiomatic), and library(marginaleffects) to the setup code chunk in your QMD. Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("XX.qmd", pattern = "tidymodels|gtsummary|equatiomatic|marginaleffects")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Recall that a categorical variable (whether character or factor) like sex is turned into a $0/1$ "dummy" variable which is then re-named something like $sex_{Male}$. After all, we can't have words --- like "Male" or "Female" --- in a mathematical formula, hence the need for dummy variables.
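You can see this encoding directly with model.matrix(), using a made-up three-row data frame (a standalone sketch, unrelated to governors):

```r
# Toy data with a two-level categorical variable
toy <- data.frame(sex = c("Male", "Female", "Male"))

# R absorbs "Female" (the first level alphabetically) into the intercept
# and creates a single 0/1 dummy column named sexMale
model.matrix(~ sex, data = toy)
```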

Exercise 7

Because our outcome variable is continuous, start to create the model by using linear_reg(engine = "lm").


linear_reg(engine = "...")
linear_reg(engine = "lm")

The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy $0/1$ variables. The presence of an intercept in most models means that we can't have $N$ categories. The "missing" category is incorporated into the intercept. If race has three values --- "black", "hispanic", and "white" --- then the model creates two 0/1 dummy variables, giving them names like $race_{hispanic}$ and $race_{white}$. The results for the first category are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied.

Note there is no variable like this in our problem.
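Although governors has no three-level categorical variable, the $N-1$ encoding is easy to demonstrate with a hypothetical race variable:

```r
# A hypothetical three-level categorical variable
toy <- data.frame(race = c("black", "hispanic", "white"))

# Three categories yield two dummy columns; "black" becomes the
# reference case, absorbed into the intercept
model.matrix(~ race, data = toy)
```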

Exercise 8

Continue the pipe to fit(lived_after ~ election_age, data = df1).


... |> 
  fit(..., data = ...)
linear_reg(engine = "lm") |> 
   fit(lived_after ~ election_age, data = df1)

We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:

extract_eq(fit_all, 
           intercept = "beta", 
           use_coefs = TRUE)

There are three main differences between this representation of the model and our previous one. First, we replace the parameters with our best estimate of their values. Second, the error term is gone. Third, the dependent variable now has a "hat," indicating that it is our "fitted" value, our best guess as to the value of the outcome, given the values of the independent variables for any given unit.
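You can verify the "hat" arithmetic on any fitted model. A sketch using a toy lm() on R's built-in cars data (not the tutorial's fit_all):

```r
# Toy model: stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# Fitted value for speed = 15, computed by hand from the estimates
by_hand <- b[["(Intercept)"]] + b[["speed"]] * 15

# The same fitted value from predict()
by_predict <- predict(fit, newdata = data.frame(speed = 15))

all.equal(unname(by_hand), unname(by_predict))
```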

Exercise 9

Behind the scenes of this tutorial, an object called fit_all has been created which is the result of the code above. Type fit_all and hit "Run Code." This generates the same results as using print(fit_all).


fit_all
fit_all

In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.

Exercise 10

Create a new code chunk in your QMD. Add two code chunk options: label: model and cache: true. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_all.

Command/Ctrl + Shift + K. It may take some time to render your QMD, depending on how complex your model is. But, by including cache: true you cause Quarto to cache the results of the chunk. The next time you render your QMD, as long as you have not changed the code, Quarto will just load up the saved fitted object.
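The chunk in your QMD might look something like this (a sketch using the model code from Exercise 8; your formula may differ):

```r
#| label: model
#| cache: true

fit_all <- linear_reg(engine = "lm") |> 
  fit(lived_after ~ election_age, data = df1)
```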

To confirm, Command/Ctrl + Shift + K again. It should be quick.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)

Add *_cache to your .gitignore file. Commit and push. Cached objects are often large; they don't belong on GitHub.

Exercise 11

Create another code chunk in your QMD. Add the chunk option: label: math. In that code chunk, add something like the below. You may find it useful to add the coef_digits argument to show fewer significant digits after the decimal.

extract_eq(fit_XX, 
           intercept = "beta", 
           use_coefs = TRUE,
           )

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", pattern = "extract")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

When you render your document, this formula will appear.

extract_eq(fit_all, 
           intercept = "beta", 
           use_coefs = TRUE)

This is our data generating mechanism.

Our formula relates lived_after to election_age: we are claiming that we can predict lived_after from election_age because the two are linearly related.

Exercise 12

Run tidy() on fit_all with the argument conf.int set equal to TRUE. This returns 95% confidence intervals for all the parameters in our model.


tidy(..., conf.int = ...)
tidy(fit_all, conf.int = TRUE)

tidy() is part of the broom package, used to summarize information from a wide variety of models.

Exercise 13

For interactive use, tidy() is very handy. But, for presenting our results, we should use a presentation package like gtsummary, which includes handy functions like tbl_regression().

Run tbl_regression(fit_all).


tbl_regression(...)
tbl_regression(fit_all)

See this tutorial for a variety of options for customizing your table.

Exercise 14

Create a new code chunk in your QMD. Add a code chunk option: label: table. Add this code to the code chunk.

tbl_regression(fit_all)

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", pattern = "tbl_regression")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

We can make the table look nice using modify_header(label ~ "...") to add a header.

Like this:

tbl_regression(fit_all) |> modify_header(label ~ "Regression Analysis of Factors Predicting Years Lived After Election")

Exercise 15

Add a sentence to your project summary.

Explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a linear function of X [and maybe other covariates]."

Recall the beginning of our version of the summary: "Using data from all deceased gubernatorial candidates in the United States from elections held between 1945 and 2012, we seek to forecast candidate longevity in state-wide US races post-election. There is concern that longevity for gubernatorial candidates will differ significantly from that for candidates in Senate and other state-wide elections."

question_text(NULL,
    message = "We model the number of years that candidates live after the election as a linear function of their age and sex.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your sentence, edited or otherwise, to the summary paragraph portion of your QMD. Command/Ctrl + Shift + K, and then commit/push.

Temperance

Temperance is a tree which has for its root very little contentment, and for its fruit calm and peace. - Buddha

Exercise 1

In your own words, describe the use of Temperance in data science.

question_text(NULL,
    message = "Temperance uses the data generating mechanism to answer the question with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the question with which we began. We create posteriors for the quantities of interest.

Exercise 2

What is the specific question we are trying to answer?

question_text(NULL,
    message = "How long do political candidates live after the election?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic more broadly.

Exercise 3

Enter this code into the exercise code block and hit "Run Code."

plot_predictions(fit_all, 
                 condition = "election_age")

plot_predictions(fit_all, 
                 condition = ...)
plot_predictions(fit_all,
                 condition = "election_age")

In the plot above, we examine the predicted years lived after the election (lived_after) across the range of election_age.

Exercise 4

We can compare how long a female versus a male candidate is expected to live after being elected, using comparisons().

comparisons(fit_all, newdata = governor)

comparisons(fit_all, newdata = governor)
comparisons(fit_all, newdata = governor)

From this analysis, we can see that males are expected to live significantly fewer years than females after being elected, with an average difference of 11.7 years. This gap is statistically significant (p < 0.001), meaning the result is unlikely to be due to chance. The 95% confidence interval suggests that the difference could range from 4.7 to 18.6 fewer years for males compared to females, a consistent and substantial disparity in post-election life expectancy between the sexes. The z-score of about 3.3 (the estimate of 11.7 divided by its standard error of 3.5453) reinforces the finding.
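The quoted numbers fit together; a quick check using the reported estimate and standard error (values copied from the paragraph above, so small rounding differences are expected):

```r
estimate  <- 11.7
std_error <- 3.5453

# z-score is the estimate divided by its standard error
z <- estimate / std_error

# 95% confidence interval: estimate plus/minus about 1.96 standard errors
ci <- estimate + c(-1, 1) * qnorm(0.975) * std_error

round(z, 2)
round(ci, 2)
```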

Exercise 5

Using plot_comparisons(), we can see how the effect of election_age on post-election lifespan differs between males and females.

plot_comparisons(fit_all,
  variables = "election_age",
condition = "sex")

plot_comparisons(fit_all,
  variables = "...",
  condition = "...")
plot_comparisons(fit_all,
  variables = "election_age",
  condition = "sex")

From this, we can see that males are expected to live about 11.7 fewer years than females after being elected. This gap is statistically significant (p < 0.001), meaning the result is highly unlikely to be due to chance. The confidence interval suggests that the difference could range from 4.7 to 18.6 fewer years for males compared to females, indicating a consistent and substantial disparity in post-election life expectancy between the sexes. Additionally, for every one-year increase in election_age, the expected post-election lifespan decreases slightly, by about 0.056 years, though this result is not statistically significant (p = 0.879), indicating that election age does not have a meaningful effect on post-election life expectancy.

Exercise 6

For further analysis, we can create a plot that compares the fitted lines for women and men, showing expected years lived after the election. Enter this code into the exercise code block and hit "Run Code."

ggplot(df1, aes(x = election_age, y = lived_after, color = sex)) +
  geom_point() +
  geom_abline(intercept = female_intercept,
              slope = female_slope,
              color = "#F8766D",
              size = 1) +
  geom_abline(intercept = male_intercept,
              slope = male_slope,
              color = "#00BFC4",
              size = 1) +
  labs(title = "Interaction Model",
       subtitle = "Comparing post election lifespan across sex",
       x = "Average Age at Time of Election",
       y = "Years Lived Post-Election",
       color = "Sex") +
  theme_classic()


ggplot(df1, aes(x = election_age, y = lived_after, color = sex)) +
  geom_point() +
  geom_abline(intercept = ...,
              slope = female_slope, 
              color = "#F8766D", 
              size = 1) +
  geom_abline(... = male_intercept,
              slope = male_slope,
              color = "#00BFC4", 
              size = 1) +
  labs(title = "Interaction Model",
       subtitle = "Comparing post election lifespan across sex",
       x = "Average Age at Time of Election", 
       y = "Years Lived Post-Election", 
       color = "Sex") +
 ...()
ggplot(df1, aes(x = election_age, y = lived_after, color = sex)) +
  geom_point() +
  geom_abline(intercept = female_intercept,
              slope = female_slope, 
              color = "#F8766D", 
              size = 1) +
  geom_abline(intercept = male_intercept,
              slope = male_slope,
              color = "#00BFC4", 
              size = 1) +
  labs(title = "Interaction Model",
       subtitle = "Comparing post election lifespan across sex",
       x = "Average Age at Time of Election", 
       y = "Years Lived Post-Election", 
       color = "Sex") +
  theme_classic()
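The female_intercept, female_slope, male_intercept, and male_slope objects used above were built from the coefficients of the interaction model. The same construction, shown as a self-contained sketch with a toy interaction model on R's built-in mtcars data (not the tutorial's fit_all):

```r
# Toy interaction model: mpg as a function of weight, transmission type,
# and their interaction (am recoded as a two-level factor)
d <- transform(mtcars, am = factor(am, labels = c("auto", "manual")))
fit <- lm(mpg ~ wt * am, data = d)
b <- coef(fit)

# The reference group ("auto") line comes straight from the base terms
auto_intercept <- b[["(Intercept)"]]
auto_slope     <- b[["wt"]]

# The "manual" line shifts the intercept by the dummy coefficient and
# the slope by the interaction coefficient
manual_intercept <- auto_intercept + b[["ammanual"]]
manual_slope     <- auto_slope + b[["wt:ammanual"]]
```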

Exercise 7

Add library(marginaleffects) to your QMD setup code chunk.

Create a new code chunk. Label it with label: plot. Copy/paste the code which creates your graphic.

Command/Ctrl + Shift + K to ensure that it all works as intended.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Exercise 8

Write the last sentence of your summary paragraph. It describes at least one quantity of interest (QoI) and provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)

question_text(NULL,
    message = "The quantity of interest in our problem is lived_after. The estimate of -1.53 represents the effect size for the interaction term election_age:sexMale, indicating how much the outcome is expected to change for a one-unit change in the predictor. The uncertainty around this estimate is captured by the 95% confidence interval, which ranges from -1.53 to -0.07. This range tells us that, with 95% confidence, the true effect lies within this interval. The fact that the confidence interval does not include zero indicates that the effect is statistically significant. Additionally, the p-value of less than 0.001 further supports this, suggesting that there is a very low probability that the observed effect is due to chance. Despite the significance, the confidence interval provides a valuable measure of uncertainty, showing that while the effect is negative, there is some variability in its exact magnitude.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Add a final sentence to your summary paragraph in your QMD as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K.

Exercise 9

Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.

question_text(NULL,
    message = "The estimates for the quantities of interest and their associated uncertainty might be wrong due to issues like the issue of validity or because the candidates that did not win the election of less accurate info, becaue they were not documented properly.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 10

Rearrange the material in your QMD so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K to ensure that everything works.

At the Console, run:

tutorial.helpers::show_file("analysis.qmd")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

This is the version of your QMD file which your teacher is most likely to take a close look at.

Exercise 11

Publish your rendered QMD to Rpubs. Choose a sensible slug. Copy/paste the resulting url below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Add rsconnect to the .gitignore file. You don't want your personal Rpubs details stored in the clear on GitHub. Commit/push everything.

Exercise 12

Copy/paste the url to your Github repo.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Summary

This tutorial covered Chapter 9: Five Parameters of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane. This tutorial was created by Sanaka Dash and David Kane.




PPBDS/primer.tutorials documentation built on April 3, 2025, 3:11 p.m.