In PPBDS/primer.tutorials: Tutorials for Preceptor's Primer for Bayesian Data Science

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(gt)
library(knitr)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

# Key Data 

gt_obj <- tibble(subject = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
       treatment = c("Treated", "Treated", "Control", "Control", "Treated"), 
       ytreat = c("13", "14", "11", "9", "3"),
       ycontrol = c("9", "11", "6", "12", "4"),
       ydiff = c("?", "?", "?", "?", "?")) |>
   gt() |>
  cols_label(subject = md("ID"),
                treatment = md("Treatment"),
                ytreat = md("$$Y_t(u)$$"),
                ycontrol = md("$$Y_c(u)$$"),
                ydiff = md("$$Y_t(u) - Y_c(u)$$")) |>
  cols_move(columns = c(treatment, ytreat, ycontrol), after = c(subject)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(subject, treatment))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(subject)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Outcomes", c(ytreat, ycontrol))  |>
  tab_spanner(label = "$$\\text{Estimand}$$", c(ydiff))


data_table <- tibble(id = c("Robert", "Beau", "Ishan", "Nicholas"),
       height_cm = c("178", "172", "173", "165")) |>
  gt() |>
  cols_label(id = md("ID"),
                height_cm = md("Height (cm)")) |>
  cols_move(columns = height_cm, after = c(id)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(id))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(id)) |>
  fmt_markdown(columns = everything())


precep_1 <- tibble(id = c("Robert", "Andy", "Beau", "Ishan", "Nicholas"),
       height_cm = c("?", "?", "?", "?", "?")) |>
  gt() |>
  cols_label(id = md("ID"),
                height_cm = md("Height (cm)")) |>
  cols_move(columns = height_cm, after = c(id)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(id))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(id)) |>
  fmt_markdown(columns = everything())

precep_table <- tibble(id = c("Robert", "Andy", "Beau", "Ishan", "Nicholas"),
       height_cm = c("178", "?", "172", "173", "165")) |>
  gt() |>
  cols_label(id = md("ID"),
                height_cm = md("Height (cm)")) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(id))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(id)) |>
  fmt_markdown(columns = everything())


precep_table_2 <- tibble(id = c("Robert", "Andy", "Beau", "Ishan", "Nicholas"),
       height_cm = c("178", "172", "172", "173", "165")) |>
  gt() |>
  cols_label(id = md("ID"),
                height_cm = md("Height (cm)")) |>
  cols_move(columns = height_cm, after = c(id)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(id))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(id)) |>
  fmt_markdown(columns = everything())

Introduction

This tutorial covers Chapter 1: Rubin Causal Model of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

This tutorial will review key concepts including the Preceptor Table, the Population Table, potential outcomes, causal effects, validity, stability, representativeness, and unconfoundedness. We assume that you have read the chapter.

Preceptor Table

We would not need data science if we (and our bosses, colleagues, and clients) did not have questions. Every data science project starts with a question.

What is the average height of 5 brothers? We have this data:

data_table

Sadly, we are missing the height for Andy, the fifth brother.

This section will explore the use of a Preceptor Table to answer this question.

Exercise 1

Define a Preceptor Table in your own words.

question_text(NULL,
    message = "A table such that, if none of the data is missing, you can easily calculate your quantity of interest.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

Preceptor Tables vary in the number of their rows and columns. We use question marks to indicate missing data in a Preceptor Table.

Exercise 2

Describe the units and outcomes in the Preceptor Table which would allow us to answer our question.

question_text(NULL,
    message = "The units are the 5 brothers. So, we need a row for each brother. The outcome is height, measured in centimeters. So, we need a column for height.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The best way to construct a Preceptor Table is to begin by answering questions about units, outcomes and treatments. In this case, there are no treatments, so we only need to specify the units and the outcome.

Exercise 3

Describe what the Preceptor Table should look like for this example.

question_text(NULL,
    message = "The Preceptor Table will have a row for each brother. In addition to an ID column, it will have one other column: height, which is the outcome.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table should look like:

precep_1

Question marks are used to indicate missing data. It is tempting to "plug in" the data that we have for four of the brothers. For example, don't we "know" that Robert is 178 centimeters tall? Can't we just replace "?" with 178 in our Preceptor Table?

Exercise 4

Implicit in every Preceptor Table is a notion of time. Describe a scenario in which a time difference between data-collection and question-answering might matter.

question_text(NULL,
    message = "What if the brothers are children and the data were collected in 2020? But it is now 2024. Do we really know that, for example, Robert's current height is 178? No! We don't. This (potentially) makes answering our question much more difficult.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In other words, there really should be a Time column in both the data table:

data_table |> 
  cols_add(Year = rep(2020, 4), .after = id)

And in the Preceptor Table:

precep_1 |> 
  cols_add(Year = rep(2024, 5), .after = id)

Exercise 5

Describe in words our Preceptor Table if we assume that the height of the four brothers has not changed since 2020.

question_text(NULL,
    message = "The Preceptor Table is the same as before. It has five rows, one for each brother, and a column for height. But, instead of 5 question marks for missing data, it now only has one, for Andy.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

It is so common to assume, implicitly, that nothing has changed between the data collection period and now --- or whatever time period the question refers to --- that we often go straight to a Preceptor Table which includes the data that we have.

precep_table

Exercise 6

Describe in your own words the assumption of validity.

question_text(NULL,
    message = "Validity is the consistency, or lack thereof, in the columns of your dataset and the corresponding columns in your Preceptor Table. In order to consider the two datasets as being drawn from the same population, the columns from one must have a valid correspondence with the columns in the other.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

We hope that all this new terminology is not overwhelming. The "Cardinal Virtues" chapter in the Primer provides an overview of the most important terms and their definitions.

Exercise 7

Come up with a scenario for this problem in which the assumption of validity might not hold.

question_text(NULL,
    message = "There are many reasonable answers to this question. What if the data were collected while the brothers had their shoes on but the question about average height refers to height without shoes? In that case, the meaning of 'height' would be different between the data and the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Much of what a good data scientist does is to try to think of reasons why questions cannot be answered. Assumptions are rarely fully true. We need to test them. The best way to start testing is to come up with (plausible) objections. We will practice this skill over and over again.

Exercise 8

Describe in your own words what the assumption of validity, if it is true, allows us to do. Use the verb "stack" and the phrase "Population Table" in your answer.

question_text(NULL,
    message = "If the assumption of validity holds, we can 'stack' the rows from the data and from the Preceptor Table into the same table. This structure forms the start of the Population Table, although it will generally have many more rows.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

x_1 <- tibble(id = c("Robert", "Beau", "Ishan", "Nicholas"),
              year = 2020,
              height_cm = c("178", "172", "173", "165"))

x_2 <- tibble(id = c("Robert", "Andy", "Beau", "Ishan", "Nicholas"),
            year = 2024,
            height_cm = c("?", "?", "?", "?", "?"))

rbind(x_1, x_2) |> 
  mutate(source = c(rep("Data", 4), rep("Preceptor Table", 5))) |> 
  select(source, year, id, height_cm) |> 
  gt() |>
  cols_label(source = md("Source"),
             id = md("ID"),
             year = md("Year"),
              height_cm = md("Height (cm)")) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(id))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(id)) |>
  fmt_markdown(columns = everything())

This object is not (yet) a Population Table. It is the step between having two tables (the data and the Preceptor Table) and then, given the assumption of validity, creating the full Population Table.

The key addition we still need is more rows. For example, there is no row for Andy in 2020 because we did not collect data for Andy in that year. But Andy did exist! We could, in theory, have measured his height and recorded it, along with the height of his brothers. The Population Table, because it includes a row for every unit/time in the population, will include a row for Andy in 2020, even though the data will be missing.
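
To make the missing rows concrete, here is a minimal sketch in base R. It assumes, purely for illustration, that the population covers only the years 2020 and 2024; the real Population Table would span more years. The object names `brothers`, `observed`, and `population` are ours, not the Primer's.

```r
# Hypothetical sketch: one row per brother/year combination.
brothers <- c("Robert", "Andy", "Beau", "Ishan", "Nicholas")

# Assumption: only two years, to keep the table small.
population <- expand.grid(id = brothers, year = c(2020, 2024),
                          stringsAsFactors = FALSE)

# The data we actually collected in 2020 (Andy was not measured).
observed <- data.frame(id = c("Robert", "Beau", "Ishan", "Nicholas"),
                       year = 2020,
                       height_cm = c(178, 172, 173, 165))

# A left join keeps every unit/time row; unmeasured heights become NA.
population <- merge(population, observed, by = c("id", "year"), all.x = TRUE)

# Label each row by where it comes from.
population$source <- ifelse(population$year == 2024, "Preceptor Table",
                     ifelse(is.na(population$height_cm), "...", "Data"))

population[order(population$year, population$id), ]
```

Andy's 2020 row appears with a missing height and the source "...", which is the point of the paragraph above: the row exists in the population even though we never recorded it.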

Exercise 9

In your own words, describe the process by which we can go from the combined table in the previous Exercise to the "informal" Preceptor Table with which we began:

precep_table

In particular, what do we need to assume to use, for example, Robert's measured height in 2020 to fill in Robert's height in the Preceptor Table in 2024?

question_text(NULL,
    message = "The key assumption is that height is constant, at least for the 4 years we care about and for these men. That is, although we have not measured Robert's height in 2024, we assume that it is the same as the height we measured in 2020: 178 centimeters.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The informal Preceptor Table does not include either the Source or the Time column. (Even the ID column is not really necessary. Knowing that the missing measurement is for someone named "Andy" does not help us to estimate the overall average. The only thing we really need is the vector of 4 heights which we do know.)
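
As a rough sketch of that last point, assume the 2020 heights still hold in 2024. The vector name `height_cm` mirrors the table column. Using the mean of the four known heights as a stand-in for Andy is one simple choice made here for illustration, not a method the Primer prescribes.

```r
# Heights in centimeters; NA marks Andy's missing value.
height_cm <- c(Robert = 178, Andy = NA, Beau = 172, Ishan = 173, Nicholas = 165)

# Average of the four heights we do know.
known_mean <- mean(height_cm, na.rm = TRUE)
known_mean

# Assumption for illustration: use that average as a guess for Andy.
filled <- height_cm
filled[is.na(filled)] <- known_mean

# Average height of all five brothers under that guess.
mean(filled)
```

Both averages come out to 172 centimeters. Plugging in the mean for the missing value cannot move the overall average, which is why the names attached to the heights do not matter here.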

Population Table

The Population Table shows the overall population we are interested in. It combines the rows from the Preceptor Table and our data set. There are three sources of rows for the Population Table: units we want (Preceptor Table), units we have already (the data), and the other units (the rest of the population which is neither in the data nor in the Preceptor Table).

Exercise 1

In your own words, give a one-sentence definition of a Population Table.

question_text(NULL,
    message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn. It can be constructed if the validity assumption is (mostly) true. It includes all the rows from the Preceptor Table. It also includes the rows from the data set. It usually has other rows as well, rows which represent unit/time combinations from other parts of the population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

The rows from the rest of the population usually contain no data. These are units that fall within our population but that we never measured, so all of their values are missing.

Exercise 2

Our Preceptor Table had two columns: ID and height. Our data table has the same two columns. What other columns are usually added to create a Population Table?

question_text(NULL,
    message = "Population Tables also include `Source` and `Time` columns.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Source column indicates the source of the row. The Time column, which is often named with the relevant unit of time (like year or date or whatever), indicates the moment in time to which the data in the row refers.

Exercise 3

What are the three sources for the rows in the Population Table? How are these rows usually indicated in the Source column?

question_text(NULL,
    message = "The three sources for the rows in the Population Table are the data, the Preceptor Table, and the other units in the population from which the data and the Preceptor Table were drawn. These rows are usually indicated with 'Data', 'Preceptor Table', and '...', respectively.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In casual usage, we will often leave out the Source or Time columns from the Population Table, but they are always there, at least implicitly.

Exercise 4

What combination of variables uniquely defines each row in the Population Table?

question_text(NULL,
    message = "Each row in the Population Table is uniquely defined by an ID/Time combination.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Beau, along with the other brothers, will appear multiple times in the Population Table. The Time 2024 will appear multiple times, but there will only be one row for Beau in 2024.
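
A quick way to check this uniqueness is sketched below in base R. The year range 2020 through 2024 is an assumption chosen to keep the example small, and `population` is a hypothetical object built only for this check.

```r
brothers <- c("Robert", "Andy", "Beau", "Ishan", "Nicholas")

# Assumption: the population spans 2020 through 2024.
population <- expand.grid(id = brothers, year = 2020:2024,
                          stringsAsFactors = FALSE)

# 0 means no ID/year pair is repeated, so the pair identifies a row.
anyDuplicated(population[, c("id", "year")])

# Beau appears once for every year in the range.
sum(population$id == "Beau")

# The year 2024 appears once for every brother.
sum(population$year == 2024)

# But Beau in 2024 is exactly one row.
sum(population$id == "Beau" & population$year == 2024)
```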

Exercise 5

In your own words, explain why "time is always a lie" in a Population Table.

question_text(NULL,
    message = "Time variables in data science, particularly in Population Tables, can be misleading due to inaccurate measurements, hidden variations, and lack of specificity. The value for the time variable in rows corresponding to the Preceptor Table is often ambiguous, as it may refer to the present, future, or whenever the analysis is completed.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The purpose of the Population Table is to force us to think clearly about the questions we are trying to answer. The time variable, and the problems therein, are a major part of that task.

Exercise 6

For simplicity, let's drop the ID column. Describe in words what the Population Table for this problem should look like.

question_text(NULL,
    message = "There will be 3 columns: `Source`, `Year`, and `Height`. There will be rows for every row from the data table and for every row in the Preceptor Table. In addition, there will be rows from the broader population, perhaps going back in time to 2010 and forward in time to 2040.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

It is OK if your description is somewhat different from ours. Bordering dates like 2010 and 2040 are certainly arbitrary. However, given that we are interested in the heights of one specific family, it would be weird if the date range was too broad. None of the brothers were even born before 1900, for example. Whatever words you use, in your mind you should be picturing something like this:

tibble(subject = c("...", "...", "...", "...", "...", "Data", "Data", "...",
                   "Preceptor Table", "Preceptor Table", "...", "..."),
       year = c("2010", "...", "2015", "2015", "...", "2020", "2020", "...", "2024", "2024", "...", "2040"),
       height = c("?", "...", "?", "?", "...", "178", "172", "...", "?", "?", "...", "?")) |>
  gt() |>
  cols_label(subject = md("Source"),
                year = md("Year"),
                height = md("Height")) |>
  cols_move(columns = year, after = c(subject)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(subject))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(subject)) |>
  fmt_markdown(columns = everything())

Most Population Tables won't show as many ... rows as this one. Those rows are always there; we just leave them out for convenience. The ... rows indicate that there are many units with Year 2010 (and 2011 and 2012 and ...), just as there are many rows between 2015 and 2024. The population is usually much larger than either the data or the Preceptor Table.

Causal Effect

Use the following "imaginary" Preceptor Table to answer some questions about "estimands", and the quantities in which we might be interested.

gt_obj

This Preceptor Table is "imaginary" because we can never know, for example, Yao's outcome under treatment and outcome under control. We only get to see one of these.

Exercise 1

In your own words, give a one-sentence definition of a causal effect.

question_text(NULL,
    message = "A causal effect is the difference between two potential outcomes.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

In most circumstances, we are interested in comparing two experimental manipulations, one generally termed “treatment” and the other “control.” The difference between the potential outcome under treatment and the potential outcome under control is a “causal effect” or a “treatment effect.”

Exercise 2

quiz(question("What is the causal effect of the treatment on Yao?",
    answer("1"),
    answer("2"),
    answer("3"),
    answer("4", correct = TRUE),
    answer("5"),
    allow_retry = FALSE))

According to the RCM, the causal effect of hearing Spanish on the train platform is the difference between what your attitude would have been under “treatment” (hearing Spanish) and under “control” (not hearing Spanish).

Exercise 3

quiz(question("For how many of the five people is the causal effect of the treatment positive?",
    answer("1"),
    answer("2"),
    answer("3", correct = TRUE),
    answer("4"),
    answer("5"),
    allow_retry = FALSE))

To calculate the causal effect, we need to compare the outcome for an individual in one possible state of the world (with treatment) to the outcome for that same individual in another state of the world (without treatment).

Exercise 4

quiz(question("On whom did the treatment have the most negative causal effect?",
    answer("Yao"),
    answer("Emma"),
    answer("Cassidy"),
    answer("Tahmid", correct = TRUE),
    answer("Diego"),
    allow_retry = FALSE))

We will use the symbol $Y$ to represent potential outcomes, the variable we are interested in understanding and modeling. $Y$ is called the response or outcome variable. It is the variable we want to “explain.” In our case this would be the attitude score. If we are trying to understand a causal effect, we need two symbols so that control and treated values can be represented separately: $Y_t(u)$ and $Y_c(u)$.
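
The arithmetic behind the last three exercises can be sketched in base R. The numbers are copied from the imaginary Preceptor Table above; the data frame name `outcomes` is ours.

```r
# Values from the imaginary Preceptor Table, stored as numbers.
outcomes <- data.frame(
  subject  = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
  ytreat   = c(13, 14, 11, 9, 3),
  ycontrol = c(9, 11, 6, 12, 4)
)

# Causal effect for each unit: Y_t(u) - Y_c(u).
outcomes$ydiff <- outcomes$ytreat - outcomes$ycontrol
outcomes

# How many effects are positive, and who has the most negative one?
sum(outcomes$ydiff > 0)
outcomes$subject[which.min(outcomes$ydiff)]
```

The differences are 4, 3, 5, -3, and -1, so three effects are positive and the most negative belongs to Tahmid. This calculation is only possible because the table is imaginary and shows both potential outcomes for every person.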

Exercise 5

In these examples, we did not have to deal with the Fundamental Problem of Causal Inference because we knew the outcomes for both treatment and control. Write a paragraph explaining the Fundamental Problem of Causal Inference. Include an example relating to your own life.

question_text(NULL,
    message = "The Fundamental Problem of Causal Inference is that, because we can never observe more than one potential outcome, we can never be certain about the value of a causal effect. For example, to determine the causal effect of studying on my SAT score, I need to know two things: my score if I study and my score if I don't study. The causal effect is the difference between the two. However, I can only either study or not study. I can't do both! So, I can only observe one potential outcome.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 7)

The Rubin Causal Model (RCM) is based on the idea of potential outcomes. However, it is impossible to observe both potential outcomes at once. One of the potential outcomes is always missing, since a unit cannot travel back in time and experience both treatments. This dilemma is the Fundamental Problem of Causal Inference.
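
One way to see the problem is to take the imaginary table and blank out whatever we could not have observed. The sketch below does that in base R. The object names `outcomes`, `treated`, and `observed` are hypothetical; the values and treatment assignments come from the table above.

```r
# The imaginary table, with both potential outcomes for everyone.
outcomes <- data.frame(
  subject   = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
  treatment = c("Treated", "Treated", "Control", "Control", "Treated"),
  ytreat    = c(13, 14, 11, 9, 3),
  ycontrol  = c(9, 11, 6, 12, 4)
)

treated <- outcomes$treatment == "Treated"

# What we can actually see: only the outcome matching each unit's treatment.
observed <- outcomes
observed$ytreat[!treated]  <- NA   # control units never reveal Y_t(u)
observed$ycontrol[treated] <- NA   # treated units never reveal Y_c(u)
observed

# The unit-level difference can no longer be computed for anyone.
observed$ytreat - observed$ycontrol
```

Every row is now missing exactly one potential outcome, so every unit-level causal effect is `NA`.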

Exercise 6

In this tutorial, we have used two examples. The first involved the height of 5 brothers. The second involved attitudes toward immigration. As discussed in the Primer, there are two types of models: Causal and Predictive.

In one sentence, explain which type of model the brother-height example involves and why.

question_text(NULL,
    message = "The brother-height problem involves the creation of a predictive model because there is only one outcome: height. ",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In a predictive model, there is only one value under consideration for each unit. There is only one outcome.

Exercise 7

In one sentence, explain which type of model the immigration-attitudes example is and why.

question_text(NULL,
    message = "The immigration-attitude example involves a causal model because there are two potential outcomes: a person's attitude toward immigration if they hear Spanish on the train platform and the same person's attitude if they do not.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

In a causal model, there are two (or more) potential outcomes. Determining whether or not a model is causal --- a determination that can only be made by considering the original question --- is one of the first steps in constructing the Preceptor Table.

Exercise 8

In your own words, write two sentences that explain the meaning of the phrase "No causation without manipulation."

question_text(NULL,
    message = "The motto 'No causation without manipulation' suggests that for a causal effect to be well-defined, both potential outcomes must be possible, at least conceptually. However, this raises complex questions about whether characteristics like race, sex, and genetic conditions can be considered causal, as they may not be easily manipulated.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This motto is most likely to come up in initial discussions with your colleague/boss/client. Many people lack intuition for how to think about problems that appear causal but which, on closer inspection, are really predictive. The motto helps focus ideas.

Assumptions

This section reviews the key assumptions required in every attempt at inference: validity, stability, representativeness, and unconfoundedness.

Exercise 1

Write two sentences about validity, as used as an assumption in the Primer.

question_text(NULL,
    message = "The assumption of validity allows us to ignore any variation in treatment or in any other variable. It allows us to 'stack' the rows from the data, the Preceptor Table, and the population into a single Population Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Each of our two sources for rows --- the Preceptor Table and the actual data --- is coherent in terms of what the columns mean on their own. The assumption of validity, if true, allows us to "stack" them together and to consider both of them to have been drawn from the same larger population.

Exercise 2

Write two sentences about stability, as used as an assumption in the Primer.

question_text(NULL,
    message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Validity is about the columns. For example, does "exposure to Spanish speakers" mean the same thing in our data from 2014 as it does in the data we want to have --- for our Preceptor Table --- in 2024? Stability is about the rows. Is the connection between the treatment and the potential outcomes the same in 2024 as it was in 2014?

Exercise 3

In your own words, give a one-sentence definition of a potential outcome.

question_text(NULL,
    message = "An outcome which occurs in the case in which the treatment has the specified value.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

The phrase "potential outcome" by itself is not well-defined. There is no singular "potential outcome." The very definition of potential outcome includes a treatment of a certain value. Your lifespan, by itself, is not a potential outcome. Your lifespan if you exercise is a potential outcome. Your lifespan if you don't exercise is a potential outcome. Those will be different numbers. The causal effect of exercise on lifespan is the difference between those two potential outcomes.

Exercise 4

Write two sentences about representativeness, as used as an assumption in the Primer.

question_text(NULL,
    message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table: the first between the Preceptor Table and the other rows, and the second between our data and the other rows. Ideally, we would like both the Preceptor Table and our data to be random samples from the population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Why do we have the data that we have? The easiest way to draw inferences about other rows, especially the rows in the Preceptor Table subpart of the Population Table, is if our data is a random draw from the entire population. This is almost never true. How untrue it is determines how much we need to worry about the representativeness of our data.

Exercise 5

Write two sentences about unconfoundedness, as used as an assumption in the Primer.

question_text(NULL,
    message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates. This assumption is only relevant for causal models. We describe a model as “confounded” if this is not true.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

"Unconfoundedness" as an assumption only applies in the case of causal, not predictive, models. The best way to ensure unconfoundedness is to randomize the treatment across units so that we can estimate the average treatment effect by subtracting the average outcome for control units from the average outcome for treated units, as we do above.

Exercise 6

Write two sentences that explain the difference between a causal model and a predictive model.

question_text(NULL,
    message = "A predictive model assumes one outcome whereas a causal model allows for more than one outcome, which we term 'potential' outcomes in that case. In the Preceptor Table, a predictive model has only one outcome column while a causal model has at least two.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

With a predictive model, we cannot infer what would happen to the outcome $Y$ if we change $X$ for a given unit. We can only compare two units, one with one value of $X$ and another with a different value of $X$.

Simple models

In these exercises, we will assume that the causal effect, $\tau$ (pronounced tau, rhymes with "cow"), is the same for everyone.

Use the following Preceptor Table to answer questions about the single value for tau.

tibble(subject = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
       treatment = c("Treated", "Treated", "Control", "Control", "Treated"),
       ytreat = c("13", "14", "?", "?", "3"),
       ycontrol = c("?", "?", "6", "12", "?"),
       ydiff = c("?", "?", "?", "?", "?")) |>
  gt() |>
  cols_label(subject = md("ID"),
                treatment = md("Treatment"),
                ytreat = md("$$Y_t(u)$$"),
                ycontrol = md("$$Y_c(u)$$"),
                ydiff = md("$$Y_t(u) - Y_c(u)$$")) |>
  cols_move(columns = c(treatment, ytreat, ycontrol), after = c(subject)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(subject))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(subject)) |>
  tab_spanner(label = "Outcomes", c(ytreat, ycontrol)) |>
  tab_spanner(label = "$$\\text{Estimand}$$", c(ydiff)) |>
  fmt_markdown(columns = everything())

The question marks in the Outcomes are the missing values that we can never know because of the Fundamental Problem of Causal Inference. Because those values are missing, we cannot calculate the causal effect with simple arithmetic. The estimand is therefore not a value we calculate but the unknown quantity we want to estimate.

Exercise 1

Describe in one sentence/equation how you would estimate Yao's $Y_c(u)$. (Do not use actual numbers, use "tau" in your explanation).

question_text(NULL,
    message = "Yao's $Y_t(u)$ - (Sum of all values in $Y_t(u)$ / Number of Values - Sum of all values in $Y_c(u)$ / Number of Values)",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

By definition, $Y_t(u) - Y_c(u) = \tau$. Using simple algebra, it is clear that $Y_c(u) = Y_t(u) - \tau$.

Exercise 2

Describe in one sentence/equation how you would estimate Tahmid's $Y_t(u)$. (Do not use actual numbers, use "tau" in your explanation).

question_text(NULL,
    message = "(Sum of values in $Y_t(u)$ / Number of Values - Sum of values in $Y_c(u)$ / Number of values) + Tahmid's $Y_c(u)$",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

Using the same logic as above, $Y_t(u)$ = $Y_c(u)$ + $\tau$

One model might be that the causal effect is the same for everyone. There is a single parameter, τ, which we then estimate. Once we have an estimate, we can fill in the Preceptor Table because, knowing it, we can estimate what the unobserved potential outcome is for each person. We use our assumption about τ to estimate the counterfactual outcome for each unit.

Exercise 3

Describe in one sentence/equation how you would estimate a single value for tau.

question_text(NULL,
    message = "(Sum of values in $Y_t(u)$ / Number of Values - Sum of values in $Y_c(u)$ / Number of values)",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

Exercise 4

quiz(

  question("What is your estimate for a single value for tau?",
    answer("-2"),
    answer("2"),
    answer("1", correct = TRUE),
    answer("-1"),
    answer("1.5"),
    allow_retry = FALSE),

  question("What is your estimate for $Y_c(u)$ for Emma?",
    answer("10"),
    answer("11"),
    answer("12"),
    answer("13", correct = TRUE),
    answer("14"),
    allow_retry = FALSE),

  question("What is your estimate for $Y_t(u)$ for Cassidy?",
    answer("6"),
    answer("5"),
    answer("7", correct = TRUE),
    answer("9"),
    answer("8"),
    allow_retry = FALSE)
)

Once we have an estimate for tau, we could add it to the observed value of every observation in the control group (or subtract it from the observed value of every observation in the treatment group) and thus fill in all the missing values.
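The fill-in step can be sketched in a few lines of R. This is only an illustration under the constant-effect assumption, using the observed outcomes from this section (13, 14 and 3 for the treated units; 6 and 12 for the control units). The names `obs`, `y_obs`, `tau_hat` and `filled` are made up for this sketch and are not part of the tutorial.

```r
library(dplyr)

# Observed data: each subject reveals only one potential outcome.
obs <- tibble(
  subject   = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
  treatment = c("Treated", "Treated", "Control", "Control", "Treated"),
  y_obs     = c(13, 14, 6, 12, 3)
)

# Constant-effect model: estimate tau as the difference between group means.
tau_hat <- mean(obs$y_obs[obs$treatment == "Treated"]) -
  mean(obs$y_obs[obs$treatment == "Control"])

# Fill in both potential outcomes for every subject:
# add tau_hat to control observations, subtract it from treated ones.
filled <- obs |>
  mutate(
    ytreat   = if_else(treatment == "Treated", y_obs, y_obs + tau_hat),
    ycontrol = if_else(treatment == "Control", y_obs, y_obs - tau_hat)
  )

filled
```

Every row of `filled` then has `ytreat - ycontrol` equal to `tau_hat`, which is exactly what the single-parameter assumption imposes.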

Exercise 5

Assume that the causal effect varies by sex. We will now estimate two values for $\tau$: $\tau_F$ and $\tau_M$. (Cassidy and Emma are female; Tahmid, Diego, and Yao are male).

tibble(subject = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
       treatment = c("Treated", "Treated", "Control", "Control", "Treated"),
       ytreat = c("13", "14", "$$6 + \\tau_F$$", "$$12 + \\tau_M$$", "3"),
       ycontrol = c("$$13 - \\tau_M$$", "$$14 - \\tau_F$$", "6", "12", "$$3 - \\tau_M$$"),
       ydiff = c("$$\\tau_M$$", "$$\\tau_F$$", "$$\\tau_F$$", "$$\\tau_M$$", "$$\\tau_M$$")) |>
  gt() |>
  cols_label(subject = md("ID"),
                treatment = md("Treatment"),
                ytreat = md("$$Y_t(u)$$"),
                ycontrol = md("$$Y_c(u)$$"),
                ydiff = md("$$Y_t(u) - Y_c(u)$$")) |>
  cols_move(columns = c(treatment, ytreat, ycontrol), after = c(subject)) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"), 
            locations = cells_column_labels(columns = c(subject))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(subject)) |>
  tab_spanner(label = "Outcomes", c(ytreat, ycontrol)) |>
  tab_spanner(label = "$$\\text{Estimand}$$", c(ydiff)) |>
  fmt_markdown(columns = everything())

How would you calculate $\tau_F$? Use only words and no numbers in your explanation.

question_text(NULL,
    message = "(Average $Y_t(u)$ for females - Average $Y_c(u)$ for females)",
    # AJ: You would average $Y_c(u)$ for females and subtract that from the average $Y_t(u)$ for females
    # AJ: which is better?
    # TJ: I like the first one.
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

A second model, one that uses two taus, might assume that the causal effect differs between levels of a category but is the same within each level. For example, perhaps there is a $\tau_F$ for females and a $\tau_M$ for males.
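As a rough sketch, the two group-level effects can be estimated by taking the difference in means within each sex. The `sex` column follows the assignment stated in Exercise 5; the object and column names (`obs`, `y_obs`, `tau_by_sex`, `tau_hat`) are illustrative assumptions, not names used by the tutorial.

```r
library(dplyr)

# Observed outcomes plus the sex covariate from Exercise 5.
obs <- tibble(
  subject   = c("Yao", "Emma", "Cassidy", "Tahmid", "Diego"),
  sex       = c("M", "F", "F", "M", "M"),
  treatment = c("Treated", "Treated", "Control", "Control", "Treated"),
  y_obs     = c(13, 14, 6, 12, 3)
)

# One tau per level of sex: treated mean minus control mean within each group.
tau_by_sex <- obs |>
  group_by(sex) |>
  summarise(
    tau_hat = mean(y_obs[treatment == "Treated"]) -
      mean(y_obs[treatment == "Control"])
  )

tau_by_sex
```

With so few units per group, each estimate rests on one or two observations, so the numbers are best read as a demonstration of the mechanics rather than as reliable estimates.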

Exercise 6

What is the meaning of $\tau_M$?

question_text(NULL,
    message = "$\\tau_M$ is an estimate of the average difference between the outcome under treatment and the outcome under control for males",
# AJ: This explanation could be made more clear and also needs to be added to the primer
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

When we group units by a “category” such as sex, we call that category a covariate. Possible covariates include, but are not limited to, sex, age, political party, and almost everything else that might be associated with an individual unit.

Exercise 7

What is your new estimate for Diego's $Y_c(u)$?

question_text(NULL,
    message = "7",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

Exercise 8

Discuss for a sentence or two why an assumption that the causal effect varies by sex leads to a different estimate for Diego's $Y_c(u)$ compared to Cassidy's $Y_c(u)$.

question_text(NULL,
    message = "Since Diego is male and Cassidy is female, the tau used to calculate the causal effect is different for each.",
    # AJ: Add this into the primer
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

This difference in estimates highlights the difficulties of inference. Models drive inference. Different models will produce different inferences.
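To make the point concrete, here is a small base R sketch that estimates Diego's outcome under control twice, once with a single tau for everyone and once with a separate tau for males. The variable names are hypothetical, and the data are the observed outcomes used throughout this section.

```r
# Observed outcomes, treatment indicator, and sex indicator.
y    <- c(Yao = 13, Emma = 14, Cassidy = 6, Tahmid = 12, Diego = 3)
trt  <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
male <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# Model 1: one tau for all units.
tau_all <- mean(y[trt]) - mean(y[!trt])

# Model 2: a separate tau for males, estimated from male units only.
tau_m <- mean(y[trt & male]) - mean(y[!trt & male])

# Diego was treated, so his outcome under control is his observed value minus tau.
diego_constant <- y[["Diego"]] - tau_all
diego_by_sex   <- y[["Diego"]] - tau_m

c(constant = diego_constant, by_sex = diego_by_sex)
```

The data never change between the two lines that compute Diego's counterfactual; only the modeling assumption does.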

Exercise 9

We will no longer make any assumptions about $\tau$ for any individual or group. Instead, we are interested in estimating the average treatment effect ($ATE$). We have the same data as the previous sections.

gt_obj

Using words only, explain how we estimate the $ATE$.

question_text(NULL,
    message = "Average of treated values minus the average of control values.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

The average treatment effect (ATE) is the average difference between the potential outcome under treatment and the potential outcome under control. Because averaging is a linear operator, the average difference is the same as the difference between the averages. The ATE is particularly useful because we don’t have to assume anything about each individual’s $\tau$, like $\tau_{yao}$, but can still understand something about the average causal effect across the whole population.
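The linearity claim is easy to check numerically. The sketch below assumes the fully filled-in potential outcomes from the `gt_obj` table defined in the setup chunk; in real data only one of the two outcomes per unit would be observed, so this is a check of the arithmetic, not an estimation recipe. The names `avg_of_diffs` and `diff_of_avgs` are made up for the example.

```r
# Both potential outcomes for every unit, as given in the setup chunk's gt_obj.
ytreat   <- c(Yao = 13, Emma = 14, Cassidy = 11, Tahmid = 9,  Diego = 3)
ycontrol <- c(Yao = 9,  Emma = 11, Cassidy = 6,  Tahmid = 12, Diego = 4)

# Average of the unit-level differences.
avg_of_diffs <- mean(ytreat - ycontrol)

# Difference between the two column averages.
diff_of_avgs <- mean(ytreat) - mean(ycontrol)

# Compare with a tolerance, since the two routes can differ by floating-point rounding.
all.equal(avg_of_diffs, diff_of_avgs)
```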

Exercise 10

Calculate the $ATE$ based on the data given to you in the above Preceptor Table.

question_text(NULL,
    message = "1.6",
# AJ: think of a way to make this a non decimal answer without screwing up the other questions
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

Exercise 11

What is Cassidy's outcome under treatment if we assume $\tau$ to be the $ATE$ we calculated above? Note that the answer will just be a number, without any symbol.

question_text(NULL,
    message = "9.4",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

Exercise 12

Write a paragraph about the many, many reasons why $ATE$ may be a bad estimate of the true average treatment effect.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 7)

Consider an example when the treatment effect does vary depending on sex. For males, there is a small negative effect (-4), but for females, there is a larger positive effect (+8). However, the average treatment effect for the whole sample, even if you estimate it correctly, will be a single positive number (+1) – since the positive effect for females is larger than the negative effect for males.

Exercise 13

Write a paragraph about what a heterogeneous treatment effect is and the situations when it is more or less common.

question_text(NULL,
    message = "A heterogeneous treatment effect means that the effect of the treatment varies from individual to individual. A situation where this would be common is when testing drugs. Most people will have a different reaction to the drug, so we can't just assume that the causal effect of the drug is the same for everyone.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 7)

Summary

This tutorial covered Chapter 1: Rubin Causal Model of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

This tutorial reviewed key concepts including the Preceptor Table, the Population Table, potential outcomes, causal effect, validity, stability, representativeness, and unconfoundedness. For further clarification of these concepts, see Key Concepts.

PPBDS/primer.tutorials documentation built on April 3, 2025, 3:11 p.m.
