library(learnr)
library(tutorial.helpers)
library(gt)
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(equatiomatic)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local")

fit_att <- linear_reg(engine = "lm") |> 
  fit(att_end ~ treatment, data = trains)

# tidy(fit_att, conf.int = TRUE)

# extract_eq(fit_att, intercept = "beta")
# extract_eq(fit_att, intercept = "beta", use_coefs = TRUE)

# plot_predictions(fit_att, 
#                  condition = "treatment")
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
The world confronts us. Make decisions we must.
Imagine that you are a campaign manager for a Republican Congressional candidate. Your goal is to elect your candidate, who is conservative about immigration. You know that the more conservative a voter is about immigration, the more likely she is to vote for your boss. How should you spend your campaign funds to increase your odds of winning the election?
Load the tidyverse package.
library(...)
library(tidyverse)
The trains
tibble measures attitudes toward immigration among Boston commuters. Individuals were exposed to one of two possible conditions, and then their attitudes towards immigrants were recorded. One condition was waiting on a train platform near individuals speaking Spanish. The other was being on a train platform without Spanish-speakers.
Load the primer.data package.
library(...)
library(primer.data)
The trains
tibble is available in primer.data.
After loading primer.data, type ?trains
in the Console, and paste in the Description below.
question_text(NULL, message = NULL, answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The data we have was collected from commuters on nine train platforms around Boston, Massachusetts in 2012. It records attitudes toward immigration-related policies, both before and after an experiment which randomly exposed a treated group to Spanish-speakers on a Boston commuter train platform.
Attitude toward immigration is the broad topic of this tutorial. Given that topic, which variable in trains
should we use as our outcome variable?
question_text(NULL, message = "att_end", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
att_end
will be our outcome variable, and we will compare it using treatment.
trains |> 
  ggplot(aes(x = att_end, fill = treatment)) +
    geom_bar(aes(y = after_stat(count/sum(count))), 
             position = "dodge") +
    labs(title = "Ending Attitude Toward Immigration",
         subtitle = "Treated Individuals Are More Conservative",
         x = "Attitude",
         y = "Probability",
         fill = NULL) +
    scale_y_continuous(labels = scales::percent_format()) +
    theme_classic()
Our treatment variable is binary, meaning that it takes on only one of two values. In theory, we should always be able to manipulate the treatment. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome, and if it is "Y," or whatever, it generates another potential outcome.
Describe the treatment variable. How might we manipulate its value?
question_text(NULL, message = "The treatment is either 1, treated, or 0, the control. We can manipulate this variable at least in theory, by applying the treatment to the treatment group, and withholding it from the control group.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.
Given our choice of exposure to Spanish speakers as the treatment variable, how many potential outcomes are there for each person? Explain why.
question_text(NULL, message = "There are 2 potential outcomes because the treatment variable, exposure to Spanish-speakers, takes on 2 possible values: exposure to Spanish-speakers on a train platform versus no such exposure.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Any data set can be used to construct a predictive model.
The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We just use different outcome variables and/or specify different treatment variables. All of this is a conceptual framework we apply to the data; it is never inherent in the data itself. In the end, it depends on what question is being posed, and on what you want to know.
Write a sentence which speculates as to the value of the 2 different potential outcomes which we might observe in att_end
for each person when we change the value of the treatment variable treatment
.
question_text(NULL, message = "When the treatment is applied, we speculate the person will have a higher number for att_end compared to when no treatment is applied. For example, if the value of the treatment variable is 1, a person will have a value of 8 for att_end, and if the value of treatment is 0 they will have a value of 4 for att_end.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The point of the Rubin Causal Model is that the definition of a causal effect is the difference between potential outcomes. So, there must be two (or more) potential outcomes for any causal model to make sense. This is simplest to discuss when the treatment takes on only two different values, thereby generating only two potential outcomes. But, if the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.
Write a few sentences which specify two different values for the treatment variable for a single unit, guess at the potential outcomes which would result, and then calculate the causal effect for that unit given those guesses.
question_text(NULL, message = "For a given person, assume that the value of the treatment variable might be exposure or no exposure. If the person gets exposure to Spanish-speakers, then att_end would be 10. If the person gets no exposure, then att_end would be 8. The causal effect on the outcome of a treatment of exposure versus no exposure is 10 - 8 --- i.e., the difference between two potential outcomes --- which equals 2.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The definition of a causal effect is the difference between two potential outcomes. But it's not like first grade subtraction, where we subtract the smaller number from the bigger number. By default, the causal effect is defined as treatment minus control.
Any causal connection means exploring the difference between two potential outcomes. We can look at just one row and define a causal effect as the difference between its two potential outcomes, but generally we work with the average across many rows. In any case, because of the Fundamental Problem of Causal Inference, it is impossible to observe the causal effect for a single row directly; we must average across rows and estimate the values missing from one of the potential outcome columns.
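The averaging idea above can be sketched in R. The data set below is made up (the column names mirror those in trains, but the numbers are invented); it estimates the average treatment effect as the difference in group means:

```r
# Sketch: estimating an average treatment effect as the difference in
# mean outcomes between treated and control rows. The data are made up.
toy <- data.frame(
  treatment = c("Treated", "Treated", "Control", "Control"),
  att_end   = c(10, 12, 8, 9)
)

# Mean outcome within each treatment group.
group_means <- tapply(toy$att_end, toy$treatment, mean)

# Treatment minus control, the usual convention.
ate <- unname(group_means["Treated"] - group_means["Control"])
ate
```

With the real data, the same idea is trains |> group_by(treatment) |> summarize(avg = mean(att_end)), followed by subtracting the control average from the treated average.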
Let's consider a predictive model. Which variable in trains
do you think might have an important connection to att_end
?
question_text(NULL, message = "Key covariates might include sex, race, treatment, and party.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The key point is that, with a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be treatment variables. We are assuming that all covariates are "fixed."
Different values of key covariates are associated with differences in the average outcome. Even if a covariate is not a treatment, we can add it to the model because we see a greater difference in the outcome across its values.
question_text(NULL, message = "Some people might have a value for party of Republican. Others might have a value of Democrat. Those two groups will, on average, have different values for the outcome variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.
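To make the "differences" language concrete, here is a minimal sketch using an invented party column: we compare average outcomes across groups, making no causal claim.

```r
# Sketch: predictive comparison between groups defined by a covariate.
# The party and att_end values below are invented for illustration.
toy <- data.frame(
  party   = c("Democrat", "Democrat", "Republican", "Republican"),
  att_end = c(6, 8, 10, 12)
)

# Difference in average outcomes between the two groups. This is a
# "difference," not an "effect": party is not a treatment here.
diff_in_means <- mean(toy$att_end[toy$party == "Republican"]) -
  mean(toy$att_end[toy$party == "Democrat"])
diff_in_means
```

The correct phrasing is: Republicans, on average, differ from Democrats in att_end; not: party "causes" att_end.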
Write a causal question which connects the outcome variable att_end
to treatment
, the covariate of interest.
question_text(NULL, message = "What is the causal effect of exposure to Spanish-speakers on attitudes toward immigration?", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
You can only use causal language --- like "affect," "influence," "impact," "cause," "causal effect," et cetera --- in your question if you are creating a causal model, one with a treatment variable which you might, at least in theory, manipulate and with at least two potential outcomes.
With a predictive model, your question should focus on a comparison between different rows, or groups of rows, in the Preceptor Table.
What is a Quantity of Interest which might help us to answer our question?
question_text(NULL, message = "Difference between the attitudes towards immigration of the treated group versus control group at the end of the experiment (the treatment effect).", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.
You will almost always calculate a posterior probability distribution for your Quantity of Interest since, in the real world, you will never know your QoI precisely.
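As a rough illustration (not a fitted model), we can mimic what a posterior for a Quantity of Interest looks like by simulating draws around an invented estimate and summarizing the uncertainty:

```r
# Sketch: a posterior-like distribution for a quantity of interest.
# The center (1.5) and spread (0.5) are invented for illustration.
set.seed(9)
draws <- rnorm(10000, mean = 1.5, sd = 0.5)

# A 95% interval summarizing our uncertainty about the QoI.
quantile(draws, probs = c(0.025, 0.975))
```

The point estimate is never the whole story; the spread of the distribution tells us how precisely we know the QoI.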
This is the question we will be answering:
What is the effect of exposing people to Spanish-speakers on their attitudes toward immigration?
To answer that specific question, here is the quantity of interest that we will estimate:
Difference between the attitudes towards immigration of the treated group versus control group at the end of the experiment (the treatment effect).
Once we have our specific question, we can start with the Cardinal Virtues.
Wisdom begins in wonder. - Plato
We begin our data science project with a general question. Imagine we are designing an immigration policy for the government; we wonder how people, those affected most by the policy, would respond. In other words,
We are interested in the attitudes of people toward immigration.
We decide to narrow down our question to a specific location and group, say adults in Chicago, Illinois. We wonder how these people would feel when near foreigners, say people who only speak Spanish.
What is the effect of exposing people to Spanish-speakers on their attitudes toward immigration?
In your own words, describe the key components of Wisdom for working on a data science problem.
question_text(NULL, message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If it is not valid to consider the data you have and the (theoretical) data from the Preceptor Table to have arisen out of the same population, your attempt to estimate your quantity of interest ends at the first stage.
Define a Preceptor Table.
question_text(NULL, message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, it is easy to calculate the quantities of interest.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
question_text(NULL, message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
As we are considering a causal model, our Preceptor Table will have two columns for the potential outcomes. In any causal model, there is at least one covariate which is defined as the “treatment,” something which we can manipulate, at least in theory, so that some units receive one version and other units get a different version.
What are the units for this problem?
question_text(NULL, message = "Our units for this scenario would be individuals because the questions are about the attributes of unique people at the station. The question does not specify which individuals we are interested in, so assume it is adults in Chicago.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Note that the units and the quantity of interest are two different things. The quantity of interest is the number we want to estimate to answer our question, which here is the effect of exposing people to Spanish-speakers on attitude toward immigration.
What is/are the outcome/outcomes for this problem?
question_text(NULL, message = "A person’s attitude toward immigration is the outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A person's attitude toward immigration is measured using three questions, each measuring agreement on a 1 to 5 integer scale, with 1 being liberal and 5 being conservative. For each person, the three answers were summed, generating an overall measure of attitude toward immigration which ranges from 3 (very liberal) to 15 (very conservative).
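The construction of the score can be sketched directly. The three question scores below are invented; the point is only that summing three 1-to-5 answers yields a 3-to-15 scale:

```r
# Sketch: one person's answers to the three immigration questions, each
# scored 1 (liberal) to 5 (conservative). Values are invented.
q1 <- 4
q2 <- 5
q3 <- 3

# The overall attitude is the sum, so it must lie between 3 and 15.
att <- q1 + q2 + q3
att
```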
What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?
question_text(NULL, message = "Possible covariates include, but are not limited to, sex, age, political party and almost everything else which might be associated with attitudes toward immigration.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables for which we have data. Third, it is the set of covariates in the data which we end up using in the model.
What are the treatments, if any, for this problem?
question_text(NULL, message = "In this case, the treatment is exposure to Spanish-speakers. Units can either be exposed, i.e., they receive the 'treatment', or they can not be exposed, i.e., they receive the 'control'.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In any causal model, there is at least one covariate which is defined as the “treatment,” something which we can manipulate, in theory, so that some units receive one version and other units get a different version. A “treatment” is just a covariate which we could manipulate, at least in theory.
What moment in time does the Preceptor Table refer to?
question_text(NULL, message = "We are interested in the causal effect today, in the year 2024.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The notion of time is important, both in our Preceptor Table and in our data. Our data comes from some point in the past, even if it was collected yesterday or just minutes prior, while our questions usually refer to now or to an indeterminate moment in the future.
Define causal effect.
question_text(NULL, message = "A causal effect is the difference between two potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In most circumstances, we are interested in comparing two experimental manipulations, one generally termed “treatment” and the other “control.” According to the Rubin Causal Model (RCM), the causal effect of being on the platform with Spanish-speakers is the difference between what your attitude would have been under “treatment” (with Spanish-speakers) and under “control” (no Spanish-speakers).
What is the Fundamental Problem of Causal Inference?
question_text(NULL, message = "The fundamental problem of causal inference is that we can only observe one potential outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In our experiment, it is impossible to observe both potential outcomes at once. One of the potential outcomes is always missing, since a person cannot travel back in time, and experience both treatments.
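A small sketch makes the missing-data structure visible. Each (invented) unit has one observed potential outcome and one that is necessarily missing:

```r
# Sketch: for every unit, exactly one potential outcome is observed;
# the other is missing (NA). All values are invented.
po <- data.frame(
  id             = 1:3,
  treated        = c("Yes", "No", "Yes"),
  att_if_treated = c(10, NA, 7),
  att_if_control = c(NA, 8, NA)
)

# The unit-level causal effect is NA for every row, because one of the
# two potential outcomes is always missing.
effects <- po$att_if_treated - po$att_if_control
all(is.na(effects))
```

This is exactly the pattern the asterisked cells represent in the Preceptor Table below.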
How does the motto "No causal inference without manipulation." apply in this problem?
question_text(NULL, message = "The causal effect of exposure to Spanish-speakers is well defined because it is the simple difference of two potential outcomes, both of which might happen. In this case, we (or something else) can manipulate the world, so that it is possible for us to measure the different outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The question of what can and cannot be manipulated is very complex when it comes to race, sex, or genetics and should be considered with care. For instance, we cannot increase a person's height, so it makes no sense to investigate the causal effect of height on weight; hence the slogan: no causation without manipulation.
Describe in words the Preceptor Table for this problem.
question_text(NULL, message = "The Preceptor Table has one row for every adult in Chicago in 2024, two columns for people's attitude towards immigration when exposed to Spanish-speakers and when not, and one column indicating whether they belong to 'treatment' or 'control' group.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table for this problem looks something like this:
#| echo: false
tibble(ID = c("1", "2", "...", "10", "11", "...", "N"),
       attitude_after_control = c("5*", "7", "...", "3*", "10*", "...", "6"),
       attitude_after_treated = c("8", "4*", "...", "5", "7", "...", "13*"),
       treatment = c("Yes", "No", "...", "Yes", "Yes", "...", "No")) |>
  gt() |>
  tab_header(title = "Preceptor Table") |>
  cols_label(ID = md("ID"),
             attitude_after_control = md("Control Ending Attitude"),
             attitude_after_treated = md("Treated Ending Attitude"),
             treatment = md("Treatment")) |>
  tab_style(cell_borders(sides = "right"),
            location = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"),
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Potential Outcomes",
              columns = c(attitude_after_control, attitude_after_treated)) |>
  tab_spanner(label = "Covariate", columns = c(treatment))
Write one sentence describing the data you have to answer your question.
question_text(NULL, message = "The data include information about each respondent’s sex, political affiliations, age, income and so on. 'treatment' indicates whether a subject was in the control or treatment group. The key outcome is their attitude toward immigration after the experiment: 'att_end'.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem. For example, at the start, we aren't sure what right-hand side variables will be included in the model, so we are not yet sure which covariates must be in the Preceptor Table.
Let's practice creating and publishing a quarto document to answer the question we have in a professional way. Create a Github repo called causal-effect
. Make sure to click the "Add a README file" check box. Copy/paste the URL for its Github location.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Professional data scientists always do and store their work on Github, or a similar "source control" tool. If your computer blows up, you don't want to lose your work.
Connect the causal-effect
Github repo to an R project on your computer. Name the R project causal-effect
also.
In the Console, run:
list.files()
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Perhaps one of the most important things when working with data is always questioning what we have been told. In this case, it's worth asking questions such as how the participants were selected, what time and where the experiment took place.
Select File -> New File -> Quarto Document ...
. Provide a title ("Causal Effect") and an author (you). Save the document as analysis.qmd
.
In the Console, run:
list.files(all.files = TRUE)
CP/CR.
The all.files = TRUE
argument for list.files()
lists all the files/directories, including the "hidden" ones whose names begin with a period.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
If the experiment took place at rush hour, when the trains are crowded and everyone is busy, people may get frustrated at having to share public spaces with immigrants. In that case, rush hour is a covariate that might affect the outcome, the effect of exposing people to Spanish-speakers on their attitude toward immigration.
Edit the .gitignore
by adding *Rproj
and *_files
.
In the Console, run:
tutorial.helpers::show_file(".gitignore")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Consider the same case: if we are aware of this potential factor which may affect the outcome and decide to record rush hour when collecting the data, then rush hour
, like treatment
, is a covariate which is measured but is not the outcome.
Remove everything below the YAML header from analysis.qmd
and save the file. In the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
With the same case again, when putting variables into our DGM, if we decide to only include treatment
, but not rush hour
, then only treatment
is considered a covariate.
Add a new code chunk, load the tidyverse and primer.data packages. Save the file and in the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Note that the three above examples represent three different contexts in which we might use the term covariates. The second usage is, obviously, a subset of the first, and the third usage is a subset of the second.
On top of that code chunk, add #| message: FALSE
to prevent messages that are generated when we load up the packages. Save the file and in the Console, run:
tutorial.helpers::show_file("analysis.qmd")
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Discussing covariates in the context of the Preceptor Table is different than discussing covariates in the context of the data. Recall that the Preceptor Table is the smallest possible table, so we don’t need to include every relevant variable, we only need the ones that are necessary to answer the question.
When we render the file, it also shows the code we used to load the packages. Add #| echo: FALSE
to that chunk to prevent the code from appearing in the rendered document. Save and render the file again, and the code will go away.
In the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
The data we have may not allow us to answer that question, but it may be enough to answer a related question. Is that good enough for the boss/client/colleague who asked the original question? Maybe? You won’t know until you ask.
Always load up the data to see what we have to answer our question. Type trains
and hit "Run Code".
trains
trains
If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.
Pipe trains
to select(att_end, treatment)
and then to summary()
.
trains |> select(..., treatment) |> ...()
trains |> select(att_end, treatment) |> summary()
The attitude toward immigration after the experiment (att_end) ranges from 3 to 15, with a median of 9 and a mean of about 9.1. Notice that there are 51 people in the treatment group and 64 people in the control group, a reasonable split for comparing the effects of the treatment.
Turning to our analysis.qmd
, add a new code chunk. In this code chunk, type trains
to load the data set. In the Console, run:
tutorial.helpers::show_file("analysis.qmd")
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Recall the key components of Wisdom: we first start with a question and determine our Preceptor Table. Next, we load the data we have and examine it using the concept of "validity".
Within that code chunk, pipe trains
to select(att_end, treatment)
and then assign the result to an object called ch6
. In the Console, run:
tutorial.helpers::show_file("analysis.qmd")
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
As the outcome we care about is the attitude toward immigration at the end of the experiment for both the treatment and the control group, we only need to keep these two variables.
As we render the file, the code shows up again. Rather than adding #| echo: FALSE
to every code chunk, add the following to the YAML header:
execute: 
  echo: false
You can delete #| echo: FALSE
in the libraries code chunk as it is no longer needed. Save the file and in the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
The central point to remember is that we have two (potentially!) completely different things: the Preceptor Table (what we need to answer our question) and the data (what we have). Both data sets may have the same columns (e.g. attitudes towards immigration), but it does not mean that they are the same thing. They will often be quite different!
In your own words, define "validity" as we use the term.
question_text(NULL, message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.
Provide one reason why the assumption of validity might not hold for the outcome variable: att_end
. Use the words "column" or "columns" in your answer.
question_text(NULL, message = "The data was measured in 2012. Would the column att_end still apply now? The aspects of immigration policy which we are most interested in have changed. Politics in America has changed a great deal, not least with the Trump presidency.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.
Provide one reason why the assumption of validity might not hold for the covariate: treatment
. Use the words "column" or "columns" in your answer.
question_text(NULL, message = "The treatment variable in our data is the exposure to Spanish speakers, but would we be able to recreate the same exact treatment today? Probably not! Outside of scientific experiments, it is almost never the case that the treatment variable in the data will perfectly match the treatment variable in the Preceptor Table. ", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.
Summarize the state of your work so far in one or two sentences. Make reference to the data you have and to the question you are trying to answer.
question_text(NULL, message = "Using data from a 2012 survey of Boston-area commuters, we seek to measure the causal effect of exposure to Spanish-speakers on attitudes toward immigration among adults in Chicago and similar cities in 2024.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Going back to analysis.qmd
, at the end of the file, type your adjusted answer to this question based on our suggestion. Don't forget to save the file.
Justice is truth in action. - Benjamin Disraeli
In your own words, name the four key components of Justice for working on a data science problem.
question_text(NULL, message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.
In your own words, define a Population Table.
question_text(NULL, message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units, since the Preceptor Table and the data are, themselves, often far apart from each other.
The Population Table looks like this:
#| echo: false
tibble(source = c("...", "Data", "Data", "...", "...", "Preceptor Table", "Preceptor Table", "..."),
       att_treat = c("...", "7*", "6", "...", "...", "...", "...", "..."),
       att_control = c("...", "2", "10*", "...", "...", "...", "...", "..."),
       city = c("...", "Boston, MA", "Boston, MA", "...", "...", "Chicago, IL", "Chicago, IL", "..."),
       year = c("...", "2012", "2012", "...", "...", "2024", "2024", "..."),
       treatment = c("...", "No", "Yes", "...", "...", "...", "...", "...")) |>
  gt() |>
  tab_header(title = "Population Table") |>
  cols_label(source = md("Source"),
             att_treat = md("Treated"),
             att_control = md("Controlled"),
             treatment = md("Treatment"),
             city = md("City"),
             year = md("Year")) |>
  tab_style(cell_borders(sides = "right"),
            location = cells_body(columns = c(source))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"),
            locations = cells_column_labels(columns = c(source))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(source)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Potential Outcomes",
              columns = c(att_control, att_treat)) |>
  tab_spanner(label = "Covariates", columns = c(treatment, year, city))
In your own words, define the assumption of "stability" when employed in the context of data science.
question_text(NULL, message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability is all about time. Is the relationship among the columns in the Population Table stable over time? In particular, is the relationship --- which is another way of saying "mathematical formula" --- at the time the data was gathered the same as the relationship at the (generally later) time referenced by the Preceptor Table?
Provide one reason why the assumption of stability might not be true in this case.
question_text(NULL, message = "In this case, US politics has changed a great deal since 2012, especially in regard to immigration. Immigration is much more salient now than it was then, so it is likely that the effect of the treatment would be very different today.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
However, if we don’t assume stability then we can’t use data from 2012 to inform our inferences about 2024. So, we assume it.
We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table
In your own words, define the assumption of "representativeness" when employed in the context of data science.
question_text(NULL, message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.
We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "Our data is not a random draw from the underlying population: it includes only Boston-area commuters in 2012, who may differ in important ways from the broader population of adults in other cities and years.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The reason that representativeness is important is because, when it is violated, the estimates for the model parameters might be biased.
We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "The Preceptor Table might not be representative of the population at this moment because it covers only people who use train platforms in Chicago. Not everyone takes the train: richer people, for example, may travel by taxi or car.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.
question_text(NULL, message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates. A model is *confounded* if this is not true.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.
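As a quick illustration, random assignment takes only a few lines of base R. The sample size and seed below are made up for illustration; nothing here comes from the trains data.

```r
# A minimal sketch of random treatment assignment, assuming a
# hypothetical sample of 10 units. The seed is arbitrary.
set.seed(27)
treatment <- sample(c("Treated", "Control"), size = 10, replace = TRUE)

# Because assignment depends only on the random draw, it cannot be
# correlated with potential outcomes or any covariate.
table(treatment)
```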
Provide one reason why the assumption of unconfoundedness might not be true (or relevant) in this case.
question_text(NULL, message = "In this case, the assumption of unconfoundedness might not be true if the participants in the treatment and control group are not truly randomly selected, but selected by a person choosing 'randomly' (which is likely not truly random).", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The great advantage of randomized assignment of treatment is that it guarantees unconfoundedness. There is no way for treatment assignment to be correlated with anything, including potential outcomes, if treatment assignment is random.
Summarize the state of your work so far in two or three sentences. Make reference to the data you have and to the question you are trying to answer. Feel free to copy from your answer at the end of the Wisdom Section. Mention at least one specific problem which casts doubt on your approach.
question_text(NULL, message = "Using data from a 2012 survey of Boston-area commuters, we seek to measure the causal effect of exposure to Spanish-speakers on attitudes toward immigration among adults in Chicago and similar cities in 2024. There is some concern that the relationship has changed since our data was collected.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Update the summary paragraph in analysis.qmd accordingly. Do not just copy/paste our example answer, obviously!
Courage is going from failure to failure without losing enthusiasm. - Winston Churchill
In your own words, describe the components of the virtue of Courage for analyzing data.
question_text(NULL, message = "Courage starts with math, explores models, and then creates the data generating mechanism.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.
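In generic form (a sketch for a continuous outcome with one predictor, before fitting anything), the two parts look like this:

$$y_i \sim \mathcal{N}(\mu_i, \sigma^2)$$

$$\mu_i = \beta_0 + \beta_1 x_i$$

The first line is the probability family; the second is the (identity) link function, which connects the parameter $\mu_i$ to the predictor $x_i$.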
Load the tidymodels package.
library(...)
library(tidymodels)
Because att_end
is a continuous variable, we assume that an individual's attitude toward immigration is produced from a Normal distribution.
$$\text{att\_end}_i \sim \mathcal{N}(\mu, \sigma^2)$$
Load the broom package.
library(...)
library(broom)
Because the outcome variable has a Normal distribution, the link function is the identity. That is:
extract_eq(fit_att, intercept = "beta")
Load the equatiomatic package.
library(...)
library(equatiomatic)
Recall that a continuous variable like att_end
represents an individual's attitude toward immigration. Because att_end
is a continuous outcome, it is modeled directly, with no need to convert it into dummy variables. We simply assume that it follows a Normal distribution:
Add library(tidymodels)
, library(broom)
, and library(equatiomatic)
to the setup
code chunk in analysis.qmd
. Then Command/Ctrl + Shift + K
to render.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "tidymodels|broom|equatiomatic")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
We need to provide a mathematical formula for our model to work on, using the two variables that are necessary to answer this question: att_end
and treatment
.
Because our outcome variable is continuous, start to create the model by using linear_reg(engine = "lm")
.
linear_reg(engine = "lm")
linear_reg(engine = "lm")
In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.
Continue the pipe to fit(att_end ~ treatment, data = trains)
.
... |> fit(..., data = ...)
linear_reg(engine = "lm") |> fit(att_end ~ treatment, data = trains)
We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:
extract_eq(fit_att, intercept = "beta", use_coefs = TRUE)
Behind the scenes of this tutorial, an object called fit_att
has been created which is the result of the code above. Type fit_att
and hit "Run Code." This generates the same results as using print(fit_att)
.
fit_att
fit_att
The code formula includes treatment, a factor variable with two possible values: "Treated" and "Control".
The math formula includes treatment as a 0/1 dummy variable, with "Treated" coded as 1.
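To see how R turns that factor into a 0/1 variable, we can inspect the design matrix directly. The toy vector below is made up; only the level names mirror the trains data.

```r
# Hypothetical two-level factor, mirroring the treatment variable.
treatment <- factor(c("Control", "Treated", "Treated", "Control"),
                    levels = c("Control", "Treated"))

# model.matrix() shows the matrix lm() actually fits: an intercept
# column plus a treatmentTreated column of 0s and 1s.
model.matrix(~ treatment)
```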
Create a new code chunk in analysis.qmd
. Add two code chunk options: label: model
and cache: true
. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_att
.
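The chunk might look something like this (a sketch; it assumes the setup chunk has already loaded tidymodels and primer.data):

```r
#| label: model
#| cache: true
fit_att <- linear_reg(engine = "lm") |>
  fit(att_end ~ treatment, data = trains)
```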
Command/Ctrl + Shift + K
. It may take some time to render analysis.qmd
, depending on how complex your model is. But, by including cache: true
you cause Quarto to cache the results of the chunk. The next time you render analysis.qmd
, as long as you have not changed the code, Quarto will just load up the saved fitted object.
To confirm, Command/Ctrl + Shift + K
again. It should be quick.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 8)
Add *_cache
to .gitignore
file. Commit and push. Cached objects are often large. They don't belong on GitHub.
Create another code chunk in analysis.qmd
. Add the chunk option: label: math
. In that code chunk, add something like the below. You may find it useful to add the coef_digits
argument to show fewer significant digits after the decimal.
extract_eq(fit_att, intercept = "beta", use_coefs = TRUE)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "extract")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
When you render your document, this formula will appear.
extract_eq(fit_att, intercept = "beta", use_coefs = TRUE)
This is our data generating mechanism.
The predicted attitude toward immigration (att_end) is calculated by starting with a baseline value of 8.45. If an individual is part of the treated group (where treatment=1), then 1.55 is added to this baseline. If the individual is not part of the treated group (where treatment=0), the prediction remains at 8.45.
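The arithmetic can be sketched in a couple of lines, using the rounded coefficients quoted above (the function name is ours, for illustration):

```r
# Sketch of the fitted equation, using the rounded estimates from
# the model above: intercept 8.45, treatment effect 1.55.
predict_att <- function(treated) 8.45 + 1.55 * treated

predict_att(0)  # control prediction: 8.45
predict_att(1)  # treated prediction: 10.00
```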
Run tidy()
on fit_att
with the argument conf.int
set equal to TRUE
. This returns 95% confidence intervals for all the parameters in our model.
tidy(..., conf.int = ...)
tidy(fit_att, conf.int = TRUE)
Why is it important to consider the confidence interval (the 2.5% and 97.5% values), and what does it convey to us?
question_text(NULL, message = "The confidence interval provides a range of values within which we can be 95% confident the true value lies. Here, we are 95% confident that the true value lies between 7.77 and 9.14. This is crucial because, while the estimate itself is a single number, the confidence interval reflects the uncertainty around it. Narrow intervals suggest higher precision; wider intervals indicate more uncertainty. Considering the confidence interval therefore helps us judge the reliability of the estimate.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
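The interval arithmetic can be sketched by hand. The standard error below is an assumed value, back-computed from the rounded numbers quoted above (estimate divided by statistic), not read from the tidy() output.

```r
# Sketch: a 95% confidence interval is roughly the estimate plus or
# minus about two standard errors. The standard error is an assumed,
# back-of-the-envelope value: estimate / statistic.
estimate  <- 8.45
std_error <- 8.45 / 24.3   # roughly 0.35

lower <- estimate - 1.96 * std_error
upper <- estimate + 1.96 * std_error
round(c(lower, upper), 2)  # close to the 7.77 and 9.14 quoted above
```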
What do the p-values indicate in the context of this estimate, and what is the difference between the p-value of the (Intercept) and the treatmentTreated?
question_text(NULL, message = "In general, a p-value tells us how surprising our results would be if the null hypothesis were true. Here the null hypothesis is one of two things: (1) that the true value of the intercept is 0, or (2) that the treatment has no effect, so the coefficient for treatment is 0. The smaller the p-value, the stronger the evidence against the null hypothesis. A common threshold is 0.05. A p-value above that does not prove there is no effect; it only means we lack sufficient evidence to reject the null hypothesis, so the results might be due to random chance. The tiny p-value for the (Intercept) means it is extremely unlikely that the intercept is zero. The small p-value for treatmentTreated likewise means there is little chance that the treatment coefficient is zero, i.e., the treatment does appear to have an effect.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
What do the statistic values from the table indicate in this context, and what type of statistic value are we using here?
question_text(NULL, message = "The test statistic quantifies the size of the effect or difference observed in the data relative to the null hypothesis. It indicates how unusual or extreme the sample results are under the assumption that the null hypothesis is true. The type of test statistic depends on the test being used: a z-statistic for large samples or known population variance, a t-statistic for smaller samples with unknown variance (as in linear regression), a chi-square statistic for categorical data, and an F-statistic in ANOVA for comparing variances across groups. In our case, tidy() reports t-statistics. The statistic for the (Intercept) (24.3) indicates that the intercept estimate (8.45) is about 24 standard errors away from zero, so the intercept is clearly different from zero. The statistic for treatmentTreated (2.97) indicates that the treatment effect estimate (1.55) is about 3 standard errors away from zero, suggesting the treatment effect is significantly different from zero.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
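As a back-of-the-envelope sketch of where the statistic and p-value come from: the standard error and degrees of freedom below are assumed values (the standard error is back-computed from the rounded numbers quoted above; the trains tibble has 115 rows, so df would be roughly 113), not numbers read from the output.

```r
# Sketch: statistic = estimate / std.error, and the two-sided p-value
# comes from the t distribution. Standard error and df are assumed.
estimate  <- 1.55
std_error <- 1.55 / 2.97   # back-computed, roughly 0.52
df        <- 113           # assumed: 115 observations - 2 parameters

t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df)

c(t_stat, p_value)
```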
Create a new code chunk in analysis.qmd
. Add a code chunk option: label: table
. Add this code to the code chunk.
tidy(fit_att, conf.int = TRUE)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "tidy")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add two sentences to your project summary.
First, mention a weakness in your model, derived from the questions above about the key assumptions of a data science problem.
Second, explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."
Recall the beginning of our version of the summary:
Using data from a 2012 survey of Boston-area commuters, we seek to measure the causal effect of exposure to Spanish-speakers on attitudes toward immigration among adults in Chicago and similar cities in 2024. There is some concern that the relationship has changed since our data was collected.
question_text(NULL, message = "We modeled a person's attitude toward immigration, measured on a 3 to 15 integer scale, as a linear function of treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to the summary paragraph portion of your QMD. Command/Ctrl + Shift + K
, and then commit/push.
Temperance is a bridle of gold; he, who uses it rightly, is more like a god than a man. - Robert Burton
In your own words, describe the use of Temperance in finishing your data science project.
question_text(NULL, message = "Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest.
Load the marginaleffects package.
library(...)
library(marginaleffects)
We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.
What is the general topic we are investigating? What is the specific question we are trying to answer?
question_text(NULL, message = "The general topic is attitudes toward immigration. The specific question: what is the causal effect of exposure to Spanish-speakers on attitudes toward immigration, relative to a control condition with no such exposure?", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic broadly.
Enter this code into the exercise code block and hit "Run Code."
plot_predictions(fit_att, condition = "treatment")
plot_predictions(fit_att, condition = "treatment")
plot_predictions(fit_att, condition = "treatment")
For the treated group, the analysis yields an estimate of 10.00 with a 95% confidence interval ranging from 9.24 to 10.76. This interval suggests that the true value is very likely to fall within this range, indicating a relatively narrow band of uncertainty around the estimate.
Add library(marginaleffects)
to the analysis.qmd
setup code chunk.
Create a new code chunk. Label it with label: plot
. Copy/paste the code which creates your graphic.
Command/Ctrl + Shift + K
to ensure that it all works as intended.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Write the last sentence of your summary paragraph. It describes at least one quantity of interest (QoI) and provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)
question_text(NULL, message = "The average causal effect of treatment was about 1.5, with a 95% confidence interval of 0.5 to 2.5. For context, the difference in attitude between Democrats and Republicans is about 1.7. So, the causal effect of 1.5 means that we would expect a treated Democrat to become almost as conservative on immigration as a typical Republican.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a final sentence to your summary paragraph in your QMD as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
.
Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.
question_text(NULL, message = "In our model we assumed that the effect of exposing people to Spanish-speakers on their attitudes toward immigration was the same in 2012 as in 2024. That assumption is almost certainly false, so the estimates might be wrong. A better estimate might be about 1.2, reflecting some attenuation of the effect over time, with a wider confidence interval of 0 to 3.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Rearrange the material in your QMD so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K
to ensure that everything works.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This is the version of your QMD file which your teacher is most likely to take a close look at.
Publish your rendered QMD to Rpubs. Choose a sensible slug. Copy/paste the url below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add rsconnect
to the .gitignore
file. You don't want your personal Rpubs details stored in the clear on Github. Commit/push everything.
In three-parameter causal models, we can never directly observe both potential outcomes of a causal relationship: what happens if a certain factor is present and what happens if it is absent. However, by using statistical methods and assumptions, these models allow us to estimate the effects of each scenario, helping us get closer to understanding the true impact of different variables. Using data from the trains
tibble in primer.data
, we explored the relationship between attitudes toward immigration and exposure to Spanish-speakers.
This tutorial covered Chapter 6: Three Parameters: Causal of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.