In PPBDS/primer.tutorials: Tutorials for Preceptor's Primer for Bayesian Data Science

library(learnr)
library(tutorial.helpers)
library(gt)

library(tidyverse)
library(tidymodels)
library(broom)
library(marginaleffects)
library(primer.data)
library(equatiomatic)
library(tidytext)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local") 

# Creates a model used for a plot in Wisdom-6

fit_arrested <- linear_reg() |>
    set_engine("lm") |>
    fit(arrested ~ zone, data = stops)

# Creates new df with 4 entries for race & converts sex and race to be capitalized

x <- stops |>
  filter(race %in% c("black", "white")) |>
  mutate(race = str_to_title(race), 
         sex = str_to_title(sex))

# Store the model from Courage

fit_stops <- linear_reg() |>
    set_engine("lm") |>
    fit(arrested ~ sex + race*zone, data = x)

Introduction

This tutorial is best understood when done after reading Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane. It covers the stops dataset from the primer.data package.

The Question

The power to question is the basis of all human progress. - Indira Gandhi

Exercise 1

Load tidyverse.

library(tidyverse)

The data that we will use was sourced from the Open Policing project. Based at Stanford University, the project aims to improve police accountability and transparency by providing data on traffic stops across the United States. They have many downloadable datasets, and our data is specifically derived from their New Orleans dataset.

Exercise 2

Load the primer.data package.

library(primer.data)

A version of the data from the Open Policing project is available in the stops tibble.

Exercise 3

After loading primer.data in your Console, type ?stops in the Console, and paste in the Description below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

stops contains data from over 400,000 traffic stops in New Orleans from July 1, 2011 to July 18, 2018. The dataset includes information about the date, time, and location of each stop, as well as demographic details about the driver and the outcomes of the stop.

Exercise 4

Arrests in traffic stops are the broad topic of this tutorial. Given that topic, which variable in stops should we use as our outcome variable?

question_text(NULL,
    message = "We should be using the `arrested` variable.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

arrested is a binary variable indicating whether or not an arrest was made during the traffic stop.

Exercise 5

Let's imagine a brand new treatment variable which does not exist in the data. This variable should be binary, meaning that it only takes 2 values (TRUE/FALSE, etc.). It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome and if it is "Y," or whatever, it generates another potential outcome.

How might we manipulate this variable?

question_text(NULL,
    message = "Imagine a variable called `mask`, indicating whether or not the person is wearing a mask. We can manipulate this, at least in theory, by giving out masks to half the motorists that we plan on studying, and seeing if the mask affects their chances of getting arrested.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

There can be many good answers for a question like this. However, for the rest of this section, you should stick with our treatment variable of mask.

Recall that, when a treatment variable is manipulated, we look at the difference in arrest rates to see whether or not wearing a mask results in a higher chance of getting arrested.

All of this stuff is a conceptual framework we apply to the data. It is never inherent in the data itself.

The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables.

Exercise 6

Essentially, we are asking:

What is the causal effect of wearing a mask on getting arrested?

Given this choice of treatment variable mask, how many potential outcomes are there for each arrest? Explain why.

question_text(NULL,
    message = "There are 2 potential outcomes because the treatment variable `mask` takes on 2 possible values: either wearing a mask or not wearing one.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Both a causal effect and a prediction are much fuzzier notions than you might think because there are so many of each, depending on how we aggregate across units.

The same data set can be used to create both causal and predictive models. We can just use different outcome variables and/or specify different treatment variables for the different models, although sometimes, even this isn't required and the same variables work for both types of models.

For a Causal model, any data set can be used as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.

Exercise 7

Write a sentence which speculates as to the value of the two different potential outcomes which we might observe in arrested for each person when we change the value of the treatment variable mask.

question_text(NULL,
    message = "For each person, there are only two possible values for arrested: 0 (meaning not arrested) and 1 (meaning arrested). So, there might be a person who, if she had been wearing a mask, would have been arrested, but, since she was not wearing a mask, she was not arrested.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The point of the Rubin Causal Model is that the definition of a causal effect is the difference between potential outcomes. So, there must be two (or more) potential outcomes for any causal model to make sense.

In our case, this is simple because our treatment variable is binary, so there are only two potential outcomes for each unit. But if the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.

Exercise 8

Write a few sentences which specify the following:

- two different values for the treatment variable for every arrest
- guesses at the potential outcomes which would result
- calculate the causal effect for every arrest given those guesses

question_text(NULL,
    message = "For a given arrest, assume that the value of the treatment variable might be `is wearing a mask` or `isn't wearing a mask`. If the person `is wearing a mask`, then the likelihood of getting arrested would be 2%. If the person `isn't wearing a mask`, then the likelihood of getting arrested would be 15%. The causal effect on the outcome of a treatment of `mask = TRUE` versus `mask = FALSE` is 2% - 15%, i.e., the difference between two potential outcomes, which equals -13 percentage points.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Notice that our causal effect is negative. The sign depends on the order in which we subtract the potential outcomes: here we take the outcome with a mask minus the outcome without one. Reversing the order would give a positive number of the same size.

Here is our causal effect statement:

The likelihood of getting arrested during a traffic stop in New Orleans is reduced by 13 percentage points if the person is wearing a mask.
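
The arithmetic behind this statement can be sketched in a few lines of R. The two percentages are the guesses from the example answer, not estimates from the stops data, and the object names are hypothetical.

```r
# Guessed potential outcomes (percent chance of arrest) from the example
# answer; these are assumptions for illustration, not estimates from `stops`.
p_mask    <- 2   # potential outcome if the driver wears a mask
p_no_mask <- 15  # potential outcome if the driver does not wear a mask

# Causal effect of mask = TRUE versus mask = FALSE, in percentage points.
effect_mask_vs_none <- p_mask - p_no_mask

# Reversing the order of subtraction flips the sign but not the magnitude.
effect_none_vs_mask <- p_no_mask - p_mask

print(c(mask_vs_none = effect_mask_vs_none,
        none_vs_mask = effect_none_vs_mask))
```

Which direction to report is a choice; what matters is stating clearly which potential outcome is subtracted from which.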

Exercise 9

Let's consider a predictive model. Which variable in stops do you think might have an important connection to arrested? (If you don't see a reasonable variable in the data, you can just name a variable which might have been included in the data.)

question_text(NULL,
    message = "Let's consider the variable `mask` again. Note that now, instead of being a treatment, it's our key covariate, whose connection to the outcome variable we most want to explore.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 2)

If you don't care what Joe would have done in a counter-factual world in which he got a different treatment, if all you care about is predicting what Joe does given the treatment he received, then you just need a predictive model.

Using the same variable allows us to see the true differences between the two model types, based on similar questions and the same variables. Of course, you may have said something else, and that is completely fine, but we should stick with mask for the rest of this section.

Exercise 10

Write a few sentences which specify two different groups of traffic stops with different values for mask. Explain how the average value of arrested might differ between these two groups of traffic stops.

question_text(NULL,
    message = "Some traffic stops might have a value for `mask` of `TRUE`. Others might have a value of `FALSE`. Those two groups will, on average, have different values for `arrested`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The key point is that, with a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be treatment variables. We are assuming that all covariates are "fixed." In that case, we should not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.
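
A predictive comparison of this kind can be sketched with a tiny made-up data frame. The data below is hypothetical (there is no `mask` column in stops), and the sketch uses base R only so that it runs without any packages.

```r
# Hypothetical traffic stops: `mask` is a fixed covariate, `arrested` is 0/1.
toy <- data.frame(
  mask     = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  arrested = c(0, 0, 0, 1, 0, 1, 1, 0)
)

# Average value of `arrested` within each group of stops.
mean_mask    <- mean(toy$arrested[toy$mask])
mean_no_mask <- mean(toy$arrested[!toy$mask])

# A difference between groups, not a causal effect: each row has one outcome.
diff_in_means <- mean_mask - mean_no_mask

print(c(mask = mean_mask, no_mask = mean_no_mask, difference = diff_in_means))
```

Each row contributes exactly one observed outcome, so the result describes how two groups differ, without any claim about what would happen if a given driver's value of `mask` were changed.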

Exercise 11

Write a predictive question which connects the outcome variable arrested to race, the covariate of interest for the rest of this tutorial.

question_text(NULL,
    message = "What is the difference in arrest rate between Black and White drivers adjusting for other variables?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This is going to be our question for today. With a predictive model, your question should focus on a comparison between different rows, or groups of rows, in the Preceptor Table.

Exercise 12

What is a Quantity of Interest which might help us to explore the answer to our question?

question_text(NULL,
    message = "The probability of arrest for a Black driver.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.

Wisdom

The only true wisdom is in knowing you know nothing. - Socrates

Recall our question:

What is the difference in arrest rate between Black and White drivers adjusting for other variables?

Exercise 1

In your own words, describe the key components of Wisdom when working on a data science problem.

question_text(NULL,
    message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Open Policing project also has visual maps regarding traffic stops. Sadly, there isn't one for New Orleans, but observe how different parts of Hartford, CT pull over different races of people!

Exercise 2

Define a Preceptor Table.

question_text(NULL,
    message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantities of interest.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.

Exercise 3

Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."

question_text(NULL,
    message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 4

Create a GitHub repo called stops. Make sure to click the "Add a README file" check box.

Connect the stops GitHub repo to an R project on your computer. Name the R project stops also.

Select File -> New File -> Quarto Document .... Provide a title ("Stops") and an author (you). Render the document and save it as stops.qmd.

Edit the .gitignore by adding *Rproj. Save and commit this in the Git tab. Push the commit.

In the Console, run:

show_file(".gitignore")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Remove everything below the YAML header from stops.qmd and render the file. Command/Ctrl + Shift + K renders the file and automatically saves it as well.

Exercise 5

What are the units for this problem?

question_text(NULL,
    message = "The units are the races of the people that have been pulled over.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

If our question had concerned the percentage of people that got arrested from traffic stops, then our units would have been the individual people. However, since it concerns broad groups of people, it only makes sense for our units to be those broad groups (the races).

Exercise 6

What is the outcome variable for this problem?

question_text(NULL,
    message = "Our outcome variable is `arrested`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

plot_predictions(fit_arrested,
                      newdata = expand_grid(zone = unique(x$zone)),
                      condition = c("zone")) +
  coord_flip() +
  scale_x_reordered() +
  labs(x = "Zone", 
       y = "Estimated Arrest Probability", 
       title = "Predicted Arrest Rate of New Orleans Motorists by Zones",
       subtitle = "The bars represent the 95% Confidence Intervals.",
       caption = "Data from the Open Policing Project")

Regardless, the central lesson is always the same: You can never look at your data too much.

Exercise 7

What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?

question_text(NULL,
    message = "We obviously need `race`. However, other characteristics like the person's age, sex, and the type of car they're driving might also be connected to `arrested`. It would also be beneficial to see if arrest rates differ by the zone of the officer, so we should also use the variable `zone`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

For your information, the term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables which we have data for. Third, it is the set of covariates which we end up using in the model.

Exercise 8

What are the treatments, if any, for this problem?

question_text(NULL,
    message = "Our key covariates would be `race`, `sex`, `age`, and `zone`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Note that because we're creating a Predictive model, there is no treatment variable per se. However, we still have key covariates.

Exercise 9

What moment in time does the Preceptor Table refer to?

question_text(NULL,
    message = "Our Preceptor Table should represent traffic stops somewhere about the late 2000s to the mid 2010s for maximum compatibility with our data.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 10

Define causal effect.

question_text(NULL,
    message = "A causal effect is the difference between two potential outcomes.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 11

What is the fundamental problem of causal inference?

question_text(NULL,
    message = "The fundamental problem of causal inference is that we can only observe one potential outcome.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)
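
The fundamental problem can be made concrete with a small table of made-up potential outcomes. Everything below is hypothetical: the `mask` treatment, the four drivers, and the column names are assumptions for illustration only.

```r
# Hypothetical potential outcomes for four drivers (1 = arrested, 0 = not).
po <- data.frame(
  id        = 1:4,
  y_mask    = c(0, 1, 0, 0),               # outcome if wearing a mask
  y_no_mask = c(1, 1, 0, 1),               # outcome if not wearing a mask
  mask      = c(TRUE, FALSE, TRUE, FALSE)  # treatment actually received
)

# Unit-level causal effects: computable here only because we invented
# both potential outcomes for every driver.
po$effect <- po$y_mask - po$y_no_mask

# In real data, only the outcome under the received treatment is observed;
# the other potential outcome is missing (NA).
obs_mask    <- ifelse(po$mask, po$y_mask, NA)
obs_no_mask <- ifelse(po$mask, NA, po$y_no_mask)

print(data.frame(id = po$id, mask = po$mask, obs_mask, obs_no_mask))
```

In the printed table every row has exactly one `NA`, which is why a unit-level causal effect can never be computed directly from observed data.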

Exercise 12

How does the motto "No causal inference without manipulation." apply in this problem?

question_text(NULL,
    message = "The motto does not apply because this is a predictive, not causal, model.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 13

Describe in words the Preceptor Table for this problem.

question_text(NULL,
    message = "On the far left, our Preceptor Table must have some way of identifying the people. Then, we would have a column called `arrested` (our outcome variable) and following this would be our covariate columns: `race`, `sex`, and `zone`.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Preceptor Table for this problem looks something like this:

tibble(ID = c("1", "2", "...", "10", "11", "...", "105,852,176"),
       arrested = c("TRUE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE"),
       sex = c("M", "F", "...", "F", "F", "...", "M"),
       race = c("Black", "White", "...", "Black", "White", "Black", "Black"),
       zone = c("...", "B", "M", "...", "W", "...", "A")) |>
  gt() |>
  tab_header(title = "Preceptor Table") |>
  cols_label(ID = md("ID"),
             arrested = md("Arrested"),
             sex = md("Sex"),
             race = md("Race"),
             zone = md("Zone")) |>
  tab_style(cell_borders(sides = "right"),
            locations = cells_body(columns = c(ID))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"),
            locations = cells_column_labels(columns = c(ID))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(ID)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Outcome", columns = c(arrested)) |>
  tab_spanner(label = "Covariate", columns = c(sex, race, zone))

Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem. For example, at the start, we aren't sure what right-hand side variables will be included in the model, so we are not yet sure which covariates must be in the Preceptor Table.

Exercise 14

Write one sentence describing the data you have to answer your question.

question_text(NULL,
    message = "Our data comes from the Open Policing Project and was collected from July 1, 2011 through July 18, 2018. It contains roughly 400,000 entries, with crucial information regarding the person's age, sex, and race, and whether or not they were arrested.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Your sentence doesn't have to be as long or in-depth as the answer, but your response should still contain the crucial information.

Exercise 15

In stops.qmd, load the tidyverse and the primer.data packages in a new code chunk. Label the chunk by adding #| label: setup. Render the file.

Notice that the file does not look good because it has code that is showing and it also has messages. To take care of this, add #| message: false to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all echo from the whole file:

execute: 
  echo: false

In the Console, run:

show_file("stops.qmd", start = -5)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Render again. Everything looks nice because we have added code to make the file look better and more professional.

Exercise 16

Load the tidyverse package.

library(...)

library(tidyverse)

Exercise 17

Let's get to know our data.

Start by exploring our dataset by typing stops into the Console. Ensure that the tidyverse package is loaded in the Console; without it, the full dataset would be displayed.

stops

In this first glimpse, I only see two races present. I wonder if this dataset contains more...

Exercise 18

To find out if we have more than 2 races, run the following command in the box below.

table(stops$race)

By using $ to select a single variable, table() lists each unique value in that column along with the number of times it appears.
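
The behavior of table() is easy to see on a tiny made-up vector. The values below are hypothetical and are not taken from stops.

```r
# Hypothetical race values for five traffic stops.
toy <- data.frame(
  race = c("black", "white", "black", "hispanic", "black"),
  stringsAsFactors = FALSE
)

# `$` pulls out one column; table() counts how often each value appears.
counts <- table(toy$race)
print(counts)

# Individual counts can be looked up by name.
n_black <- as.integer(counts["black"])
print(n_black)
```

The names of the result are the unique values in the column, sorted alphabetically.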

Exercise 19

For our model, let's just focus on the entries for black and white (due to the low number of observations for all other races).

To do this, start by piping stops to filter(). Then, inside filter(), use the argument race %in% c("black", "white").

... |>
filter(race %...% c("black", "..."))

stops |>
filter(race %in% c("black", "white"))

Exercise 20

Additionally, it's important that we have the values for race and sex be capitalized. This is much easier when done initially in our modified data.

Continue the pipe to mutate(), in which you should set race to str_to_title(race) and sex to str_to_title(sex).

... |>
  mutate(... = str_to_title(...), 
         ... = str_to_title(...))

stops |>
  filter(race %in% c("black", "white")) |>
  mutate(race = str_to_title(race), 
         sex = str_to_title(sex))

Note that in this step, we are converting the strings to titles, so that they remain capitalized throughout our graph. This is important as later, we will have to capitalize "Black" to match what we did in this step.
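
To see what these two steps do without needing the full dataset, here is a base-R sketch on a made-up data frame. It is an assumption-laden stand-in for the filter() and mutate() pipeline, not the tutorial's own code: `cap_first()` is a hypothetical helper that mimics str_to_title() for single lowercase words only.

```r
# Hypothetical stand-in for `stops` with just the two columns we modify.
toy <- data.frame(
  race = c("black", "white", "asian", "white"),
  sex  = c("male", "female", "male", "female"),
  stringsAsFactors = FALSE
)

# Keep only rows whose race is one of the two listed values
# (the base-R counterpart of filter(race %in% c("black", "white"))).
kept <- toy[toy$race %in% c("black", "white"), ]

# Hypothetical helper: upper-case the first letter of each string.
cap_first <- function(s) paste0(toupper(substring(s, 1, 1)), substring(s, 2))

# Counterpart of mutate(race = str_to_title(race), sex = str_to_title(sex)).
kept$race <- cap_first(kept$race)
kept$sex  <- cap_first(kept$sex)

print(kept)
```

After this step the values are "Black" and "White", so any later code that refers to them must use the capitalized spelling.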

Exercise 21

We have saved this pipeline to the following object:

x <- stops |>
  filter(race %in% c("black", "white")) |>
  mutate(race = str_to_title(race), 
         sex = str_to_title(sex))

Create a new code chunk in stops.qmd. Add #| label: eda. Copy/Paste the code for the x object. Render and run tutorial.helpers::show_file("stops.qmd", chunk = "last"). CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

We will use this cleaned up data in Courage when we make our model.

Exercise 22

In your own words, define "validity" as we use the term.

question_text(NULL,
    message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.

Exercise 23

Provide one reason why the assumption of validity might not hold for the outcome variable: arrested. Use the words "column" or "columns" in your answer.

question_text(NULL,
    message = "Although we know that an arrest in our dataset means an actual arrest, we don't know if that's the case with the Preceptor Table. What if arrests in the Preceptor Table also include detainments, which aren't full arrests? In this case, we would see many more entries for arrests in the `arrested` column of the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 24

Provide one reason why the assumption of validity might not hold for the covariate zone.

question_text(NULL,
    message = "An instance where validity would not hold would be if the `zone` column is referring to different zones in both data sources. It could be that the zones are renamed in one, or that one is referring to completely different zones entirely.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 25

Summarize the state of your work so far in one sentence. Make reference to the data you have and to the specific question you are trying to answer.

question_text(NULL,
    message = "Using data from a study of New Orleans drivers, we seek to understand the relationship between driver race and the probability of getting arrested during a traffic stop. In particular, what is the probability of a Black motorist getting arrested during a traffic stop, and how does this compare to that of a White motorist?",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Edit your answer as you see fit, but do not copy/paste our answer exactly. Add this summary to stops.qmd, render with Command/Ctrl + Shift + K, and then commit/push.

Justice

Justice delayed is justice denied. - William E. Gladstone

Exercise 1

In your own words, name the four key components of Justice for working on a data science problem.

question_text(NULL,
    message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.

Exercise 2

In your own words, define a Population Table.

question_text(NULL,
    message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units since the Preceptor Table and the data are, themselves, often far apart from each other.

Exercise 3

In your own words, define the assumption of "stability" when employed in the context of data science.

question_text(NULL,
    message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Stability is all about time. Is the relationship among the columns in the Population Table stable over time? In particular, is the relationship (which is another way of saying "mathematical formula") at the time the data was gathered the same as the relationship at the (generally later) time referenced by the Preceptor Table?

Exercise 4

Provide one reason why the assumption of stability might not be true in this case.

question_text(NULL,
    message = "We don't know the exact time period of the Preceptor Table, so we don't know if any laws were passed between the time periods of both data sources that affected the basis of arrests during traffic stops. Therefore, we don't know if the arrests reported in the `arrested` column in both data sources were made on the same basis and whether or not both data sources had the same arrest rate.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 5

We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table. In your own words, define the assumption of "representativeness" when employed in the context of data science.

question_text(NULL,
    message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the data and the other rows. The second is between the other rows and the Preceptor Table.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Stability looks across time periods. Representativeness looks within time periods.

Exercise 6

We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.

question_text(NULL,
    message = "The dataset `stops` is a heavily modified version of the data from the actual study, and therefore has left out nearly 3.1 million entries from the real data, shortening it to roughly 400,000 entries. The deletion of the entries may have led to a misrepresentation of the population, in that a lot of the current data may only be from select areas with select conditions present, and could be from biased officers who are more likely to arrest drivers compared to other officers in the zone. This would distort our final predictions because it would be providing values that are unrealistic for the area, causing our entire model to be unrealistic.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Remember that our data and our Preceptor Table are two completely different datasets, and are only able to merge once we have identified that they both concern the same population.

How the data was collected is always a concern. If it was collected from different bodies or populations that didn't see eye to eye in key aspects, the data may be conflicting.

Exercise 7

We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.

question_text(NULL,
    message = "Our data may not have been collected fairly, in that some of the data could have just been collected from corrupt officers who would be more likely to arrest drivers than other officers in the zone. The opposite could be inferred as well. This, again, would distort our final predictions because it would be providing values that are unrealistic for the area, causing our entire model to be unrealistic.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 8

In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.

question_text(NULL,
    message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.
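
As a rough illustration of why random assignment helps, the simulation below compares a coin-flip assignment with one that depends on a covariate. Everything here (the `age` covariate, the sample size, the assignment rule) is made up for the example and has nothing to do with the `stops` data.

```r
set.seed(42)
n <- 10000

# Hypothetical pre-treatment covariate.
age <- rnorm(n, mean = 40, sd = 10)

# Random assignment: a coin flip that ignores age.
treated_random <- rbinom(n, size = 1, prob = 0.5)

# Confounded assignment: older units are more likely to be treated.
treated_confounded <- rbinom(n, size = 1, prob = plogis((age - 40) / 10))

# Difference in mean age between treated and control units.
diff_random <- mean(age[treated_random == 1]) - mean(age[treated_random == 0])
diff_confounded <- mean(age[treated_confounded == 1]) -
  mean(age[treated_confounded == 0])

c(random = diff_random, confounded = diff_confounded)
```

Under random assignment the two groups have nearly the same average age, while under the confounded rule the treated group is noticeably older, so any difference in outcomes could be due to age rather than treatment.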

Exercise 9

Summarize the state of your work so far in two or three sentences. Make reference to the data you have and to the question you are trying to answer. Feel free to copy from your answer at the end of the Wisdom Section. Mention one specific problem which casts doubt on your approach.

question_text(NULL,
    message = "Using data from a study of New Orleans drivers, we seek to understand the relationship between driver race and the probability of getting arrested during a traffic stop. However, our data from both our Preceptor Table and our dataset may not fully represent the population as both may not be from the same time frame and some of our data may come from biased officers, who may target certain groups of individuals.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Edit the summary paragraph in stops.qmd as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K, and then commit/push.

Courage

Courage is the commitment to begin without any guarantee of success. - Johann Wolfgang von Goethe

Recall that we are mainly comparing the difference between the two values of race (a character column) on our outcome variable arrested (an integer column). We will also be analyzing the effects of sex and zone, which are character variables, and age, which is an integer variable.

Exercise 1

In your own words, describe the components of the virtue of Courage for analyzing data.

question_text(NULL,
    message = "Courage starts with math, explores models, and then creates the data generating mechanism.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.

Exercise 2

Load the tidymodels package.

library(...)

library(tidymodels)

Because arrested is a binary variable, we assume that the outcome of getting arrested (or not) is produced from a Bernoulli distribution.

$$ arrested_i \sim Bernoulli(\rho) $$

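Note that the Bernoulli distribution is the special case of the binomial distribution with a single trial, which is why this family is often referred to as "binomial."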

Exercise 3

Load the broom package.

library(...)

library(broom)

Because we are using a Bernoulli distribution, the link function is logit. That is:

$$\rho = \frac{1}{1 + e^{-(\beta_0 +\beta_1 x_1 + \dots)}}$$
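
As a small sketch of the formula above, the inverse logit can be written by hand and compared with base R's `plogis()`. The coefficient and covariate values below are hypothetical and are not estimates from `fit_stops`.

```r
# Inverse logit written out by hand; mirrors the formula for rho.
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical coefficients and covariate value (not from fit_stops).
beta_0 <- -1.5
beta_1 <- 0.4
x_1 <- 1

eta <- beta_0 + beta_1 * x_1   # linear predictor
rho <- inv_logit(eta)          # probability, always between 0 and 1

# plogis() is the built-in inverse logit; qlogis() is the logit itself.
all.equal(rho, plogis(eta))
all.equal(qlogis(rho), eta)

# Simulate five Bernoulli outcomes with success probability rho.
set.seed(1)
rbinom(5, size = 1, prob = rho)
```

Whatever value the linear predictor takes, the inverse logit maps it into the interval between 0 and 1, which is what makes it a natural fit for a probability.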

Exercise 4

Load the equatiomatic package.

library(...)

library(equatiomatic)

Exercise 5

Add library(tidymodels), library(broom), library(equatiomatic), and library(tidytext) to the setup code chunk in stops.qmd. Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("stops.qmd", pattern = "tidymodels|broom|equatiomatic|tidytext")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy 0/1 variables. The presence of an intercept in most models means that we can't have $N$ categories. The "missing" category is incorporated into the intercept. In our case, zone has 25 values --- "A", "B", "C", and so on --- so the model creates 24 0/1 dummy variables (TRUE/FALSE), giving them names like zoneB and zoneC, and so on. The results for the first category (Zone A) are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied.
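
To see the dummy variables described above, one option is to inspect the design matrix with `model.matrix()`. This is a minimal sketch on a toy data frame with only three hypothetical zones, not the real `stops` data.

```r
# A toy categorical covariate with N = 3 values (hypothetical data).
toy <- data.frame(zone = c("A", "B", "C", "A", "C"))

# model.matrix() shows the columns that an lm() formula would create.
mm <- model.matrix(~ zone, data = toy)

# N - 1 = 2 dummy columns plus the intercept; zone A has no column of its own.
colnames(mm)
mm
```

Rows for zone A have a 0 in both dummy columns, so their fitted value is the intercept alone.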

Exercise 6

Because our outcome variable is binary, start to create the model by using linear_reg(engine = "lm").

linear_reg(engine = "...")

linear_reg(engine = "lm")

Note: This will give you an error that we will be fixing later.

In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.

Exercise 7

We will be using the following formula:

arrested ~ sex + race*zone

Continue the pipe to fit(), pasting in the formula, and then adding the argument data = x.

... |> 
  fit(arrested ~ ... + race*..., data = ...)

linear_reg() |>
  set_engine("lm") |>
  fit(arrested ~ sex + race*zone, data = x)

We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:

extract_eq(fit_stops$fit,
           intercept = "beta",
           wrap = TRUE,
           use_coefs = TRUE,
           terms_per_line = 2)

extract_eq(fit_stops$fit,
           intercept = "beta",
           wrap = TRUE,
           use_coefs = TRUE,
           terms_per_line = 2)

Exercise 8

Behind the scenes of this tutorial, an object called fit_stops has been created which is the result of the code above. Type fit_stops and hit "Run Code." This generates the same results as using print(fit_stops).

fit_stops

fit_stops

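The intercept is the expected value of arrested when every dummy variable equals 0. Because sexMale = 0 corresponds to females, the intercept describes female drivers in the reference categories of race and zone.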

Exercise 9

Create a new code chunk in stops.qmd. Add two code chunk options: label: model and cache: true. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_stops.

Command/Ctrl + Shift + K. It may take some time to render stops.qmd, depending on how complex your model is. But, by including cache: true you cause Quarto to cache the results of the chunk. The next time you render stops.qmd, as long as you have not changed the code, Quarto will just load up the saved fitted object.

To confirm, Command/Ctrl + Shift + K again. It should be quick.

At the Console, run:

tutorial.helpers::show_file("stops.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)

Exercise 10

Create another code chunk in stops.qmd. Add the chunk option: label: math. In that code chunk, add something like the following:

extract_eq(fit_stops$fit,
           intercept = "beta",
           wrap = TRUE,
           use_coefs = TRUE,
           terms_per_line = 2)

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("stops.qmd", pattern = "extract")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

When you render your document, this formula will appear.

extract_eq(fit_stops$fit,
           intercept = "beta",
           wrap = TRUE,
           use_coefs = TRUE,
           terms_per_line = 2)

This is our data generating mechanism.

The formula shows us how each parameter compares to our intercept, which is the first parameter. In this case, the intercept is 0.18, indicating that for females (the reference level of sex, which is absorbed into the intercept) the arrest rate is 18%. To find the arrest rate for White drivers, we subtract 0.04, giving 0.14, or 14%.
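
The arithmetic can be sketched in a few lines of R. The numbers are the rounded estimates quoted in this tutorial (0.18 for the intercept, -0.04 for raceWhite, 0.06 for sexMale). The sketch assumes the reference zone and ignores the interaction terms, so treat it as an illustration rather than a replacement for the fitted model.

```r
# Rounded estimates quoted in this tutorial (not re-estimated here).
intercept <- 0.18
race_white <- -0.04
sex_male <- 0.06

# Predicted value of `arrested` in the reference zone, ignoring interactions.
black_female <- intercept
white_female <- intercept + race_white
black_male <- intercept + sex_male
white_male <- intercept + race_white + sex_male

round(c(black_female = black_female, white_female = white_female,
        black_male = black_male, white_male = white_male), 2)
```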

Exercise 11

Run tidy() on fit_stops with the argument conf.int set equal to TRUE. This returns the 95% intervals for all parameters in our model.

tidy(..., conf.int = ...)

tidy(fit_stops, conf.int = TRUE)

Exercise 12

Write a sentence interpreting the ~0.06 estimate for sexMale.

question_text(NULL,
    message = "When comparing men with women, men have a 0.06 higher value for `arrested`, meaning that they are more likely to get arrested during a traffic stop relative to women, conditional on the other variables in the model.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The phrase "conditional on the other variables in the model" is important. It could be shortened to "conditional on the model." This phrase acknowledges that there are many, many possible models, just considering all the different combinations of independent variables we might include. Each one would produce a different coefficient for sexMale. None of these is the true coefficient.

Exercise 13

Between Zone D and Zone F, in which zone would the average person be more likely to get arrested during a traffic stop? In which zone would the expected average arrest rate be lower than the average of all females in New Orleans?

question_text(NULL,
    message = "We expect the average person to be more likely to get arrested in Zone D because its estimate is higher than Zone F's. Given that our Intercept represents the average arrest rate of all females in New Orleans, we expect that a person would be less likely to get arrested in Zone F than our Intercept suggests, because Zone F's estimate is negative.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Because we are Bayesians, we believe that there is a true value and that the confidence or credible or uncertainty interval includes it at the stated level. This is different from the Frequentist interpretation, which you can read about here.

Exercise 14

Write a sentence interpreting the -0.04 estimate for raceWhite.

question_text(NULL,
    message = "If we compare people who are White (figuratively the treated) with people who are Black (figuratively the control), the treated people have, on average, a 0.04 lower chance of getting arrested, adjusting for other individual characteristics.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

The key point is that there is no such thing as a causal (versus predictive) data set nor a causal (versus predictive) R code formula. You can use the same data set (and the same R code!) for both causal and predictive models. The difference lies in the assumptions you make.

Exercise 15

Create a new code chunk in stops.qmd. Add a code chunk option: label: table. Add this code to the code chunk.

tidy(fit_stops, conf.int = TRUE)

Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("stops.qmd", pattern = "tidy")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Exercise 16

Write a few sentences which summarize your work so far. The first few sentences are the same as what you had at the end of the Justice Section. Add one sentence which describes the modelling approach which you are using, specifying the functional form and the dependent variable. Add one sentence which describes the direction (not the magnitude) of the relationship between one of your independent variables and your dependent variable.

question_text(NULL,
    message = "Using data from a study of New Orleans drivers, we seek to understand the relationship between driver race and the probability of getting arrested during a traffic stop. However, our data from both our Preceptor Table and our dataset may not fully represent the population as both may not be from the same time frame and some of our data may come from biased officers, who may target certain groups of individuals. However, these concerns don't appear to be valid in a substantial manner in either dataset, allowing us to continue in our process. We modeled `arrested` as a linear function of both `sex` and the interaction of `race` and `zone`. From this, we found that Males are more likely to get arrested than Females.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Temperance

Temperance is a bridle of gold; he, who uses it rightly, is more like a god than a man. - Robert Burton

Exercise 1

In your own words, describe the use of Temperance in data science.

question_text(NULL,
    message = "Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest. We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

Exercise 2

Load the marginaleffects package.

library(...)

library(marginaleffects)

We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.

Exercise 3

Load the tidytext package.

library(...)

library(tidytext)

The tidytext package is designed for text mining using tidy data principles. Today we will use only one of its helper functions, reorder_within(), which makes reordering zones within each facet easier.

Exercise 4

What is the general topic we are investigating? What is the specific question we are trying to answer?

question_text(NULL,
    message = "We are generally investigating the likelihood of getting arrested during a traffic stop in New Orleans. We are specifically interested in the probability of arrest for a Black driver.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic broadly.

Exercise 5

To answer our question, we need to create an object --- call it ndata --- which we will pass in as a value to the newdata argument in whichever marginaleffects functions we decide to use. Which variables (e.g., which columns) do we need to include in this object?

question_text(NULL,
    message = "We need to include the variables `race`, `sex`, and `zone` in our `ndata` tibble.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

For most models, each row in ndata corresponds to a posterior derived from the values of the variables in that row. But that is just the posterior for one sort of unit. There are lots of different units! Which others might we be interested in? We can generate posteriors for each of them, and then, in some cool graphic, display all those posteriors together.

Exercise 6

Which values do you want the variables in your ndata object to have? This is not easy! At the very least, one or more of the rows should have values which allow you to answer your original question. But, now that you have a model, there are many questions which you might want to answer, the better to get a fuller understanding.

question_text(NULL,
    message = "We want `race` to contain both `Black` and `White`, `sex` to contain both `Male` and `Female`, and `zone` to contain all 25 zones in our data.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

However, we won't be making an ndata object today. Instead, we'll pass the value "balanced" to the newdata argument, which automatically builds a grid with every combination of the categorical predictors.

Functions like unique() --- to grab all the possible values in a variable --- and expand_grid() --- to create all possible combinations of different variables --- are often useful in creating ndata.

Exercise 7

Add library(marginaleffects) and library(tidytext) to the stops.qmd setup code chunk. Command/Ctrl + Shift + K.

At the Console, run:

tutorial.helpers::show_file("stops.qmd", pattern = "marginaleffects|tidytext")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

However, if we used data.frame() instead of expand_grid() to create an ndata object, we would get an error. This is because the race, sex, and zone columns are not equal in length. Instead of finding all the possible combinations, data.frame() lines the columns up row by row, which only works when their lengths are compatible. Example:

data.frame(race = unique(x$race),
           sex = unique(x$sex),
           zone = unique(x$zone))

When I run this, I get the following message because of the lengths:

message("Error in data.frame(race = unique(x$race), sex = unique(x$sex),  : 
  arguments imply differing number of rows: 2, 25")
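
By contrast, a minimal sketch of building such an object with expand_grid() might look like the following. The zone labels here are stand-ins (the first 25 capital letters), and the race and sex values are typed in by hand rather than pulled from x with unique().

```r
library(tidyr)

# Stand-in values: two races, two sexes, and 25 hypothetical zone labels.
race_vals <- c("Black", "White")
sex_vals <- c("Female", "Male")
zone_vals <- LETTERS[1:25]

# expand_grid() returns one row per combination: 2 * 2 * 25 = 100 rows.
ndata <- expand_grid(race = race_vals, sex = sex_vals, zone = zone_vals)
nrow(ndata)

# data.frame() cannot line up vectors of length 2 and 25, so it errors.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  data.frame(race = race_vals, sex = sex_vals, zone = zone_vals),
  error = function(e) conditionMessage(e)
)
result
```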

Exercise 8

Enter this code into the exercise code block and hit "Run Code."

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"))

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"))

Notice how we set newdata to "balanced", since we need predictions for every combination of zone, race, and sex.

Exercise 9

Although it's nice to graph our model in one function, it would be nice to have it in order. To do this, we first have to convert our predictions to a tibble.

Copy the previous code and add draw = FALSE to your plot_predictions() call. Then pipe it to as_tibble().

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_...()

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble()

Exercise 10

We can now proceed to order our data. I want to arrange the graph in ascending order for Black arrest rates by sex.

To do this, first we need to set a grouping. Pipe to group_by(), with the arguments zone and sex.

Next, pipe to mutate(), setting sort_order equal to estimate[race == "Black"].

Finally, pipe to ungroup().

... |> 
  group_by(zone, ...) |>
  ...(sort_order = estimate[... == "Black"]) |>
  ungroup()

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup()

Notice how we capitalized "Black". This is because we converted race to title case with str_to_title() in Wisdom.
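
To see what the grouped mutate() is doing, here is a small sketch on a toy tibble with made-up estimates. Within each zone and sex group exactly one row has race equal to "Black", so its estimate is copied to every row of that group.

```r
library(dplyr)

# Toy predictions: two zones, one sex, two races (made-up estimates).
preds <- tibble(
  zone = c("A", "A", "B", "B"),
  sex = "Female",
  race = c("Black", "White", "Black", "White"),
  estimate = c(0.30, 0.25, 0.20, 0.15)
)

sorted <- preds |>
  group_by(zone, sex) |>
  # The single "Black" estimate in each group is recycled across the group.
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup()

sorted
```

Both rows for a given zone now share the same sort_order, so the White and Black estimates for that zone stay together when the zones are reordered.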

Exercise 11

Now pipe to mutate() and set the argument zone to equal reorder_within(zone, sort_order, sex).

... |> 
  ...(zone = reorder_within(..., sort_order, ...))

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex))

Note that the function reorder_within() is from the tidytext package that we called earlier. This function saved us roughly 10+ lines of code.
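
A rough base R sketch of the idea behind reorder_within() is shown below. It assumes the function works by pasting the facet value onto each label and then reordering the combined labels. The "___" separator and the toy values are assumptions for illustration, not the package's actual source code.

```r
# Toy data: the same zones rank differently within each sex (made-up values).
zone <- c("A", "B", "A", "B")
sex <- c("Female", "Female", "Male", "Male")
sort_order <- c(0.30, 0.20, 0.10, 0.40)

# Make each label unique per facet, then order the levels by sort_order.
combined <- paste(zone, sex, sep = "___")
zone_within <- reorder(combined, sort_order)

levels(zone_within)
```

Because each label now carries its facet value, zone A can sit in a different position in the Female panel than in the Male panel. The cost is longer, uglier axis labels, which have to be cleaned up when plotting.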

Exercise 12

Now we should plot our findings. Continue the pipe to ggplot(), and inside aesthetics, map x to zone and color to race.

Then add geom_errorbar(), with the arguments aes(ymin = conf.low, ymax = conf.high), width = 0.2, and position = position_dodge(width = 0.5).

... |>
  ggplot(aes(x = ..., color = ...)) +
    geom_errorbar(...(ymin = conf.low, ... = conf.high), 
                  ... = 0.2,
                  ... = position_dodge(... = 0.5))

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5))

We added the position_dodge() argument to ensure that the bars don't touch, but instead are slightly spread apart so if there's any overlap, we can see the difference clearly.

Exercise 13

Now let's add the points. Add geom_point() with the arguments aes(y = estimate), size = 1, and position = position_dodge(width = 0.5). Note that this is the same position_dodge() setting that we used in the last step. We need it here too so that the points line up with their error bars.

... +
  geom_point(...(y = ...), 
             ... = 1, 
             position = position_dodge(width = ...))

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5)) +
  geom_point(aes(y = estimate), 
             size = 1, 
             position = position_dodge(width = 0.5))

Exercise 14

Let's make 2 plots, one for each value of sex. Add facet_wrap(). Inside this, add ~sex and set scales to equal "free_x".

... +
    facet_wrap(~sex, ... = "free_x")

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5)) +
  geom_point(aes(y = estimate), 
             size = 1, 
             position = position_dodge(width = 0.5)) +
  facet_wrap(~ sex, scales = "free_x")

Exercise 15

Now let's do something about those ugly x-axis labels.

First, let's make them show only the zone and drop the sex suffix that reorder_within() added. To do this, add scale_x_reordered() to the previous code.

Second, let's make the labels a bit smaller so that they're easier to read. To do this, add the following to the previous code: theme(axis.text.x = element_text(size = 8)).

... +
    scale_x_...() +
    ...(axis.text.x = ..._text(... = 8))

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5)) +
  geom_point(aes(y = estimate), 
             size = 1, 
             position = position_dodge(width = 0.5)) +
  facet_wrap(~ sex, scales = "free_x") +
  scale_x_reordered() +
  theme(axis.text.x = element_text(size = 8))

Exercise 16

Additionally, let's put the y-axis labels in percent (%) format. Add scale_y_continuous(), setting labels to equal percent_format().

... +
  scale_y_continuous(... = percent_format())

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5)) +
  geom_point(aes(y = estimate), 
             size = 1, 
             position = position_dodge(width = 0.5)) +
  facet_wrap(~ sex, scales = "free_x") +
  scale_x_reordered() +
  theme(axis.text.x = element_text(size = 8)) +
  scale_y_continuous(labels = percent_format())

Exercise 17

Finish it off by adding a proper title, subtitle, axis titles, caption, and title for the key with labs(). Your graph should look something like this:

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5)) +
  geom_point(aes(y = estimate), 
             size = 1, 
             position = position_dodge(width = 0.5)) +
  facet_wrap(~ sex, scales = "free_x") +
  scale_x_reordered() +
  theme(axis.text.x = element_text(size = 8)) +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Zone", y = "Estimated Arrest Probability (%)",
       title = "Predicted Arrest Rate of New Orleans Motorists by Zones",
       subtitle = "Black motorists are more likely to get arrested during a traffic stop than White motorists.",
       color = "Race",
       caption = "Data from the Stanford Open Policing Project")

... +
  labs(...)

plot_predictions(fit_stops$fit,
                 newdata = "balanced",
                 condition = c("zone", "race", "sex"),
                 draw = FALSE) |> as_tibble() |> 
  group_by(zone, sex) |>
  mutate(sort_order = estimate[race == "Black"]) |>
  ungroup() |>
  mutate(zone = reorder_within(zone, sort_order, sex)) |>
  ggplot(aes(x = zone, 
             color = race)) +
  geom_errorbar(aes(ymin = conf.low, 
                    ymax = conf.high), 
                width = 0.2,
                position = position_dodge(width = 0.5)) +
  geom_point(aes(y = estimate), 
             size = 1, 
             position = position_dodge(width = 0.5)) +
  facet_wrap(~ sex, scales = "free_x") +
  scale_x_reordered() +
  theme(axis.text.x = element_text(size = 8)) +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Zone", y = "Estimated Arrest Probability (%)",
       title = "Predicted Arrest Rate of New Orleans Motorists by Zones",
       subtitle = "Black motorists are more likely to get arrested during a traffic stop than White motorists.",
       color = "Race",
       caption = "Data from the Stanford Open Policing Project")

Exercise 18

Create a new code chunk in stops.qmd. Label it with label: plot. Copy/paste the code which creates your graphic.

Command/Ctrl + Shift + K to ensure that it all works as intended.

At the Console, run:

tutorial.helpers::show_file("stops.qmd", start = -8)

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Exercise 19

Write a paragraph which summarizes the project in your own words. The first few sentences are the same as what you had at the end of the Courage section. But, since your question may have evolved, you should feel free to change those sentences. Add at least one sentence which describes at least one quantity of interest (QoI) and which provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)

question_text(NULL,
    message = "Using data from a study of New Orleans drivers, we seek to understand the relationship between driver race and the probability of getting arrested during a traffic stop. However, our data from both our Preceptor Table and our dataset may not fully represent the population as both may not be from the same time frame and some of our data may come from biased officers, who may target certain groups of individuals. However, these concerns don't appear to be valid in a substantial manner in either dataset, allowing us to continue in our process. We modeled `arrested` as a linear function of both `sex` and the interaction of `race` and `zone`. From this, we found that Males are more likely to get arrested than Females. Focusing on our question, there is no uncertainty associated with our estimate because it, itself, is an expression of uncertainty. We estimate that the probability of a Black driver in New Orleans getting arrested during a traffic stop is roughly 25%, compared to roughly 20% for White drivers.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Edit the summary paragraph in stops.qmd as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K.

Exercise 20

Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.

question_text(NULL,
    message = "Based on all of the errors in our data that we collected during Justice, our original data source to base our model on was already flawed. Additionally, our model can only predict estimates on paper, based on an ideal world. It doesn't account for many factors of the real world that we, ourselves, can't even describe. However, our estimate was formed to the best of our abilities given the data on hand, and all I would do to modify it would be to modify the confidence intervals: I would keep the upper confidence interval the same, but would significantly decrease the lower confidence interval to roughly 15%-17% to account for the future, where the values would most likely decrease based on the pattern of racism in the US decreasing over time.",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

Exercise 21

Rearrange the material in stops.qmd so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K to ensure that everything works.

At the Console, run:

tutorial.helpers::show_file("stops.qmd")

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Add rsconnect to the .gitignore file. You don't want your personal RPubs details stored in the clear on GitHub. Commit/push everything.

Exercise 22

Publish stops.qmd to RPubs. Choose a sensible slug. Copy/paste the url below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

Summary

This tutorial covered topics related to Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.

PPBDS/primer.tutorials documentation built on April 3, 2025, 3:11 p.m.
