library(learnr) library(tutorial.helpers) library(gt) library(tidyverse) library(primer.data) library(tidymodels) library(gtsummary) library(equatiomatic) library(marginaleffects) knitr::opts_chunk$set(echo = FALSE) options(tutorial.exercise.timelimit = 600, tutorial.storage = "local") # Because we want the undertainty about the true mean of men and women to be # large enough to show up in out plot in an interesting way, we pretend that we # only have 100 observations in the data, 50 men and 50 women. set.seed(10) x <- nhanes |> filter(age >= 18) |> select(height, sex) |> drop_na() |> slice_sample(n = 50, by = sex) fit_height <- linear_reg() |> set_engine("lm") |> fit(height ~ sex, data = x) plot <- plot_predictions(fit_height, condition = "sex")
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
The world confronts us. Make decisions we must.
The reason for making models is not, primarily, that making models is fun --- although it is! The reason is that the world confronts us. Make decisions we must. We must decide between options X or Y. We must choose from actions A, B and C. Confronted by a choice, we need to make a model of the world to help us choose wisely.
Imagine that you are in charge of ordering uniforms for bootcamp recruits in the Marine Corps for next year. There are many factors to consider: the cost of different designs, the number of male and female recruits, the distributions of heights and weights, and so on. What should you order?
A prudent question is one half of wisdom. - Francis Bacon
What is the average height of men and women?
Load tidyverse package.
library(tidyverse)
library(tidyverse)
We will be using the data set about US people's body shape and related measurements from the US National Health and Nutrition Examiniation Survey (NHANES) for the 2009-2010 and 2011-2012. See this link for further details about NHANES.
Load the primer.data package.
library(primer.data)
library(primer.data)
A version of the data from the National Health and Nutrition Examination Survey is available in the nhanes
tibble. This tibble consists r scales::comma(nrow(nhanes))
rows and r scales::comma(ncol(nhanes))
columns.
After loading primer.data in your Console, type ?nhanes
in the Console, and paste in the Description below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
In order to get a demographically diverse range of responses, minority groups were surveyed more heavily. However, this skewed the survey results to have an inaccurate demographic makeup. To address this, some observations from more common groups are resampled in the data.
Height is the broad topic of this tutorial. Given that topic, which variable in nhanes
should we use as our outcome variable?
question_text(NULL, message = "We will be using `height` as an outcome variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
height
is measured in centimeters.
nhanes |> select(height) |> drop_na() |> ggplot(aes(height)) + geom_histogram(bins = 120) + labs(title = "Height from NHANES", subtitle = "NHANES includes children and adults", x = "Height (cm)", y = "Number")
Let's imagine a brand new variable which does not exists in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, by manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome and if it is "Y," or whatever, it generates another potential outcome.
Describe this variable and how might we manipulate its value?
question_text(NULL, message = "Imagine a variable called `vitamin`. We can manipulate the value of `vitamin`, at least in theory, by providing children in the treatment group with vitamins throughout their childhood while not giving vitamins to children in the control group.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In this case, vitamin
is a binary variable for which 1
means that the individual ate vitamins growing up and 0
means they did not. And obviously, there is no right or wrong answer as long as the variable you suggest can be manipulated, at least in theory. But, going forward, we will use the variable vitamin
.
Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.
Given our choice of treatment variable vitamin
, how many potential outcomes are there for each individual? Explain why.
question_text(NULL, message = "There are 2 potential outcomes because the treatment variable `vitamin` takes on 2 posible values: take vitamin versus no vitamin.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. All of this stuff is a conceptual framework we apply to the data. It is never inherent in the data itself.
In a few sentences, specify two different values for the imaginary treatment variable vitamin
, for a single unit, and then guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.
question_text(NULL, message = "For a given individual, assume that the value of the treatment variables might be `take vitamin` or `no vitamin`. If the individual gets `take vitamin`, then `height` would be 175. If the individual gets `no vitamin`, then `height` would be 173. The causal effect on the outcome of a treatment of `vitamin` versus `no vitamin` is 175 - 173 --- i.e., the difference between two potential outcomes --- which equals 2, which is the causal effect.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The the definition of a causal effect as the difference between two potential outcomes. Of course, you can't just say that the causal effect is 2. The exact value depends on which potential outcome comes first in the subtraction and which second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.
Any causal connection means exploring the within row different between two potential outcomes. We don't need to look at any other rows to have that conversation.
Let's consider a predictive model. Which variable in nhanes
do you think might have an important connection to height
?
question_text(NULL, message = "`sex` is a potential variable that may relate to `height`.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
With a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be treatment variables. We assuming that all covariates are "fixed."
Specify two different groups of individuals which have specific value for sex
and which might have different average values for height
.
question_text(NULL, message = "Some individuals might have a value for `sex` of `Male`. Others might have a value of `Female`. Those two groups will, on average, have different values for the outcome variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.
You can only use causal language --- like "affect," "influence," "be associated with," "cause,", "causal effect," et cetera --- in your question if you are creating a causal model, one with a treatment variable which you might, at least in theory, manipulate and with at least two potential outcomes.
Write a predictive question which connects the outcome variable height
to sex
, the covariate of interest.
question_text(NULL, message = "What is the average height of men and women?", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We call the answer to this question is your "Quantity of Interest."
Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.
The only true wisdom is in knowing you know nothing. - Socrates
Our question:
What is the average height of men and women?
In your own words, describe the key components of Wisdom for working on a data science problem.
question_text(NULL, message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The central problem for Wisdon is: Can we use data from nhanes
to predict the average height of all men and women in the world now? When was the data collected? What does each row in the data represent? Is it for height of each male and female in the world now?
Define a Preceptor Table.
question_text(NULL, message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, it is easy to calculate the quantities of interest.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table does not include all the covariates which you will eventually including in your model. It only includes covariates which you need to answer your question.
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
question_text(NULL, message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This problem is predictive so there are only covariates. In our problem, the covariate is each individual's sex.
Create a Github repo called two-parameters
. Make sure to click the "Add a README file" check box.
Connect the Github repo to an R project on your computer. Give the R project the same name.
Select File -> New File -> Quarto Document ...
. Provide a title ("Two-Parameters"
) and an author (you). Render the document and save it as analysis.qmd
.
Edit the .gitignore
by adding *Rproj
. Save and commit this in the Git tab. Push the commit.
In the Console, run:
show_file(".gitignore")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 7)
Remove everything below the YAML header from analysis.qmd
and render the file. Command/Ctrl + Shift + K
renders the file, this automatically saves the file as well.
What are the units for this problem?
question_text(NULL, message = "Every individual in the world, one row per person. The rows of the Preceptor Table are the units, the objects on which the outcome is measured. ", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If the question is: What is the average height of adult males in the US, then the units are only men in the US. There is the interplay between the exact question and which units define the Preceptor Table.
Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.
What is the outcome for this problem?
question_text(NULL, message = "Height is the outcome variable. This is not the same thing as the answer to the question we have beeen asked. But, if we can build a model which explains/understands/predicts height for an individual, we can use that model to answer our questions.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
nhanes |> filter(age >= 18) |> drop_na() |> ggplot(aes(x = height)) + geom_histogram(bins = 50) + labs(title = "Adult Heights in the US in 2010", x = "Height (cm)", y = "Count", caption = "Source: National Health and Nutrition Examination Survey", subtitle = "The Average Height of Adults in the US in 2010 was around 162 cm" )
Regardless, the central lesson is always the same: You can never look at your data too much.
What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?
question_text(NULL, message = "Potential covariates are age, sex, race, etc. But you need to include `sex` as a covariate since the question we are asking includes that variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables for which we have data. Third, it is the set of covariates in the data which we end up using in the model.
What are the treatments, if any, for this problem?
question_text(NULL, message = "There are not treatment variables.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated and, thereby, creating two or more different potential outcomes for each unit.
What moment in time does the Preceptor Table refer to?
question_text(NULL, message = "Now.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This is often implicit in the question itself. One of our key roles as data scientists is to clarify the questions which we are asked. In this case, it seems clear that the questions refer to now, the present moment.
Define causal effect.
question_text(NULL, message = "A causal effect is the difference between two potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The point of the Rubin Causal Models is that the definition of a causal effect is the difference between two potential outcomes. So, there must be two (or more) potential outcomes for any causal model to make sense. This is simplest to discuss when the treatment only has two different values, thereby generating only two potential outcomes. But, if the treatment variable is continuous, (like income) then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.
What is the fundamental problem of causal inference?
question_text(NULL, message = "The fundamental problem of causal inference is that we can only observe one potential outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A person cannot experience both treatments at the same time. For instance, if you give a person vitamins and record their growth throughout the experiment, you cannot then rewind time, withhold the vitamins, and record their growth again under identical conditions.
How does the motto "No causal inference without manipulation." apply in this problem?
question_text(NULL, message = "The motto does not apply because this is a predictive, not causal, model.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We have to choose a variable that we can change to be the treatment. If we do not have such variable that we can manipulate, then we would have to create a predictive model instead. For example, if we were focused on height, one conclusion maybe: the average height of men is expected to be higher than that of women. Correlation does not mean causation, we cannot assume that sex directly makes people taller. In order to find a causation relationship, we would need to manipulate the treatment so that we can measure its effect on the outcome.
Describe in words the Preceptor Table for this problem.
question_text(NULL, message = "The Preceptor has three columns. There is a column for the ID, one for Height, and one for sex. Each row represents one individual.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table for this problem looks something like this:
#| echo: false tibble(ID = c("Person 1", "Person 2", "...", "Person 45,000", "Person 45,001", "..."), height = c("150", "172", "...", "160", "142", "..."), sex = c("Female", "Male", "...", "Male", "Female", "...")) |> gt() |> cols_label(ID = md("ID"), height = "Height (cm)")
This is just a small portion of the actual Preceptor Table, the real one would be much larger.
Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem.
In analysis.qmd
, load the tidyverse and the primer.data packages in a new code chunk. Label it as the setup code chunk by adding #| label: setup
. Render the file.
Notice that the file does not look good because it is has code that is showing and it also has messages. To take care of this, add #| message: false
to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all code echos from the whole file:
execute: echo: false
In the Console, run:
show_file("analysis.qmd", start = -5)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 6)
Render again. Everything looks nice because we have added code to make the file look better and more professional.
Run glimpse()
on nhanes
.
glimpse(nhanes)
glimpse(nhanes)
The nhanes
data set includes 15 variables, including physical attributes like weight and height. We do not care about most of these variables for our question.
We think of both age and height as numbers. And they are numbers! But R distinguishes between “integers” and “doubles,” only the second of which allow for decimal values. In the nhanes
data, age
is an integer and height
is a double.
Pipe nhanes
to select()
with age
, sex
, and height
as parameters.
nhanes |> select(...)
nhanes |> select(age, sex, height)
This is the distribution of height between male and female.
nhanes |> filter(age >= 18) |> select(age, sex, height) |> drop_na() |> ggplot(aes(height, fill = sex)) + geom_histogram(position = "Identity", alpha = 0.5, bins = 30) + labs(title = "Distribution of Height by Sex", subtitle = "On average, Men are taller than Women", x = "Height (cm)", y = "Number")
Most data sets have some NA values, we have to get rid of these so that we can use the data. Pipe nhanes
to filter()
. Setting age
greater than or equal to 18
. Then, continue the pipe to select()
, putting height
and sex
as arguments. Finally, drop the NA's using drop_na()
.
Copy previous code
nhanes |> filter(age >= ...) |> select(...,...) |> drop_na()
nhanes |> filter(age >= 18) |> select(height, sex) |> drop_na()
We have to clean the data so we can focus on the specific numbers to answer our question. Our question focuses on men and women so we have to narrow down our data to only ones that meet the requirements.
Copy the previous code, continue the pipe with slice_sample()
, set n
equal 50
, and by
equal sex
as arguments. Add set.seed(10)
in a separate line before creating the object.
Copy previous code
set.seed(10) ... |> slice_sample(n = ..., by = ...)
set.seed(10) nhanes |> filter(age >= 18) |> select(height, sex) |> drop_na() |> slice_sample(n = 50, by = sex)
Because we want the undertainty about the true mean of men and women to be large enough to show up in out plot in an interesting way, we pretend that we only have 100 observations in the data, 50 men and 50 women.
In analysis.qmd
, add a new code chunk to the QMD, copy/paste the pipeline above and assign the result to the new object x
:
set.seed(10) x <- nhanes |> filter(age >= 18) |> select(height, sex) |> drop_na() |> slice_sample(n = 50, by = sex)
Command/ctrl + Shift + K
follows. In the Console, run:
show_file("analysis.qmd", start = -5)
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Now that you have an object x
, a subset of nhanes
that has been cleaned and narrowed down to meet the requirement of our question.
In your own words, define "validity" as we use the term.
question_text(NULL, message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.
Provide one reason why the assumption of validity might not hold for the outcome variable: Height
or for one of the covariates. Use the words "column" or "columns" in your answer.
question_text(NULL, message = "Is the way people measure height in the survey similar to the way it is being measured now -- e.g. Are measurements taken with shoes on or shoes off --. Such issues can cause the `height` column in NHANES in 2010 and the column `height` in the Preceptor Table not refer to the same thing.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.
Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.
Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.
The first sentence is a general statement about the overall topic, mentioning both general class of the outcome variable and of at least one of the covariates. It is not connected to the initial "Imagine that you are XX" which set the stage for this project. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence to let the reader know, gently, about our topic.
The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question. The sentence can't be that long. Important aspects of the data include when/where it was gather, how many observations it includes and the organization (if famous) which collected it.
Type your two sentences below.
question_text(NULL, message = "Men and women vary in height. Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to analysis.qmd
, Command/Ctrl + Shift + K
, and then commit/push.
The arc of the moral universe is long, but it bends toward justice. - Theodore Parker
In your own words, name the four key components of Justice for working on a data science problem.
question_text(NULL, message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.
In your own words, define a Population Table.
question_text(NULL, message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, than that population must cover a broad universe of time and units since the Preceptor Table and the data are, themselves, often far apart from each other.
In your own words, define the assumption of "stability" when employed in the context of data science.
question_text(NULL, message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The longer the time period covered by the Preceptor Table (and the data), the more suspect the assumption of stability becomes.
Provide one reason why the assumption of stability might not be true in this case.
question_text(NULL, message = "One possible reason for the failure of the assumption of stability would be if immigrants since 2011 have a different average adult male height. This will change the overall average adult male height in America.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A change in time or the distribution of the data does not, in and of itself, demonstrate a violation of stability. Stability is about the parameters: $\beta_1$, $\beta_1$ and so on. Stability means these parameters are the same in the data as they are in the population as they are in the Preceptor Table.
We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table
. In your own words, define the assumption of "representativeness" when employed in the context of data science.
question_text(NULL, message = "Representativeness, or the lack thereof, concerns two relationship, among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.
We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "Since participation in the survey is voluntary, it could mean that, for example, taller men are more likely to answer this question.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
It is important that our data is accurately reflecting the population whom we want to measure. Our data might fail to represent people who do not want to participate in the study for various reasons. If these people have a different average height compared to the overall average, then the lack of their participation will impact our data.
The reason that representativeness is important is because, when it is violated, the estimates for the model parameters might be biased.
We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "The population covers the last 20 years or more. The rows from the Preceptor Table are just for today. If the height of men today is systematically different from the height of men over the 20 year period of the Population Table, then representativeness would not be true.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability looks across time periods. Representativeness looks within time periods.
Of course, it is sometimes the case that the Preceptor Table includes every row from the Population Table for that moment in time. In that case, the assumption representativeness is met, by definition, if we only consider that moment. So, in that case, the only possible violation of representativeness must involve a claim that this moment in time is not representative of the rest of the Population Table.
In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.
question_text(NULL, message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This assumption is only relevant for causal models. Our current data is not being used for a causal model so this assumption does not matter as much as the other ones.
Write one sentence which highlights a potential weakness in your model. This will almost always be derived from possible problems with the assumptions discussed above. We will add this sentence to our summary paragraph. So far, our version of the summary paragraph looks like this:
Men and women vary in height. Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women.
Of course, your version will be somewhat different.
question_text(NULL, message = "Men and women vary in height. Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women. Since the participation for the survey is voluntary, it could mean that, for example, taller people are more likely to answer this question, thus reduces representativeness of the data.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a weakness sentence to the summary paragraph in your QMD. You can modify your paragraph as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
, and then commit/push.
Courage is the commitment to begin without any guarantee of success. - Johann Wolfgang von Goethe
In your own words, describe the components of the virtue of Courage for analyzing data.
question_text(NULL, message = "Courage begins with the exploration and testing of different models. It concludes with the creation of a data generating mechanism.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.
Load the tidymodels package.
library(...)
library(tidymodels)
Because height
is a continuous variable, we assume that an adult male's height is produced from a normal distribution.
$$ height \sim Normal(\mu, \sigma^2)$$
Load the gtsummary package.
library(...)
library(gtsummary)
The link function connects the predictor (a combination of the model parameters and independent variables) on the right-hand side to the expected value of the dependent variable.
For a continuous outcome variable, we use a linear model for the link function:
$$\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots$$
Load the equatiomatic package.
library(...)
library(equatiomatic)
This is the basic mathematical structure of our model:
extract_eq(fit_height, wrap = TRUE, show_distribution = TRUE)
Of course, in a normal data science project, we would explore a variety of different combinations of independent variables before selecting a final set. Just pretend that we already did that.
Add library(tidymodels)
, library(gtsummary)
, and library(equatiomatic)
to the setup
code chunk in your QMD. Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "tidymodels|gtsummary|equatiomatic")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Recall that a categorical variable (whether character or factor) like sex
is turned into a $0/1$ "dummy" variable which is then re-named something like $sex_{Male}$. After all, we can't have words --- like "Male" or "Female" --- in a mathematical formula, hence the need for dummy variables.
Because our outcome variable is continuous, start to create the model by using linear_reg(engine = "lm")
.
linear_reg(...)
linear_reg(engine = "lm")
The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy $0/1$ variables. The presence of an intercept in most models means that we can't have $N$ categories. The "missing" category is incorporated into the intercept. Imagine you have a variable race
that has three values --- "black", "hispanic", and "white" --- then the model creates two 0/1 dummy variables, giving them names like $race_{hispanic}$ and $race_{white}$. The results for the first category are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied. However, note that there is no variable like this in our model.
Continue the pipe to fit(height ~ sex, data = x)
.
Copy previous code
... |> fit(... ~ sex, ... = x)
linear_reg(engine = "lm") |> fit(height ~ sex, data = x)
We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:
extract_eq(fit_height, intercept = "beta", use_coefs = TRUE)
There are three main differences between this representation of the model and our previous one. First, we replace the parameters with our best estimate of their values. Second, the error term is gone. Third, the dependent variable now has a "hat," indicating that it is our "fitted" value, our best guess as to the value of the outcome, given the values of the independent variables for any given unit.
Behind the scenes of this tutorial, an object called fit_height
has been created which has the result of the code above. Type fit_height
and hit "Run Code". This generates the same results as using print(fit_height)
.
fit_height
fit_height
The code formula includes sex
, a character variable with two possible values: "Male"
and "Female"
.
The math formula includes sexMale
, a 0/1 dummy variable.
In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.
Create a new code chunk in analysis.qmd
. Add two code chunk options: label: model
and cache: true
. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_height
.
Command/Ctrl + Shift + K
. It may take some time to render analysis.qmd
, depending on how complex your model is. But, by including cache: true
you cause Quarto to cache the results of the chunk. The next time you render analysis.qmd
, as long as you have not changed the code, Quarto will just load up the saved fitted object.
To confirm, Command/Ctrl + Shift + K
again. It should be quick.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add *_cache
to .gitignore
file. Commit and push. Cached objects are often large. They don't belong on Github.
We use the code formula to communicate behind the scene with the computer to construct the model. We then use the math formula to communicate with the readers about the result of the model.
Create another code chunk in analysis.qmd
. Add the chunk option: label: math
. In that code chunk, add something like the below. You may find it useful to add the coef_digits
argument to show fewer significant digits after the decimal.
extract_eq(fit_height, intercept = "beta", use_coefs = TRUE)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "extract")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
When you render your document, this formula will appear.
extract_eq(fit_height, intercept = "beta", use_coefs = TRUE)
This is our data generating mechanism. The code extracts the equation from the fit_height
model, labels the Intercept
as $\beta_0$ and substitute the estimated coefficients from the model to the equation.
Run tidy()
on fit_height
with the argument conf.int
set equal to TRUE
. The returns 95% intervals for all the parameters in our model.
tidy(..., conf.int = ...)
tidy(fit_height, conf.int = TRUE)
tidy()
is part of the broom package, used to summarize information from a wide variety of models.
Because we are Bayesians, we believe that there is a true value and that the confidence or uncertainty interval includes it at the stated level. This is different from the Frequentists' perception that if we followed the same approach in 100 similar problems then, 95% of the time, our confidence interval would include the true value.
tidy(fit_height, conf.int = TRUE) |> select("term", "estimate", "conf.low", "conf.high")
Write a sentence interpreting the estimate for the Intercept.
question_text(NULL, message = "Our (best) guess/estimate for the average height of adult females is 161cm.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Our mathematical formula for this problem is:
extract_eq(fit_height, intercept = "beta")
Recall that we cannot have both "Male" and "Female" categories in the model. Instead, the Female category is incorporated into the intercept and becomes the reference case. Thus, it represents the average height of adult female.
tidy(fit_height, conf.int = TRUE) |> select("term", "estimate", "conf.low", "conf.high")
Write a sentence interpreting the estimate for sexMale
.
question_text(NULL, message = "When comparing the average height of men versus women, our guess/estimate is that men are taller than women by about 14 centimeters.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Note how, whenever we consider non-treatment variables, we must never use terms like "cause," "impact" and so on. We can't make any statement which implies the existence of more than one potential outcome based on changes in non-treatment variables. We can't make any claims about within row effects. Instead, we can only compare across rows. Always use the phrase "when comparing X and Y" or something very similar.
tidy(fit_height, conf.int = TRUE) |> filter(term == "(Intercept)") |> select(conf.low, conf.high) |> round()
Write a sentence interpreting the confidence interval for Intercept.
question_text(NULL, message = "There is a 95% chance that the true value of the Intercept is within the range 159 and 163 cm.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
When looking at the confidence interval, we examine whether it excludes zero. If not, then we can't be sure if the relationship is positive or negative.
tidy(fit_height, conf.int = TRUE) |> filter(term == "sexMale") |> select(conf.low, conf.high) |> round()
Write a sentence interpreting the confidence interval for sexMale.
question_text(NULL, message = "There is a 95% chance that the true value of the coefficient of `sexMale` is within the range 11 and 17 centimeters.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Dummy variables must always be interpreted in the context of the base value for that variable, which is always included in the intercept. For example, the base value here is "Female/Male." (The base value is the first alphabetically by default for character variables. However, if it is a factor variable, you can change that by setting the order of the levels by hand.)
For interactive use, tidy()
is very handy. But, for presenting our results, we should use a presentation package like gtsummary, which includes handy functions like tbl_regression()
.
Run tbl_regression(fit_height)
.
tbl_regression(..)
tbl_regression(fit_height)
See this tutorial for a variety of options for customizing your table.
Create a new code chunk in analysis.qmd
. Add a code chunk option: label: table
. Add this code to the code chunk.
tbl_regression(fit_height)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", pattern = "tbl_regression")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add a sentence to your project summary.
Explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."
Recall the beginning of our version of the summary:
Men and women vary in height. Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women. Since the participation for the survey is voluntary, it could mean that, for example, taller people are more likely to answer this question, thus reduces representativeness of the data.
question_text(NULL, message = "Men and women vary in height. Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women. Since the participation for the survey is voluntary, it could mean that, for example, taller people are more likely to answer this question, thus reduces representativeness of the data. We model height as a linear function of sex.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to summary paragrah portion of your QMD. Command/Ctrl + Shift + K
, and then commit/push.
Temperance is a tree which as for its root very little contentment, and for its fruit calm and peace. - Buddha
In your own words, describe the use of Temperance in data science.
question_text(NULL, message = 'Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.', answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the questions with which we began. We create posteriors for the quantities of interest.
Load the marginaleffects package.
library(...)
library(marginaleffects)
We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.
What is the general topic we are investigating? What is the specific question we are trying to answer?
question_text(NULL, message = "We are investigating the topic of height. We then drill to a more specific question: What is the average height of all adult males?", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic broadly.
Enter this code into the exercise code block and hit "Run Code."
plot_predictions(fit_height, condition = "sex")
plot_predictions(fit_height, conditions = "sex")
plot_predictions(fit_height, condition = "sex")
Let's make the plot look more professional. Copy the previous code, add necessary title, subtitle, caption and label for the plot.
Remember this is what your graph should look like:
plot_predictions(fit_height, condition = "sex") + labs(title = "Average Height in Men and Women", subtitle = "On average, men are taller than women by around 14 cm.", x = NULL, y = "Height in cm", caption = "Data from a sample of 100 individuals from NHANES")
Copy previous code
... |> labs(title = ..., subtitle = ..., x = ..., y = ..., caption = ...)
plot_predictions(fit_height, condition = "sex") |> labs(title = "Average Height in Men and Women", subtitle = "On average, men are taller than women by around 14 cm.", x = NULL, y = "Height (cm)", caption = "Data from a sample of 100 individuals from NHANES")
However, there is always uncertainty in our estimate, and with this graph, we cannot communicate our uncertainty level about the difference in average height between men and women.
Let's use plot_comparisons()
to get a good estimate of the height difference between men and women, and the confidence interval thereof. Enter this code into the exercise code block and hit "Run Code."
plot_comparisons(fit_height, variables = "sex", type = "numeric", condition = "sex")
plot_comparisons(fit_height, variables = "sex", type = "numeric", condition = "sex")
plot_comparisons(fit_height, variables = "sex", type = "numeric", condition = "sex")
This prints out our estimate of the difference between men and women's height along with the range in which we are 95% confident that the real value would fall within.
Again, make the plot looks more professional by adding title, subtitle, caption, and labels. Remember this is what your graph should look like.
plot_comparisons(fit_height, variables = "sex", type = "numeric", condition = "sex") + labs(title = "Estimate of height difference between men and women", subtitle = "We are 95% confident that the real value of differences in average height \n lies between 10.7 and 16.8 cm", caption = "Data from a sample of 100 individuals from NHANES", x = NULL, y = "Average height difference (cm)")
Copy previous code
... + labs(title = ..., subtitle = ..., caption = ..., x = ..., y = ...)
plot_comparisons(fit_height, variables = "sex", type = "numeric", condition = "sex") + labs(title = "Height Difference Between Men and Women", subtitle = "We are 95% confident that the real value of the difference in average height \nlies between 10.7 and 16.8 cm", caption = "Data from a sample of 100 individuals from NHANES", x = NULL, y = "Expected height difference (cm)")
Add library(marginaleffects)
to the analysis.qmd
setup code chunk.
Create a new code chunk. Label it with label: plot
. Copy/paste the code which creates your graphic.
Command/Ctrl + Shift + K
to ensure that it all works as intended.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Write the last sentence of your summary paragraph. It describes at least one quantity of interest (QoI) and provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)
question_text(NULL, message = "Men and women vary in height. Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women. Since the participation for the survey is voluntary, it could mean that, for example, taller people are more likely to answer this question, thus reduces representativeness of the data. We model height as a linear function of sex. Our best guest for the average height of women is 161 cm, we are 95% confident that its true value is between 158 and 164 while men, on average, are expected to be taller in women by 14 cm, and we are also 95% confident that the real value of differences in average height lies between 10.7 and 16.8.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a final sentence to your summary paragraph in your QMD as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
.
As always, note that your estimate depends on all the assumptions in your model being true. But that is a lie! All your assumptions are not true. So, your mean estimate might be wrong, and your confidence intervals are certainly too narrow.
Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.
question_text(NULL, message = "As we would tell our boss, it would not be shocking to find out that the actual average height was less or more than our estimate. This is because a lot of the assumptions we make during the process of building a model, the processes in Wisdom, are subject to error. Perhaps our data did not match the future as well as we had hoped. In such cases, increase the confidence interval since the assumptions of your model are always false.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Rearrange the material in analysis.qmd
so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K
to make sure that everything works.
At the Console, run:
tutorial.helpers::show_file("analysis.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This is the version of your QMD file which your teacher is most likely to take a close look at.
Publish your rendered QMD to Rpubs. Choose a sensible slug. Copy/paste the url below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add rsconnect
to the .gitignore
file. You don't want your personal Rpubs details stored in the clear on Github. Commit/push everything.
This tutorial covered Chapter 5: Two Parameters of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
We started the Tutorial with a question:
What is the average height of men and women?
Using data from the National Health and Nutrition Examination Survey conducted from 2009 to 2011 by the Centers for Disease Control and Prevention, we seek to estimate the average height of men and women. Since the participation for the survey is voluntary, it could mean that, for example, taller people are more likely to answer this question, thus reduces representativeness of the data. We model height as a linear function of sex. Our best guest for the average height of women is 161 cm, we are 95% confident that its true value is between 158 and 164 while men, on average, are expected to be taller in women by 14 cm, and we are also 95% confident that the real value of differences in average height lies between 10.7 and 16.8.
plot_predictions(fit_height, condition = "sex") |> labs(title = "Average Height in Men and Women", subtitle = "On average, men are taller than women by around 14 cm.", x = NULL, y = "Height (cm)", caption = "Data from a sample of 100 individuals from NHANES")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.