# XX: The first packages are ones which students don't know about or load in the
# Console. learnr and tutorial.helpers are required for the tutorial to work at
# all. gt is needed because we show a Preceptor Table to the students, which is
# built with this package, even though we don't show students how to do so.

library(learnr)
library(tutorial.helpers)
library(gt)

# XX: Any package from here on is something that we want students to explicitly
# load in the tutorial. This serves two purposes. First, it provides an
# occasion for knowledge drops. Second, it reminds students that these packages
# must be loaded in the Console if they want the relevant code from the
# tutorial to work in the Console. The most common package to be added to this
# section, and which should also be loaded by students, is a data source like
# primer.data. It should be placed after library(tidyverse) since that is when
# it is loaded in the tutorial.

# Some models, like ordinal regressions, do not work with tidymodels. So, for
# those tutorials, we replace library(tidymodels) with library(MASS), or
# whichever package we need to fit the model. Obviously, it is nice if fitted
# objects from that package work with our usual tools. In general, gtsummary
# works with everything and marginaleffects is also ecumenical. equatiomatic can
# be more fussy.

library(tidyverse)
library(tidymodels)
library(gtsummary)
library(equatiomatic)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local")

# XX: Never include setup code that takes more than a few seconds to run. See
# https://ppbds.github.io/tutorial.helpers/articles/instructions.html#data for
# background.

# XX: You need to create a version of your model in this setup code chunk. We
# use an example here, fit_XX, which you should obviously replace and use your
# own name for, but always starting with `fit_`.

fit_XX <- linear_reg(engine = "lm") |> 
  fit(att_end ~ sex + treatment + age, data = primer.data::trains)
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
The world confronts us. Make decisions we must.
Imagine that you are ... XX
A prudent question is one half of wisdom. - Francis Bacon

The power to question is the basis of all human progress. - Indira Gandhi

The important thing is not to stop questioning. - Albert Einstein

It is not the answer that enlightens, but the question. - Eugene Ionesco
Load the tidyverse package.
library(tidyverse)
library(tidyverse)
Load the primer.data package.
library(primer.data)
library(primer.data)
A version of the data from XX is available in the XX
tibble.
After loading primer.data in your Console, type ?XX
in the Console, and paste in the Description below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
XX is the broad topic of this tutorial. Given that topic, which variable in XX
should we use as our outcome variable?
question_text(NULL, message = "XX: A sentence about the outcome variable which we will be using.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
We will use XX
as our outcome variable.
A data scientist should always look at a plot of the outcome variable.
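For example, a quick histogram is usually enough at this stage. The sketch below assumes the tidyverse is loaded and uses the example outcome from the setup code chunk (att_end from primer.data::trains); substitute your own outcome variable.

# A minimal sketch, assuming the example outcome from the setup code chunk.
primer.data::trains |>
  ggplot(aes(x = att_end)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of the outcome variable",
       x = "Attitude toward immigration (att_end)",
       y = "Count")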
Let's imagine a brand new variable which does not exist in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome and if it is "Y," or whatever, it generates another potential outcome.
Describe this imaginary variable and how we might manipulate its value.
# XX: In your answer, and for the next few questions, always treat this
# imaginary variable as real by putting backticks around the name. For example,
# with nhanes data, we might imagine a variable called `vitamin` for which `1`
# means that the individual ate vitamins growing up and `0` means they did not.
# Using the words "treatment group" and "control group" as part of your answer
# is often helpful since it reinforces the fact that we are using the Rubin
# Causal Model.

question_text(NULL, message = "XX: (This is an example answer.) Imagine a variable called `phone_call` which has a value of `1` if the person received a phone call urging them to vote and `0` if they did not receive such a phone call. We, meaning the organization in charge of making such phone calls, can manipulate this variable by deciding, either randomly or otherwise, whether or not we will call a specific individual.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.
Given our (imaginary) treatment variable XX
, how many potential outcomes are there for each [XX: unit]? Explain why.
question_text(NULL, message = "There are two potential outcomes because the treatment variable `XX` takes on two posible values: list-the-values-here, i.e., exposure to Spanish-speakers on a train paltform versus no such exposure.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. This is a conceptual framework we apply to the data. It is never inherent in the data itself.
In a few sentences, specify the two different values for the imaginary treatment variable XX
, for a single unit, guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.
# XX: Replace [XX: unit] with a better word below given the actual data set we
# are using. Replace all the XX terms as appropriate.

# XX: For a given individual, assume that the value of the treatment variable
# might be 'exposure to Spanish-speakers' or 'no exposure'. If the individual
# gets 'exposure to Spanish-speakers', then her attitude toward immigration
# would be 10. If the individual gets 'no exposure', then her attitude would be
# 8. The causal effect on the outcome of a treatment of exposure to
# Spanish-speakers versus no exposure is 10 - 8 --- i.e., the difference between
# two potential outcomes --- which equals 2, which is the causal effect.

# XX: If the outcome is a character variable, like Strongly Approve, then there
# is no simple metric on which we can pinpoint the causal effect. That is, the
# causal effect is still defined --- as, in this example, the difference between
# Strongly Approve and Neutral --- but can not be expressed as a number, at
# least without further work.

question_text(NULL, message = "For a given [XX: unit], assume that the value of the treatment variable might be [XX: treatment] or [XX: control]. If the [XX: unit] gets [XX: treatment], then [XX: the outcome] would be [XX: a number/character]. If the [XX: unit] gets [XX: control], then [XX: the outcome] would be [XX: a different number or character]. The causal effect on the outcome of a treatment of [XX: treatment] versus [XX: control] is [XX: a number] - [XX: a different number] --- i.e., the difference between two potential outcomes --- which equals [XX: the causal effect], which is the causal effect.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The definition of a causal effect is the difference between two potential outcomes. Of course, you can't just say that the causal effect is 10. The exact value depends on which potential outcome comes first in the subtraction and which second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.
Any causal connection means exploring the within-row difference between two potential outcomes. We don't need to look at any other rows to have that conversation.
Let's consider a predictive model. Which variable in [XX: the tibble]
do you think might have an important connection to XX: the outcome variable
?
question_text(NULL, message = "XX: Describe one of the key covariates whose connection to the outcome variable we might want to explore.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
With a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be a treatment variable. We assume that the values of all covariates are "fixed."
Specify two different groups of [XX: units] which have specific values for [XX: covariate] and which might have different average values for the [XX: outcome].
question_text(NULL, message = "XX: Consider two groups, the first with a value for [XX: covariate] of [XX: a value] and the second with value [XX: a different value]. Those two groups might have different average values for the outcome variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.
Write a [XX: choose causal or predictive] question which connects the outcome variable XX
to XX
, the covariate of interest.
# XX: If it is causal, you should use key causal language in the question, like
# "What is the causal effect of the treatment on the outcome?" Example: "What is
# the causal effect of exposure to Spanish-speakers on attitudes toward
# immigration?" If the model is predictive, the question should clearly compare
# two groups of units. "What is the difference in the outcome variable between
# two groups of units?" Example: "What is the difference in immigration
# attitudes between Democrats and Republicans?" In both cases, the word
# "average" is implicit in the question.

question_text(NULL, message = "XX: Give your question.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We call the answer to this question your "Quantity of Interest."
Our Quantity of Interest might appear too specific, too narrow to capture the full complexity of the topic. There are many, many numbers which we are interested in, many numbers that we want to know. But we don't need to list them all here! We just need to choose one of them since our goal is just to have a specific number which helps to guide us in the creation of the Preceptor Table and, then, the model.
The only true wisdom is in knowing you know nothing. - Socrates

Patience is the companion of wisdom. - Saint Augustine

Wonder is the beginning of wisdom. - Socrates

Wisdom begins in wonder. - Plato

The doorstep to the temple of wisdom is a knowledge of our own ignorance. - Benjamin Franklin

It is the province of knowledge to speak, and it is the privilege of wisdom to listen. - Oliver Wendell Holmes Sr.

All we can know is that we know nothing. And that's the height of human wisdom. - Leo Tolstoy
Our question:
XX: Repeat the question with which you ended The Question topic.
In your own words, describe the key components of Wisdom when working on a data science problem.
question_text(NULL, message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The central problem for Wisdom is: Can we use data from [XX: describe your data] to [XX: pick one of predict or understand or control] the variables/relationships in [XX: describe your Preceptor Table]?
Define a Preceptor Table.
question_text(NULL, message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantity of interest.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
question_text(NULL, message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will considered a treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Create a Github repo called XX
. Make sure to click the "Add a README file" check box.
Connect the Github repo to an R project on your computer. Give the R project the same name.
Select File -> New File -> Quarto Document ...
. Provide a title -- "[XX: Your Title]"
-- and an author (you). Render the document and save it as XX.qmd
.
Edit the .gitignore
by adding *Rproj
. Save and commit this in the Git tab. Push the commit.
In the Console, run:
show_file(".gitignore")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 7)
Remove everything below the YAML header from your QMD and render the file. Command/Ctrl + Shift + K
first saves the file and then renders it.
What are the units for this problem?
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.
What is the outcome variable for this problem?
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Regardless, the central lesson is always the same: You can never look at your data too much.
What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?
question_text(NULL, message = "XX. Answer should be some sensible variables which are plausibly connected to the outcome. Some should be in the data but some shouldn't. You might also add that `You need to have variable X` if the question itself mentions that variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables for which we have data. Third, it is the set of variables in the data which we end up using in the model.
What are the treatments, if any, for this problem?
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated, thereby creating two or more different potential outcomes for each unit.
What moment in time does the Preceptor Table refer to?
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
## XX: Make a nice looking plot which shows the outcome variable on the y-axis
## and the most important/interesting covariate/treatment on the x-axis. This
## doesn't really go here, but we want to show the plot after discussing
## covariates and treatments above. We don't show the code. The plot does not
## need to be fancy, but it should be competent, with a title, a subtitle which
## highlights the main takeaway from the plot, axis labels and so on.
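One possible sketch of such a plot, assuming the example variables from the setup code chunk (att_end and treatment from primer.data::trains) and that the tidyverse is loaded; replace them with your own outcome and covariate, and write a real title and subtitle.

# A minimal sketch; swap in your own outcome and covariate/treatment.
primer.data::trains |>
  ggplot(aes(x = treatment, y = att_end)) +
  geom_boxplot() +
  labs(title = "XX: Outcome by the key covariate/treatment",
       subtitle = "XX: One sentence stating the main takeaway",
       x = "Treatment",
       y = "Attitude toward immigration (att_end)")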
Define a causal effect. [XX: Include this if the model is predictive: (Note that the model in this tutorial is predictive, not causal. We just want to make sure you understand what a causal model is.)]
question_text(NULL, message = "A causal effect is the difference between two potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
According to the Rubin Causal Model, there must be two (or more) potential outcomes for any discussion of causation to make sense. This is simplest to discuss when the treatment only has two different values, thereby generating only two potential outcomes.
What is the fundamental problem of causal inference?
question_text(NULL, message = "The fundamental problem of causal inference is that we can only observe one potential outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.
How does the motto "No causal inference without manipulation." apply in this problem?
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Describe in words the Preceptor Table for this problem.
question_text(NULL, message = "XX. Make sure your words given an excellent description of the Preceptor Table which you are about to show the student.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table for this problem looks something like this:
Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem.
In your QMD, load the tidyverse and the primer.data packages in a new code chunk. Label it as the setup code chunk by adding #| label: setup
. Render the file.
Notice that the file does not look good because the code is visible and there are annoying messages. To take care of this, add #| message: false
to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all code echoes from the whole file:
execute: 
  echo: false
In the Console, run:
show_file("analysis.qmd", start = -5)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 6)
Render again. Everything looks nice because we have added code to make the file look better and more professional.
In your own words, define "validity" as we use the term.
question_text(NULL, message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.
Provide one reason why the assumption of validity might not hold for the outcome variable XX
or for one of the covariates. Use the words "column" or "columns" in your answer.
question_text(NULL, message = "XX: Answers to validity question should always use the word 'column(s)'. You should make your answer longer than the one we expect from students, ideally mentioning potential problems with both the outcome and a covariate.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.
Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.
Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.
The first sentence is a general statement about the overall topic, mentioning both the general class of the outcome variable and that of at least one of the covariates. It is not connected to the initial "Imagine that you are XX" which set the stage for this project. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence is to let the reader know, gently, about our topic.
The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question. The sentence can't be that long. Important aspects of the data include when/where it was gathered, how many observations it includes, and the organization (if famous) which collected it.
Type your two sentences below.
question_text(NULL, message = "XX; Make your own answer excellent!", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to your QMD, Command/Ctrl + Shift + K
, and then commit/push.
Justice is truth in action. - Benjamin Disraeli

The arc of the moral universe is long, but it bends toward justice. - Theodore Parker

Justice delayed is justice denied. - William E. Gladstone

It is in justice that the ordering of society is centered. - Aristotle

Charity is no substitute for justice withheld. - Saint Augustine
In your own words, name the four key components of Justice for working on a data science problem.
question_text(NULL, message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.
In your own words, define a Population Table.
question_text(NULL, message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units since the Preceptor Table and the data are, themselves, often far apart from each other.
In your own words, define the assumption of "stability" when employed in the context of data science.
question_text(NULL, message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Provide one reason why the assumption of stability might not be true in this case.
question_text(NULL, message = "XX. Again, students read these 'official' answers as closely as anything else you will write. Make your example precise and excellent.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A change in time or the distribution of the data does not, in and of itself, demonstrate a violation of stability. Stability is about the parameters: $\beta_0$, $\beta_1$ and so on. Stability means these parameters are the same in the data as they are in the population as they are in the Preceptor Table.
We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: Data -> Population -> Preceptor Table
. In your own words, define the assumption of "representativeness" when employed in the context of data science.
question_text(NULL, message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the data and the other rows. The second is between the other rows and the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.
We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.
# XX: In your answer, try not to use the concept of time, even though, in
# theory, it is perfectly reasonable to do so. Instead, focus on why the data
# might not be representative of the population at that moment in time.

question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The reason that representativeness is important is because, when it is violated, the estimates for the model parameters might be biased.
We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.
# XX: Again, try not to use the concept of time. We want to save examples of
# changes caused by time for the discussion about stability. Instead, focus on
# why the Preceptor Table might not be representative of the population at that
# moment in time.

question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability looks across time periods. Representativeness looks within time periods, for the most part.
In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.
question_text(NULL, message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.
Provide one reason why the assumption of unconfoundedness might not be true (or relevant) in this case.
question_text(NULL, message = "XX. There is nothing harder for students than coming up with examples of possible confounds. So, your example should be a good one, should specify precisely how treatment assignment is correlated with the potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The great advantage of randomized assignment of treatment is that it guarantees unconfoundedness. There is no way for treatment assignment to be correlated with anything, including potential outcomes, if treatment assignment is random.
Write one sentence which highlights a potential weakness in your model. This will almost always be derived from possible problems with the assumptions discussed above. We will add this sentence to our summary paragraph. So far, our version of the summary paragraph looks like this:
XX: Paste our first two sentences here.
Of course, your version will be somewhat different.
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a weakness sentence to the summary paragraph in your QMD. You can modify your paragraph as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
, and then commit/push.
Courage is found in unlikely places. - J.R.R. Tolkien

Courage is being scared to death, but saddling up anyway. - John Wayne

Courage is going from failure to failure without losing enthusiasm. - Winston Churchill

Courage is the commitment to begin without any guarantee of success. - Johann Wolfgang von Goethe
In your own words, describe the components of the virtue of Courage for analyzing data.
question_text(NULL, message = "Courage starts with math, explores models, and then creates the data generating mechanism.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.
Load the tidymodels package.
library(...)
library(tidymodels)
The probability family is determined by the outcome variable $Y$.
If $Y$ is a continuous variable, the probability family is Normal, also known as Gaussian.
$$Y \sim N(\mu, \sigma^2)$$

If $Y$ is a binary variable (with exactly two possible values), the probability family is Bernoulli.
$$Y \sim \text{Bernoulli}(\rho)$$
where $\rho$ is the probability that one of the two possible values --- conventionally referred to as 1
or TRUE
--- occurs. By definition, $1 - \rho$ is the probability of the other value.
If $Y$ is a categorical variable (with more than two possible values), the probability family is Multinomial.
$$Y \sim \text{Multinomial}(\rho_1, \rho_2, \rho_3, \ldots)$$
where $\rho_1 + \rho_2 + \rho_3 + \ldots = 1$.
Load the gtsummary package.
library(...)
library(gtsummary)
The link function, the basic mathematical structure of the model, is (mostly) determined by the type of outcome variable.
For a continuous outcome variable, we use a linear model for the link function:
$$\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots$$
For a binary outcome variable, we use a log-odds model:
$$ \log\left[ \frac { \rho }{ 1 - \rho } \right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots $$

The link functions for categorical and cumulative variables are also built out of log-odds link functions.
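As a small aside, base R's plogis() is the inverse of this log-odds (logit) link, so any fitted log-odds value can be mapped back to a probability; this sketch is just for intuition, not part of the required tutorial code.

# plogis() converts log-odds to a probability; qlogis() is the logit itself.
plogis(1.5)           # about 0.82
qlogis(plogis(1.5))   # recovers 1.5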
Load the equatiomatic package.
library(...)
library(equatiomatic)
This is the basic mathematical structure of our model:
extract_eq(fit_XX, intercept = "beta", wrap = TRUE, show_distribution = TRUE)
Of course, in a normal data science project, we would explore a variety of different combinations of independent variables before selecting a final set. Just pretend that we already did that.
Load the marginaleffects package.
library(...)
library(marginaleffects)
We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.
Add library(tidymodels)
, library(gtsummary)
, library(equatiomatic)
, and library(marginaleffects)
to the setup
code chunk in your QMD. Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", pattern = "tidymodels|gtsummary|equatiomatic|marginaleffects")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Recall that a categorical variable (whether character or factor) like sex
is turned into a $0/1$ "dummy" variable which is then re-named something like $sex_{Male}$. After all, we can't have words --- like "Male" or "Female" --- in a mathematical formula, hence the need for dummy variables.
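A small, self-contained illustration, using a made-up tibble rather than the real data, of how R builds that dummy behind the scenes:

# Hypothetical data, for illustration only.
toy <- tibble::tibble(sex = c("Male", "Female", "Male"))

# model.matrix() shows the design matrix R constructs: an intercept
# column plus a 0/1 column named sexMale.
model.matrix(~ sex, data = toy)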
Because our outcome variable is [XX: binary or continuous or multinomial or cumulative], start to create the model by using [XX: logistic_reg(engine = "glm") or linear_reg(engine = "lm") or multinom_reg(engine = "glmnet")]
.
XX
# XX
The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy $0/1$ variables. The presence of an intercept in most models means that we can't include dummy variables for all $N$ categories. The "missing" category is incorporated into the intercept. If race
has three values --- "black", "hispanic", and "white" --- then the model creates two 0/1 dummy variables, giving them names like $race_{hispanic}$ and $race_{white}$. The results for the first category are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied.
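The same trick, sketched with a hypothetical three-level variable, shows where the $N-1$ dummies and the reference category come from:

# Hypothetical data, for illustration only.
toy <- tibble::tibble(race = c("black", "hispanic", "white"))

# Two dummy columns (racehispanic, racewhite) are created; "black" is
# absorbed into the intercept as the reference category.
model.matrix(~ race, data = toy)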
Continue the pipe to fit(XX, data = XX)
.
... |> fit(..., data = ...)
# XX |> 
#   fit(XX, data = XX)
We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:
# XX: You might want to use the `coef_digits` argument to show fewer digits
# after the decimal. wrap = TRUE is necessary for large models.

# extract_eq(fit_XX, 
#            intercept = "beta", 
#            use_coefs = TRUE)
There are three main differences between this representation of the model and our previous one. First, we replace the parameters with our best estimate of their values. Second, the error term is gone. Third, the dependent variable now has a "hat," indicating that it is our "fitted" value, our best guess as to the value of the outcome, given the values of the independent variables for any given unit.
Behind the scenes of this tutorial, an object called fit_XX
has been created which is the result of the code above. Type fit_XX
and hit "Run Code." This generates the same results as using print(fit_XX)
.
fit_XX
## fit_XX
In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.
Create a new code chunk in your QMD. Add two code chunk options: label: model
and cache: true
. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to fit_XX
.
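A sketch of what that chunk might contain, reusing the example model from this tutorial's setup code chunk and assuming primer.data is loaded in your setup chunk; your formula, data, and model name will differ.

#| label: model
#| cache: true

fit_XX <- linear_reg(engine = "lm") |>
  fit(att_end ~ sex + treatment + age, data = trains)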
Command/Ctrl + Shift + K
. It may take some time to render your QMD, depending on how complex your model is. But, by including cache: true
you cause Quarto to cache the results of the chunk. The next time you render your QMD, as long as you have not changed the code, Quarto will just load up the saved fitted object.
To confirm, Command/Ctrl + Shift + K
again. It should be quick.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 8)
Add *_cache
to .gitignore
file. Commit and push. Cached objects are often large. They don't belong on Github.
Create another code chunk in your QMD. Add the chunk option: label: math
. In that code chunk, add something like the below. You may find it useful to add the coef_digits
argument to show fewer digits after the decimal.
extract_eq(fit_XX, intercept = "beta", use_coefs = TRUE)
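For example, to round the displayed coefficients to two digits:

extract_eq(fit_XX, 
           intercept = "beta", 
           use_coefs = TRUE,
           coef_digits = 2)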
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", pattern = "extract")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
When you render your document, this formula will appear.
# extract_eq(fit_XX, 
#            intercept = "beta", 
#            use_coefs = TRUE)
This is our data generating mechanism.
Run tidy()
on fit_XX
with the argument conf.int
set equal to TRUE
. This returns 95% intervals for all the parameters in our model.
tidy(..., conf.int = ...)
# tidy(fit_XX, conf.int = TRUE)
tidy()
is part of the broom package, used to summarize information from a wide variety of models.
For interactive use, tidy()
is very handy. But, for presenting our results, we should use a presentation package like gtsummary, which includes handy functions like tbl_regression()
.
Run tbl_regression(fit_XX)
.
tbl_regression(...)
# tbl_regression(fit_XX)
See this tutorial for a variety of options for customizing your table.
Create a new code chunk in your QMD. Add a code chunk option: label: table
. Add this code to the code chunk.
tbl_regression(fit_XX)
Command/Ctrl + Shift + K
.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", pattern = "tbl_regression")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add a sentence to your project summary.
Explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."
Recall the beginning of our version of the summary:
XX: Include what we suggested at the end of Justice
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to the summary paragraph portion of your QMD. Command/Ctrl + Shift + K
, and then commit/push.
Temperance is a tree which has for its root very little contentment, and for its fruit calm and peace. - Buddha

Temperance is the greatest of all virtues. It subdues every passion and emotion, and almost creates a Heaven upon Earth. - Joseph Smith Jr.

Temperance is a bridle of gold; he, who uses it rightly, is more like a god than a man. - Robert Burton

Temperance is the firm and moderate dominion of reason over passion and other unrighteous impulses of the mind. - Marcus Tullius Cicero

Temperance to be a virtue must be free, and not forced. - Philip Massinger

Temperance is simply a disposition of the mind which binds the passion. - Thomas Aquinas
In your own words, describe the use of Temperance in data science.
question_text(NULL, message = "Temperance uses the data generating mechanism to answer the question with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the question with which we began. We create posteriors for the quantities of interest.
What is the specific question we are trying to answer?
question_text(NULL, message = "XX: This might be the same question as we started with in Wisdom. But it is also OK if it is different. I think . . .", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic more broadly.
At the Console, run this code:
plot_predictions(fit_XX, condition = c("XX", "XX"))
CP/CR, although note that there will be no response.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Create a new code chunk in your QMD. Label it with label: plot
. Copy/paste the code which creates your graphic.
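A sketch of what that chunk might look like, assuming the plot_predictions() call from above with the example variables from the setup chunk; since plot_predictions() returns a ggplot object, you can add labs() and other layers to polish it.

#| label: plot

plot_predictions(fit_XX, 
                 condition = c("age", "treatment")) +
  labs(title = "XX: Describe what is being predicted",
       subtitle = "XX: One sentence with the main takeaway",
       x = "Age",
       y = "Predicted attitude toward immigration")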
Command/Ctrl + Shift + K
to ensure that it all works as intended.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Write the last sentence of your summary paragraph. It describes at least one quantity of interest (QoI) and provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a final sentence to your summary paragraph in your QMD as you see fit, but do not copy/paste our answer exactly. Command/Ctrl + Shift + K
.
Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.
question_text(NULL, message = "XX: This is another example in which the quality of your answer is important. You might or might not suggest an alternate estimate. I always adjust the estimate toward my own subjective sense of a long-run average and/or typical value and/or zero. But that is not necessary. However, you should always increase the confidence interval since the assumptions of your model are always false.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Rearrange the material in your QMD so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. Command/Ctrl + Shift + K
to ensure that everything works.
At the Console, run:
tutorial.helpers::show_file("XX.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This is the version of your QMD file which your teacher is most likely to take a close look at.
Publish your rendered QMD to Rpubs. Choose a sensible slug. Copy/paste the resulting url below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add rsconnect
to the .gitignore
file. You don't want your personal Rpubs details stored in the clear on Github. Commit/push everything.
Copy/paste the url to your Github repo.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.