library(learnr)
library(tutorial.helpers)
library(gt)
library(tidyverse)
library(openintro)
library(tidymodels)
library(ranger)
library(marginaleffects)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 600, 
        tutorial.storage = "local")

# DK: Add a Dear Diary paragraph.

loans <- read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/refs/heads/master/csv/openintro/loans_full_schema.csv",
                  show_col_types = FALSE)

fit_loans_model <- rand_forest(mode = "regression",
                               engine = "ranger",
                               mtry = 1,
                               trees = 1e4,
                               min_n = 5)

loans_split <- initial_split(loans, prop = 0.50)
loans_train <- training(loans_split)
loans_test <- testing(loans_split)

fit_loans <- fit_loans_model |> 
  fit(interest_rate ~ paid_interest + paid_principal + inquiries_last_12m + 
        term + total_debit_limit + total_credit_limit,
      data = loans_train)
ggplot(loans, aes(x = interest_rate, color = term, group = term)) +
  geom_histogram(aes(y = after_stat(count)), bins = 30, 
                 position = "identity", fill = "white", alpha = 0.5) +
  geom_density(aes(y = after_stat(count)), color = "red") +
  geom_density(data = subset(loans, term == "60"), 
               aes(y = after_stat(count)), color = "blue") +
  # scale_color_manual(values = c("36" = "red", "60" = "blue")) +
  labs(title = "Distribution of Interest Rates by Term",
       x = "Interest Rate",
       y = "Count") +
  theme_classic()
This plot conditions on `term`, but is otherwise similar to the other two plots.
loans$term <- as.factor(loans$term)

ggplot(loans, aes(x = interest_rate, color = term, group = term)) +
  geom_histogram(aes(y = after_stat(count)), bins = 30, 
                 fill = "white", position = "identity", alpha = 0.5) +
  geom_density(aes(y = after_stat(count))) +
  scale_color_manual(values = c("36" = "red", "60" = "blue")) +
  labs(title = "Distribution of Interest Rates by Term",
       x = "Interest Rate",
       y = "Count") +
  theme_classic()
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.
The world confronts us. Make decisions we must.
Imagine that you want to take out a loan and are interested in the factors that banks consider when setting the interest rate. You want to maximize your chances of getting the lowest interest rate possible.
It is not the answer that enlightens, but the question. - Eugene Ionesco
Load the **tidyverse** package.
library(tidyverse)
library(tidyverse)
This data set was collected from the Lending Club platform, which allows users to lend money to other individuals. It only includes loans that were actually made, so don't mistake it for data on loan applications.
Load the **openintro** package.
library(openintro)
library(openintro)
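With **openintro** loaded, you can take a quick peek at the data set this tutorial works with. A minimal sketch, assuming the **tidyverse** is also loaded as above:

```r
library(tidyverse)
library(openintro)

# loans_full_schema is the data set which this tutorial later renames
# to `loans`. glimpse() shows each variable, its type, and a few values.
glimpse(loans_full_schema)
```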
Load the **ranger** package.
library(ranger)
library(ranger)
After loading **openintro**, type `?loans_full_schema` in the Console, and paste the Description below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This data comes from Lending Club (https://www.lendingclub.com/info/statistics.action), which provides a very large, open set of data on the people who received loans through their platform.
Interest rates are the broad topic of this tutorial. Given that topic, which variable in `loans` should we use as our outcome variable?
question_text(NULL, message = "The mosr useful outcome variable is `interest_rate`, which is the interest rate charged for the loan.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
We will use `interest_rate` as our outcome variable.
loans |> 
  filter(!is.na(interest_rate) & is.finite(interest_rate)) |> # Removes NA and Inf values
  ggplot(aes(x = interest_rate)) +
  geom_histogram(bins = 80, fill = "lightblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  scale_x_continuous(
    limits = c(0, max(loans$interest_rate, na.rm = TRUE)), # Force x-axis to start at 0
    breaks = seq(0, max(loans$interest_rate, na.rm = TRUE), by = 2),
    labels = function(x) paste0(x, "%"),
    expand = c(0, 0) # Ensures the axis starts exactly at 0
  ) +
  theme_minimal() +
  labs(x = "Interest Rate",
       y = "Count",
       title = "Count of Interest Rates Used for Loans",
       subtitle = "Note the absence of interest rates below 5%",
       caption = "Source: Lending Club at https://www.lendingclub.com/info/statistics.action")
A data scientist should always look at a plot of the outcome variable.
Let's imagine a brand new variable which does not exist in the data. This variable should be binary, meaning that it only takes on one of two values. It should also, at least in theory, be manipulable. In other words, if the value of the variable is "X," or whatever, then it generates one potential outcome, and if it is "Y," or whatever, it generates another potential outcome.
Describe this imaginary variable and how we might manipulate its value.
question_text(NULL, message = "Imagine a variable called `debt` which has a value of `1` if the person has debt, and `0` have no debt. We, meaning the organization in charge of making such phone calls, can tell randomly selected individuals that they have, 10,000 in debt making half the people we call have debt whil the others dont. Knowing this, we can tell them to take out loans and see the causal difference of debt with the data we collect.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Any data set can be used to construct a causal model as long as there is at least one covariate that we can, at least in theory, manipulate. It does not matter whether or not anyone did, in fact, manipulate it.
Given our (imaginary) treatment variable `debt`, how many potential outcomes are there for each person? Explain why.
question_text(NULL, message = "There are two potential outcomes because the treatment variable `debt` takes on two possible values: no debt, or having debt.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The same data set can be used to create, separately, lots and lots of different models, both causal and predictive. We can just use different outcome variables and/or specify different treatment variables. This is a conceptual framework we apply to the data. It is never inherent in the data itself.
In a few sentences, specify the two different values for the imaginary treatment variable `debt`, for a single unit, guess at the potential outcomes which would result, and then determine the causal effect for that unit given those guesses.
question_text(NULL, message = "For a given person, assume that the value of the treatment variable might be debt] or no debt. If the person gets debt, then interest rate would be higher, like 14%. If the person gets no debt, then the interest rate would be 8%. The causal effect on the outcome of a treatment of debt versus no debt is 14 - 8 --- i.e., the difference between two potential outcomes --- which equals 6(%), which is the causal effect.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The definition of a causal effect is the difference between two potential outcomes. Of course, you can't just say that the causal effect is 6. The exact value depends on which potential outcome comes first in the subtraction and which second. There is, perhaps, a default sense in which the causal effect is defined as treatment minus control.
Any causal connection means exploring the within-row difference between two potential outcomes. We don't need to look at any other rows to have that conversation.
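To make the within-row idea concrete, here is a minimal sketch with made-up numbers. In real data, only one potential outcome per row is ever observed:

```r
library(tidyverse)

# Hypothetical potential outcomes (interest rates, in percent).
# `y_debt` is the rate if the person has debt; `y_no_debt` is the
# rate if they do not. The causal effect is the within-row difference.
tibble(person = c("A", "B", "C"),
       y_debt = c(14, 12, 16),
       y_no_debt = c(8, 9, 11)) |>
  mutate(causal_effect = y_debt - y_no_debt)
```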
Let's consider a predictive model. Which variable in `loans` do you think might have an important connection to `interest_rate`?
question_text(NULL, message = "Really any of the covariates would work, but the ones we will be exploring in this tutorial will be paid_interest, paid_principal, inquiries_last_12m, term, total_debit_limit, and total_credit_limit", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 2)
With a predictive model, there is only one outcome for each individual unit. There are not two potential outcomes because we are not considering any of the covariates to be a treatment variable. We are assuming that the values of all covariates are "fixed."
Specify two different groups of loans which have specific values for `term` and which might have different average values for `interest_rate`.
question_text(NULL, message = "Consider two groups, the first with a value for term of 30 and the second with value 60. Those two groups might have different average values for the outcome variable.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In predictive models, do not use words like "cause," "influence," "impact," or anything else which suggests causation. The best phrasing is in terms of "differences" between groups of units with different values for the covariate of interest.
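As a sketch of this "differences between groups" framing, the comparison we have in mind can be computed directly, assuming `loans` has been loaded as above:

```r
# Average interest rate for each value of term (36 or 60 months).
loans |>
  group_by(term) |>
  summarize(avg_interest_rate = mean(interest_rate, na.rm = TRUE))
```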
Write a predictive question which connects the outcome variable `interest_rate` to `term`, the covariate of interest.
question_text(NULL, message = "What is the difference in interest rates between loans with a term of 30 months versus loans with a term of 60 months", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The same question could be asked of `paid_interest`, `paid_principal`, `inquiries_last_12m`, `total_debit_limit`, and `total_credit_limit`, but since those are quantitative variables, we will use `term`, the qualitative variable, to make things easier.
All we can know is that we know nothing. And that’s the height of human wisdom. - Leo Tolstoy
Our question:
What is the difference in interest rates between loans with a term of 36 months versus loans with a term of 60 months?
In your own words, describe the key components of Wisdom when working on a data science problem.
question_text(NULL, message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The central problem for Wisdom is: Can we use data from [XX: describe your data] to [XX: pick one of predict or understand or control] the variables/relationships in [XX: describe your Preceptor Table]?
Define a Preceptor Table.
question_text(NULL, message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantity of interest.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table does not include all the covariates which you will eventually include in your model. It only includes covariates which you need to answer your question.
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
question_text(NULL, message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will considered a treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This problem is predictive, so we will compare the average interest rates for loans with a term of 36 months and those with a term of 60 months.
Create a Github repo called `loans`. Make sure to click the "Add a README file" check box.
Connect the Github repo to an R project on your computer. Give the R project the same name.
Select `File -> New File -> Quarto Document ...`. Provide a title --- "Loans" --- and an author (you). Render the document and save it as `XX.qmd`.
Edit the `.gitignore` by adding `*Rproj`. Save and commit this in the Git tab. Push the commit.
In the Console, run:
show_file(".gitignore")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 7)
Remove everything below the YAML header from your QMD and render the file. `Command/Ctrl + Shift + K` first saves the file and then renders it.
What are the units for this problem?
question_text(NULL, message = "Loans", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Specifying the Preceptor Table forces us to think clearly about the units and outcomes implied by the question. The resulting discussion sometimes leads us to modify the question with which we started. No data science project follows a single direction. We always backtrack. There is always dialogue.
What is the outcome variable for this problem?
question_text(NULL, message = "Interest_rate", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Regardless, the central lesson is always the same: You can never look at your data too much.
What are some covariates which you think might be useful for this problem, regardless of whether or not they might be included in the data?
question_text(NULL, message = "Paid interest, paid principal, inquiries in last 12m, total debit limit, total_credit_limit, past debt, credit score, etc. ", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Many answers work here, as long as they include the covariates we use later in the modeling process.
The term "covariates" is used in at least three ways in data science. First, it is all the variables which might be useful, regardless of whether or not we have the data. Second, it is all the variables for which we have data. Third, it is the set of variables in the data which we end up using in the model.
What are the treatments, if any, for this problem?
question_text(NULL, message = "N/a", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Remember that a treatment is just another covariate which, for the purposes of this specific problem, we are assuming can be manipulated, thereby, creating two or more different potential outcomes for each unit.
What moment in time does the Preceptor Table refer to?
question_text(NULL, message = "WE care about the present since we are looking to take out a loan with the lowest interest rate in current time.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
loans$term <- as.factor(loans$term)

ggplot(loans, aes(x = interest_rate, color = term, group = term)) +
  geom_histogram(aes(y = after_stat(count)), bins = 30, 
                 fill = "white", position = "identity", alpha = 0.5) +
  geom_density(aes(y = after_stat(count))) +
  scale_color_manual(values = c("36" = "red", "60" = "blue")) +
  labs(title = "Distribution of Interest Rates by Term",
       x = "Interest Rate",
       y = "Count") +
  theme_classic()
Define a causal effect. Note that ours is still a predictive model.
question_text(NULL, message = "A causal effect is the difference between two potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
According to the Rubin Causal Model, there must be two (or more) potential outcomes for any discussion of causation to make sense. This is simplest to discuss when the treatment only has two different values, thereby generating only two potential outcomes.
What is the fundamental problem of causal inference?
question_text(NULL, message = "The fundamental problem of causal inference is that we can only observe one potential outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If the treatment variable is continuous (like income), then there are lots and lots of potential outcomes, one for each possible value of the treatment variable.
How does the motto "No causal inference without manipulation." apply in this problem?
question_text(NULL, message = "This is a predictive model, so it does not apply in this problem.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Describe in words the Preceptor Table for this problem.
question_text(NULL, message = "The Preceptor table should have a rows for id, all the covariates needed and the outcome.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table for this problem looks something like this:
```r
tibble(rownames = c("1", "2", "...", "10", "11", "..."),
       paid_interest = c("1015.9", "150.49", "...", "68.28", "908.46", "..."),
       paid_principal = c("984.4", "348.63", "...", "322.87", "856.49", "..."),
       total_credit_limit = c("70785", "28800", "...", "33114", "39433", "..."),
       inquiries_last_12m = c("6", "1", "...", "0", "4", "..."),
       term = c("60", "36", "...", "36", "60", "..."),
       total_debit_limit = c("11100", "16500", "...", "14700", "5000", "..."),
       interest_rate = c("14.07", "12.61", "...", "6.71", "15.05", "...")) |>
  gt() |>
  tab_header(title = "Preceptor Table") |>
  cols_label(rownames = md("ID"),
             paid_interest = md("Paid Interest"),
             paid_principal = md("Paid Principal"),
             total_credit_limit = md("Total Credit Limit"),
             inquiries_last_12m = md("Inquiries Last 12m"),
             term = md("Term"),
             total_debit_limit = md("Total Debit Limit"),
             interest_rate = md("Interest Rate")) |>
  tab_style(style = cell_borders(sides = "right"),
            locations = cells_body(columns = c(rownames))) |>
  tab_style(style = cell_text(align = "left", v_align = "middle", size = "large"),
            locations = cells_column_labels(columns = c(rownames))) |>
  cols_align(align = "center", columns = everything()) |>
  cols_align(align = "left", columns = c(rownames)) |>
  fmt_markdown(columns = everything()) |>
  tab_spanner(label = "Outcome", columns = c(interest_rate)) |>
  tab_spanner(label = "Covariates", columns = c(paid_interest, paid_principal, total_credit_limit, inquiries_last_12m, term, total_debit_limit))
```
Like all aspects of a data science problem, the Preceptor Table evolves as we work on the problem.

### Exercise 14

<!-- SU: Has loans been added to primer.data? DK: Please check. openintro? -->
<!-- SU: It's in openintro, but as loans_full_schema, not the shortened loans name we gave it. -->
<!-- XX: Switch primer.data or whichever package you got your data from. -->

In your QMD, load the **tidyverse** and the **openintro** packages in a new code chunk. Label it as the setup code chunk by adding `#| label: setup`. Render the file.

Notice that the file does not look good because the code is visible and there are annoying messages. To take care of this, add `#| message: false` to remove all the messages in the setup chunk. Also add the following to the YAML header to remove all code echos from the whole file:
execute: 
  echo: false
In the Console, run:
show_file("analysis.qmd", start = -5)
CP/CR.

question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 6)
Render again. Everything looks nice because we have added code to make the file look better and more professional.

<!-- `Command/Ctrl + Shift + K` follows, perhaps with a `show_file("analysis.qmd", start = -5)`. -->
In your QMD, add a new code chunk below your libraries. Assign `loans_full_schema` to `loans`, like this: `loans <- loans_full_schema`. We will refer to the data set as `loans` to keep things simple.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
In your own words, define "validity" as we use the term.
question_text(NULL, message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Validity is always about the columns in the Preceptor Table and the data. Just because columns from these two different tables have the same name does not mean that they are the same thing.
Provide one reason why the assumption of validity might not hold for the outcome variable `interest_rate` or for one of the covariates. Use the words "column" or "columns" in your answer.
question_text(NULL, message = "The columns may not hold to be accurate because the loans dataset are loans that have been accepted in the past, while the columns of the preceptor table are rows that correspond to the present. The future is ever changing", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In order to consider the Preceptor Table and the data to be drawn from the same population, the columns from one must have a valid correspondence with the columns in the other. Validity, if true (or at least reasonable), allows us to construct the Population Table, which is the first step in Justice.
Because we control the Preceptor Table and, to a lesser extent, the original question, we can adjust those variables to be “closer” to the data that we actually have. This is another example of the iterative nature of data science. If the data is not close enough to the question, then we check with our boss/colleague/customer to see if we can modify the question in order to make the match between the data and the Preceptor Table close enough for validity to hold.
Over the course of this tutorial, we will be creating a summary paragraph. The purpose of this exercise is to write the first two sentences of that paragraph.
The first sentence is a general statement about the overall topic, mentioning both the general class of the outcome variable and at least one of the covariates. It is not connected to the initial "Imagine that you are XX" which set the stage for this project. That sentence can be rhetorical. It can be trite, or even a platitude. The purpose of the sentence is to let the reader know, gently, about our topic.
The second sentence does two things. First, it introduces the data source. Second, it introduces the specific question. The sentence can't be that long. Important aspects of the data include when/where it was gathered, how many observations it includes, and the organization (if famous) which collected it.
Type your two sentences below.
question_text(NULL, message = "Taking out loans through other people is a niche way to borrow money. Using data from the Lending Club dataset, we seek to predict the interest rate of a loans based on other factors, such as the term to pay it back in, and their total credit and debit limit.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your two sentences, edited or otherwise, to your QMD, `Command/Ctrl + Shift + K`, and then commit/push.
It is in justice that the ordering of society is centered. - Aristotle
In your own words, name the four key components of Justice for working on a data science problem.
question_text(NULL, message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Justice is about concerns that you (or your critics) might have, reasons why the model you create might not work as well as you hope.
In your own words, define a Population Table.
question_text(NULL, message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Population Table is almost always much bigger than the combination of the Preceptor Table and the data because, if we can really assume that both the Preceptor Table and the data are part of the same population, then that population must cover a broad universe of time and units since the Preceptor Table and the data are, themselves, often far apart from each other.
In your own words, define the assumption of "stability" when employed in the context of data science.
question_text(NULL, message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The longer the time period covered by the Preceptor Table (and the data), the more suspect the assumption of stability becomes.
Provide one reason why the assumption of stability might not be true in this case.
question_text(NULL, message = "The relationship between the Preceptor Table and dataset may cover a long enough period of time in which many changes have occurred. The assumption of stability might not be true in this case, if incidently the timwe make our Preceptor table is in of rapid inflation, and the dataset is from a period of economical stability.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A change in time or the distribution of the data does not, in and of itself, demonstrate a violation of stability. Stability is about the parameters: $\beta_0$, $\beta_1$ and so on. Stability means these parameters are the same in the data as they are in the population as they are in the Preceptor Table.
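One way to express the assumption in notation, using labels (hypothetical, for exposition only) to mark where each parameter applies:

$$\beta_1^{\text{data}} = \beta_1^{\text{population}} = \beta_1^{\text{Preceptor Table}}$$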
We use our data to make inferences about the overall population. We use information about the population to make inferences about the Preceptor Table: `Data -> Population -> Preceptor Table`. In your own words, define the assumption of "representativeness" when employed in the context of data science.
question_text(NULL, message = "Representativeness, or the lack thereof, concerns two relationships among the rows in the Population Table. The first is between the data and the other rows. The second is between the other rows and the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Ideally, we would like both the Preceptor Table and our data to be random samples from the population. Sadly, this is almost never the case.
We do not use the data, directly, to estimate missing values in the Preceptor Table. Instead, we use the data to learn about the overall population. Provide one reason, involving the relationship between the data and the population, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "The geographical area of where these people are taking loans out may not match up. If the dataset is mostly of people taking out a loan from the Europe area, it might not match up with the Preceptor table, which might be looking at people taking out loans in Asia and Oceania.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The reason that representativeness is important is because, when it is violated, the estimates for the model parameters might be biased.
We use information about the population to make inferences about the Preceptor Table. Provide one reason, involving the relationship between the population and the Preceptor Table, for why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability looks across time periods. Representativeness looks within time periods, for the most part.
In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.
question_text(NULL, message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This assumption is only relevant for causal models. We describe a model as "confounded" if this is not true. The easiest way to ensure unconfoundedness is to assign treatment randomly.
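In standard potential-outcomes notation, with treatment $T$, potential outcomes $Y(0)$ and $Y(1)$, and pre-treatment covariates $X$, unconfoundedness is often written as:

$$\big(Y(0), Y(1)\big) \perp T \mid X$$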
Provide one reason why the assumption of unconfoundedness might not be true (or relevant) in this case.
question_text(NULL, message = "XX. There is nothing harder for students than coming up with examples of possible confounds. So, your example should be a good one, should specify precisely how treatment assignment is correlated with the potential outcomes.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The great advantage of randomized assignment of treatment is that it guarantees unconfoundedness. There is no way for treatment assignment to be correlated with anything, including potential outcomes, if treatment assignment is random.
Write one sentence which highlights a potential weakness in your model. This will almost always be derived from possible problems with the assumptions discussed above. We will add this sentence to our summary paragraph. So far, our version of the summary paragraph looks like this:
XX: Paste our first two sentences here.
Of course, your version will be somewhat different.
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a weakness sentence to the summary paragraph in your QMD. You can modify your paragraph as you see fit, but do not copy/paste our answer exactly. `Command/Ctrl + Shift + K`, and then commit/push.
Courage is found in unlikely places. - J.R.R. Tolkien

Courage is being scared to death, but saddling up anyway. - John Wayne

Courage is going from failure to failure without losing enthusiasm. - Winston Churchill

Courage is the commitment to begin without any guarantee of success. - Johann Wolfgang von Goethe
In your own words, describe the components of the virtue of Courage for analyzing data.
question_text(NULL, message = "Courage starts with math, explores models, and then creates the data generating mechanism.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
A statistical model consists of two parts: the probability family and the link function. The probability family is the probability distribution which generates the randomness in our data. The link function is the mathematical formula which links our data to the unknown parameters in the probability distribution.
Load the **tidymodels** package.
library(...)
library(tidymodels)
The probability family is determined by the outcome variable $Y$.
If $Y$ is a continuous variable, the probability family is Normal, also known as Gaussian.
$$Y \sim N(\mu, \sigma^2)$$

If $Y$ is a binary variable (with exactly two possible values), the probability family is Bernoulli.
$$Y \sim \text{Bernoulli}(\rho)$$
where $\rho$ is the probability that one of the two possible values --- conventionally referred to as `1` or `TRUE` --- occurs. By definition, $1 - \rho$ is the probability of the other value.
If $Y$ is a categorical variable (with more than two possible values), the probability family is Multinomial.
$$Y \sim \text{Multinomial}(\rho_1, \rho_2, \rho_3, \ldots)$$
where $\rho_1 + \rho_2 + \rho_3 + \ldots = 1$.
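The link function then connects the parameter(s) of the chosen family to the covariates. For a continuous outcome modeled as Normal, a common choice is the identity link with a linear formula:

$$\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots$$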
Load the **marginaleffects** package.
library(...)
library(marginaleffects)
We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.
Add `library(tidymodels)` and `library(marginaleffects)` to the `setup` code chunk in your QMD. `Command/Ctrl + Shift + K`.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", pattern = "tidymodels|gtsummary|equatiomatic|marginaleffects")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Recall that a categorical variable (whether character or factor) like `sex` is turned into a $0/1$ "dummy" variable which is then re-named something like $sex_{Male}$. After all, we can't have words --- like "Male" or "Female" --- in a mathematical formula, hence the need for dummy variables.
Because our outcome variable is [XX: binary or continuous or multinomial or cumulative], start to create the model by using [XX: logistic_reg(engine = "glm") or linear_reg(engine = "lm") or multinom_reg(engine = "glmnet")].
XX
# XX
The same approach applies to a categorical covariate with $N$ values. Such cases produce $N-1$ dummy $0/1$ variables. The presence of an intercept in most models means that we can't have $N$ categories. The "missing" category is incorporated into the intercept. If `race` has three values --- "black", "hispanic", and "white" --- then the model creates two 0/1 dummy variables, giving them names like $race_{hispanic}$ and $race_{white}$. The results for the first category are included in the intercept, which becomes the reference case, relative to which the other coefficients are applied.
Continue the pipe to `fit(XX, data = XX)`.
... |> fit(..., data = ...)
# XX |> 
#   fit(XX, data = XX)
We can translate the fitted model into mathematics, including the best estimates of all the unknown parameters:
# XX: You might want to use the `coef_digits` argument to show fewer digits
# after the decimal. wrap = TRUE is necessary for large models.

# extract_eq(fit_XX, 
#            intercept = "beta", 
#            use_coefs = TRUE)
There are three main differences between this representation of the model and our previous one. First, we replace the parameters with our best estimate of their values. Second, the error term is gone. Third, the dependent variable now has a "hat," indicating that it is our "fitted" value, our best guess as to the value of the outcome, given the values of the independent variables for any given unit.
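In symbols, for a simple linear model with one covariate, the fitted version looks like:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$$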
Behind the scenes of this tutorial, an object called `fit_XX` has been created which is the result of the code above. Type `fit_XX` and hit "Run Code." This generates the same results as using `print(fit_XX)`.
fit_XX
# fit_XX
In data science, we deal with words, math, and code, but the most important of these is code. We created the mathematical structure of the model and then wrote a model formula in order to estimate the unknown parameters.
Create a new code chunk in your QMD. Add two code chunk options: `label: model` and `cache: true`. Copy/paste the code from above for estimating the model into the code chunk, assigning the result to `fit_XX`.
`Command/Ctrl + Shift + K`. It may take some time to render your QMD, depending on how complex your model is. But, by including `cache: true`, you cause Quarto to cache the results of the chunk. The next time you render your QMD, as long as you have not changed the code, Quarto will just load up the saved fitted object. To confirm, `Command/Ctrl + Shift + K` again. It should be quick.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 8)
Add `*_cache` to the `.gitignore` file. Commit and push. Cached objects are often large. They don't belong on Github.
Create another code chunk in your QMD. Add the chunk option: `label: math`. In that code chunk, add something like the below. You may find it useful to add the `coef_digits` argument to show fewer significant digits after the decimal.
extract_eq(fit_XX, intercept = "beta", use_coefs = TRUE)
`Command/Ctrl + Shift + K`.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", pattern = "extract")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
When you render your document, this formula will appear.
# extract_eq(fit_XX, 
#            intercept = "beta", 
#            use_coefs = TRUE)
This is our data generating mechanism.
Run `tidy()` on `fit_XX` with the argument `conf.int` set equal to `TRUE`. This returns 95% intervals for all the parameters in our model.
tidy(..., conf.int = ...)
# tidy(fit_XX, conf.int = TRUE)
`tidy()` is part of the **broom** package, used to summarize information from a wide variety of models.
For interactive use, `tidy()` is very handy. But, for presenting our results, we should use a presentation package like **gtsummary**, which includes handy functions like `tbl_regression()`.
Run `tbl_regression(fit_XX)`.
tbl_regression(...)
# tbl_regression(fit_XX)
See this tutorial for a variety of options for customizing your table.
Create a new code chunk in your QMD. Add a code chunk option: `label: table`. Add this code to the code chunk.
tbl_regression(fit_XX)
`Command/Ctrl + Shift + K`.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", pattern = "tbl_regression")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add a sentence to your project summary.
Explain the structure of the model. Something like: "I/we model Y [the concept of the outcome, not the variable name] as a [linear/logistic/multinomial/ordinal] function of X [and maybe other covariates]."
Recall the beginning of our version of the summary:
XX: Include what we suggested at the end of Justice
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Read our answer. It will not be the same as yours. You can, if you want, change your answer to incorporate some of our ideas. Do not copy/paste our answer exactly. Add your sentence, edited or otherwise, to the summary paragraph portion of your QMD. `Command/Ctrl + Shift + K`, and then commit/push.
Temperance is a tree which has for its root very little contentment, and for its fruit calm and peace. - Buddha

Temperance is the greatest of all virtues. It subdues every passion and emotion, and almost creates a Heaven upon Earth. - Joseph Smith Jr.

Temperance is a bridle of gold; he, who uses it rightly, is more like a god than a man. - Robert Burton

Temperance is the firm and moderate dominion of reason over passion and other unrighteous impulses of the mind. - Marcus Tullius Cicero

Temperance to be a virtue must be free, and not forced. - Philip Massinger

Temperance is simply a disposition of the mind which binds the passion. - Thomas Aquinas
In your own words, describe the use of Temperance in data science.
question_text(NULL, message = "Temperance uses the data generating mechanism to answer the question with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Courage gave us the data generating mechanism. Temperance guides us in the use of the DGM — or the “model” — we have created to answer the question with which we began. We create posteriors for the quantities of interest.
What is the specific question we are trying to answer?
question_text(NULL, message = "XX: This might be the same question as we started with in Wisdom. But it is also OK if it is different. I think . . .", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Data science projects almost always begin with a broad topic of interest. Yet, in order to make progress, we need to drill down to a specific question. This leads to the creation of a data generating mechanism, which can now be used to answer lots of questions, thus allowing us to explore the original topic more broadly.
At the Console, run this code:
plot_predictions(fit_XX, condition = c("XX", "XX"))
CP/CR, although note that there will be no response.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Create a new code chunk in your QMD. Label it with `label: plot`. Copy/paste the code which creates your graphic.

`Command/Ctrl + Shift + K` to ensure that it all works as intended.
At the Console, run:
tutorial.helpers::show_file("XX.qmd", start = -8)
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Write the last sentence of your summary paragraph. It describes at least one quantity of interest (QoI) and provides a measure of uncertainty about that QoI. (It is OK if this QoI is not the one that you began with. The focus of a data science project often changes over time.)
question_text(NULL, message = "XX", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Add a final sentence to your summary paragraph in your QMD as you see fit, but do not copy/paste our answer exactly. `Command/Ctrl + Shift + K`.
Write a few sentences which explain why the estimates for the quantities of interest, and the uncertainty thereof, might be wrong. Suggest an alternative estimate and confidence interval, if you think either might be warranted.
question_text(NULL, message = "XX: This is another example in which the quality of your answer is important. You might or might not suggest an alternate estimate. I always adjust the estimate toward my own subjective sense of a long-run average and/or typical value and/or zero. But that is not necessary. However, you should always increase the confidence interval since the assumptions of your model are always false.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Rearrange the material in your QMD so that the order is graphic, paragraph, math and table. Doing so, of course, requires sensible judgment. For example, the code chunk which creates the fitted model must occur before the chunk which creates the graphic. `Command/Ctrl + Shift + K` to ensure that everything works.
At the Console, run:
tutorial.helpers::show_file("XX.qmd")
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This is the version of your QMD file which your teacher is most likely to take a close look at.
Publish your rendered QMD to Rpubs. Choose a sensible slug. Copy/paste the resulting url below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Add `rsconnect` to the `.gitignore` file. You don't want your personal Rpubs details stored in the clear on Github. Commit/push everything.
Copy/paste the url to your Github repo.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This tutorial supports Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.