library(learnr) library(tutorial.helpers) library(tidyverse) library(gt) knitr::opts_chunk$set(echo = FALSE) options(tutorial.exercise.timelimit = 600, tutorial.storage = "local") # Key Data set.seed(10) urn <- tibble(color = c(rep("red", 400), rep("white", 600))) |> sample_frac() |> mutate(bead_ID = 1:1000) virtual_samples <- tibble(trial_ID = 1:1000) |> mutate(shovel = map(trial_ID, ~ sample_n(urn, size = 50))) |> mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) |> mutate(numb_beads = map_int(shovel, ~ length(.$color))) |> mutate(prop_red = numb_red / numb_beads) shovels_100 <- expand_grid(trial_ID = 1:100, shovel_size = 1:100) |> mutate(shovel = map(shovel_size, ~ slice_sample(urn, n = .))) |> mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) |> mutate(prop_red = numb_red / shovel_size) |> summarize(st_dev_p_hat = sd(prop_red), .by = shovel_size)
This tutorial covers Chapter 3: Sampling of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane. In this chapter, we will learn about sampling, the beginning of our journey toward inference. When we sample, we take some units from a population. With the data collected via sampling, we will create statistical models with just one unknown parameter. With such models, we can make inferences in order to answer questions. Along that journey, we learn to use the Cardinal Virtues to guide our actions.
In Chapter 3: Sampling of the Primer, we described a tactile sampling activity. We used a physical urn of beads and a physical shovel. We did this by hand so that we could develop our intuition about the ideas behind sampling. In this section, we mimic this physical sampling with virtual sampling, using a computer.
We begin this chapter with a specific problem:
An urn contains 1,000 identically sized red and white beads, mixed well together. What proportion, $p$, of the urn's beads are red?
In this section, we will create the following plot to answer this question.
plot_vs <- virtual_samples |> ggplot(aes(x = prop_red)) + geom_histogram(binwidth = 0.02, boundary = 0.4, color = "white") + labs(x = expression(hat(p)), y = "Count", title = "Distribution of 1,000 proportions red") plot_vs
Define the population in this problem.
question_text(NULL, message = "In our sampling activities, the population is the collection 1,000 identically sized red and white beads in the urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The urn is simply the container that holds all the beads. We will use the terms "urn" and "population" interchangeably because, in this example, they refer to the same thing.
Just by counting each bead in the urn, explain how we can answer the question we have.
question_text(NULL, message = "We can answer our question by counting the number of red beads out of the total 1,000 beads in the urn and then computing the proportion of red beads.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We refer to this method as a census which, assuming zero counting errors, returns the exact proportion of red beads in the urn. However, a census can be quite expensive in terms of time, energy, and money, or even impossible when the population is large.
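To make this concrete, here is a minimal sketch of a census using the virtual urn object that we will build later in this tutorial. It looks at every bead, with no sampling and no uncertainty.

# A census examines every bead: count the red ones and divide by the total.
urn |>
  summarize(prop_red = mean(color == "red"))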
In your own words, describe the act of sampling and how it relates to our activity with the beads.
question_text(NULL, message = "Sampling is the act of collecting a sample from the population. In our sampling activity, we use the shovels to collect a certain number of beads from the urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
What if we did not insert the shovel deep enough, so that only beads near the surface were collected? What if the white beads were slightly smoother than the red beads and kept slipping out of the shovel? What if someone added beads to the urn before we took another sample? Problems like these affect the quality of our sample, as we will discuss shortly.
In your own words, define what is necessary for a sample to be representative in our sampling activity.
question_text(NULL, message = 'Within a sample, the proportion of the red and white beads need to look "roughly" like the distribution inside the urn.', answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Consider the case in which we insert the shovel at the surface of the urn and take out a sample that contains 40 red beads and only 10 white beads. Can we say that this sample is representative of the beads in the urn? The answer is no.
In your own words, define what is necessary for a sample to be generalizable in our activity.
question_text(NULL, message = "The sample is generalizable if any results based on the sample of 50 beads can generalize to the broader population of 1,000 beads in the urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We take a sample and compute the proportion of red beads, which is our guess of the proportion of red beads within the urn. This question considers whether the proportion we just computed from the sample is a "good guess".
In your own words, define biased sampling and how it might apply to our activity.
question_text(NULL, message = "Our sampling activity is biased if certain beads in the urn are more likely to get in the shovel, or in other words, if every bead in the urn does not get a fair chance of being selected.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We can tell that our guess is a good guess by sampling several times and noticing how the results vary only slightly around our estimate. Sadly, repeated biased samples can produce the same pattern, even though the resulting estimate is wrong.
Provide one example for this problem where the sampling could be biased.
question_text(NULL, message = "Had the red beads been much smaller than the white beads, and therefore more prone to falling out of the shovel, our sample would have been biased.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Note that all of the above discussion refers back to our sampling activity and the way we carry it out. Small changes in this process can cause large changes in the outcome.
In your own words, define when a sampling procedure is random in this specific problem.
question_text(NULL, message = "The sampling procedure is random if we sample randomly from the population in an unbiased fashion. In our sampling activity, this would correspond to sufficiently mixing the urn before each use of the shovel and assuring that the red and white beads are identitical save for their differing colours.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
How can we tell if the urn is "sufficiently mixed"? Can we ensure that the urn is mixed in the same way each round? The answer is no, and thus we can never fully assume randomness in our sampling.
Create an urn
variable on the second line that is set to a tibble()
. Within the tibble()
set color
to the combination of rep("red", 400)
and rep("white", 600)
.
set.seed(10)
set.seed(10) urn <- tibble(... = c(rep("...", ...), rep("...", ...)))
set.seed(10) urn <- tibble(color = c(rep("red", 400), rep("white", 600)))
set.seed()
ensures that the beads in our virtual urn are always in the same order. This ensures that the figures in the book match their written descriptions. We want 40% of the beads to be red. The rep()
function will repeat the first argument a number of times specified by the second argument. We then combine our 400 red beads and 600 white beads using c()
.
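If the behavior of rep() and c() is unfamiliar, here is a small sketch of each on its own, using toy inputs.

# rep() repeats its first argument the number of times given by its second.
rep("red", 3)
# c() combines the two vectors into a single vector of length 5.
c(rep("red", 2), rep("white", 3))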
With the urn
variable, start a pipe into sample_frac()
.
urn <- ... |> sample_frac()
urn <- tibble(color = c(rep("red", 400), rep("white", 600))) |> sample_frac()
sample_frac()
keeps all the rows in the tibble but rearranges their order. We don't need to do this. A virtual urn does not care about the order of the beads. But we find it aesthetically pleasing to mix them up.
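As a small illustration of that behavior, consider a toy tibble instead of the urn: sample_frac() with no other arguments returns every row, just shuffled.

# All 5 rows come back, but in a random order.
tibble(x = 1:5) |>
  sample_frac()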
Finish the pipe with mutate()
and set bead_ID
to 1 through 1000.
urn <- ... |> mutate(... = 1:1000)
urn <- tibble(color = c(rep("red", 400), rep("white", 600))) |> sample_frac() |> mutate(bead_ID = 1:1000)
The first variable bead_ID
is used as an identification variable. None of the beads in the actual urn are marked with numbers. The second variable color
indicates whether a particular virtual bead is red or white.
Run the command slice_sample()
with urn
as the first argument and n = 1
as the second argument. Hit "Run Code."
slice_sample(..., ... = 1)
slice_sample(urn, n = 1)
The slice_sample()
function is like the shovel in physical sampling: it reaches into the urn
and takes out a desired number of beads. In this case, because we set n
equal to 1, the output is a 1x1 tibble.
Run the above command again, but this time change n
to 10
. Hit "Run Code"
slice_sample(..., ... = 10)
slice_sample(urn, n = 10)
Setting n
equal to 10 means that we want to take a sample of 10 beads from the urn
and thus it will return a 10x1 tibble. Note that each time we run this command, we get a different tibble; this is just like inserting the shovel into a real urn several times and drawing a different set of beads each time.
To simulate the process of real-world sampling, let’s take repeated samples of 50 beads from our virtual urn. To do so, create a tibble()
that has one variable trial_ID
that takes on the values 1 to 1000.
tibble(trial_ID = ...)
tibble(trial_ID = 1:1000)
This creates 1,000 trial IDs, one for each sample we will draw from the urn. For each trial, we sample 50 beads and calculate the proportion of red beads in that sample. Note that if we conducted 10,000 trials, as in Chapter 3, we would notice some more extreme values (much higher or lower) of the proportion of red beads.
Now pipe your results to the function mutate()
to create the variable shovel
, which is set to the function map()
. The first argument to map()
is trial_ID
. The second argument is slice_sample(urn, n = 50), preceded by a ~.
... |> mutate(shovel = map(..., ~ ...))
tibble(trial_ID = 1:1000) |> mutate(shovel = map(trial_ID, ~ slice_sample(urn, n = 50)))
The slice_sample()
call takes a sample of 50 beads from the urn. We then use the map()
function to repeat that sampling once for each value of trial_ID, storing each sample in the shovel list column.
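Here is a smaller sketch of the same idea, assuming the urn object from above, with only 3 trials and a shovel of 5 beads so the result is easy to inspect.

# map() returns a list, so shovel becomes a list column:
# each row holds its own 5-row tibble drawn from the urn.
tibble(trial_ID = 1:3) |>
  mutate(shovel = map(trial_ID, ~ slice_sample(urn, n = 5)))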
Continue your pipe with mutate()
to create the variable numb_red
. Set numb_red
to the function map_int()
. The first argument to map_int()
should be shovel
. The second argument should take the sum()
of where .$color
is equal to "red"
.
... |> mutate(... = map_int(..., ~...))
... |> mutate(... = map_int(..., ~ sum(.$color == "red")))
tibble(trial_ID = 1:1000) |> mutate(shovel = map(trial_ID, ~ slice_sample(urn, n = 50))) |> mutate(numb_red = map_int(shovel, ~sum(.$color == "red")))
R evaluates if color == "red"
, and treats TRUE
values like the number 1
and FALSE
values like the number 0
. So summing the number of TRUE
s and FALSE
s is equivalent to summing 1
’s and 0
’s. In the end, this operation counts the number of beads where color
equals “red”
.
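A tiny example of this coercion, using a made-up vector of colors:

# TRUE is treated as 1 and FALSE as 0, so sum() counts the TRUEs.
bead_colors <- c("red", "white", "red", "white", "white")
bead_colors == "red"      # TRUE FALSE TRUE FALSE FALSE
sum(bead_colors == "red") # 2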
Use mutate()
one last time to create the variable prop_red
. Set prop_red
to numb_red
divided by the sample size (in this exercise we are using a set sample size of 50).
... |> mutate(prop_red = ... / ...)
tibble(trial_ID = 1:1000) |> mutate(shovel = map(trial_ID, ~ slice_sample(urn, n = 50))) |> mutate(numb_red = map_int(shovel, ~sum(.$color == "red"))) |> mutate(prop_red = numb_red / 50)
prop_red
estimates our proportion of red beads in the urn. We will have 1,000 proportions of red beads, which we refer to as a sampling distribution.
Behind the scenes, we have assigned the results of that pipe to a new object: virtual_samples
. Type virtual_samples
and hit "Run Code."
virtual_samples
virtual_samples
Recall the difference between taking 1,000 trials and 10,000 trials: as the number of trials increases, the mean of the sampling distribution tends to get closer to the population proportion, but more trials also leave room for the occurrence of rare, extreme values.
Now start a pipe with virtual_samples
. Use ggplot()
to map prop_red
to the x-axis.
virtual_samples |> ggplot(aes(...))
virtual_samples |> ggplot(aes(x = prop_red))
Add the layer geom_histogram()
to create a histogram of our data.
... + geom_histogram()
virtual_samples |> ggplot(aes(x = prop_red)) + geom_histogram()
Within geom_histogram()
set binwidth
to .02, boundary
to .4, and color
to "white".
... + geom_histogram(binwidth = ..., boundary = ..., color = ...)
virtual_samples |> ggplot(aes(x = prop_red)) + geom_histogram(binwidth = 0.02, boundary = 0.4, color = "white")
Recall that each value of prop_red
is the proportion of beads which are red in one sample: our estimate, $\hat{p}$, of the true proportion $p$ of red beads in the urn.
To finish, use labs()
to give your graph the appropriate title and axis labels. See hint for guidance to create the symbol $\hat{p}$.
Reminder: This is what your plot should look like.
plot_vs
Within labs(), set x to expression(hat(p))
virtual_samples |> ggplot(aes(x = prop_red)) + geom_histogram(binwidth = 0.02, boundary = 0.4, color = "white") + labs(x = expression(hat(p)), y = "Count", title = "Distribution of 1,000 proportions red")
This visualization allows us to see how our virtual results differ from our tactile urn results. As we can see, there is some variation between them. This is not a cause for concern, as some sampling variation between results is always expected.
Standard errors (SE) quantify the effect of sampling variation on our estimates. In other words, they quantify how much we can expect the calculated proportions of a shovel’s beads that are red to vary from one sample to another sample to another sample, and so on. As a general rule, as sample size increases, the standard error decreases.
In this section, we will create the following plot that displays different standard deviations of red bead proportions for 100 different shovel sizes.
shovel_p <- shovels_100 |> ggplot(aes(x = shovel_size, y = st_dev_p_hat)) + geom_point() + labs(title = "Sampling Variation", subtitle = "Larger samples have less variation", x = "Shovel size", y = "Standard deviation of the proportion red") shovel_p
In your own words, define standard error.
question_text(NULL, message = "The standard error is the standard deviation of a sample statistic (aka point estimate), such as the proportion.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The standard error equals the standard deviation of the data divided by the square root of the sample size. Accordingly, the standard error is inversely proportional to the square root of the sample size. The larger the sample size, the smaller the standard error.
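In symbols, writing $\sigma$ for the standard deviation of the data and $n$ for the sample size:

$$ SE = \frac{\sigma}{\sqrt{n}} $$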
Run expand_grid()
with two arguments. The first argument is trial_ID = c(1:100)
. The second argument is shovel_size = c(1:100)
. We need a hundred trials to draw reasonable conclusions about what happens when the size of the shovel varies between 1 and 100.
expand_grid(trial_ID = ..., shovel_size = ...)
expand_grid(trial_ID = c(1:100), shovel_size = c(1:100))
The resulting tibble has 10,000 rows because 100 trials times 100 shovel sizes equals 10,000. In other words, for each shovel size, we will perform 100 simulations.
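If expand_grid() is new to you, a much smaller sketch shows what it does:

# expand_grid() creates one row for every combination of its arguments:
# 2 trials x 3 shovel sizes = 6 rows.
expand_grid(trial_ID = 1:2, shovel_size = 1:3)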
Continue the pipe with mutate()
, creating a new column called shovel
. Set shovel
equal to a map()
function, passing in shovel_size
as the first argument, and the slice_sample()
function as the second argument. Within slice_sample()
, the first argument should be urn
(the data we want to sample from), and then set n
equal to .
(we want to pass in the shovel_size
using map()
). Don't forget the ~
.
... |> mutate(shovel = map(..., ~ ... ))
... |> mutate(shovel = map(... , ~ slice_sample(..., n = ...)))
expand_grid(trial_ID = c(1:100), shovel_size = c(1:100)) |> mutate(shovel = map(shovel_size, ~ slice_sample(urn, n = .)))
In the second line, we use shovel_size
rather than trial_ID
as the mapping variable since we can no longer hard code the shovel size into the call to slice_sample()
.
Continue your pipe with mutate()
to create the variable numb_red
, which tells us the number of red beads present. Set numb_red
to the function map_int()
. The first argument to map_int()
should be shovel
. The second argument should take the sum()
of where the column color
of shovel
is equal to "red".
... |> mutate(... = map_int(..., ~ ...))
... |> mutate(... = map_int(..., ~ sum(.$color == "red")))
expand_grid(trial_ID = c(1:100), shovel_size = c(1:100)) |> mutate(shovel = map(shovel_size, ~ slice_sample(urn, n = .))) |> mutate(numb_red = map_int(shovel, ~ sum(.$color == "red")))
The purpose of setting the shovel_size
from 1 to 100 is that we want to illustrate the relationship between the sample size and the sample variation. Specifically, as the sample size increases, the sample-to-sample variation decreases, and our guesses at the true proportion of the urn’s beads that are red get more precise.
Continue your pipe from above, using mutate()
to create one final column called prop_red
which represents the proportion of red beads in a sample. Set prop_red
to numb_red
divided by the shovel_size
column.
... |> mutate(prop_red = ... / ...)
expand_grid(trial_ID = c(1:100), shovel_size = c(1:100)) |> mutate(shovel = map(shovel_size, ~ slice_sample(urn, n = .))) |> mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) |> mutate(prop_red = numb_red / shovel_size)
Why does variation decrease as sample size increases? If we use a large sample size like 100 or 500, our sample is much more representative of the population. As a result, the proportion red in our sample ($\hat{p}$) will be closer to the true population proportion ($p$).
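A rough simulation of that claim, assuming the true proportion is 0.4 and using rbinom() (sampling with replacement) as a stand-in for the shovel:

# Simulated standard deviation of p-hat for two sample sizes.
# Estimates from larger samples cluster more tightly around 0.4.
sd(rbinom(10000, size = 10, prob = 0.4) / 10)
sd(rbinom(10000, size = 100, prob = 0.4) / 100)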
Continue the pipe with summarize()
to create a new column named st_dev_p_hat
which is equal to the standard deviation of prop_red
. (sd()
calculates standard deviation). The second argument is .by
, set equal to shovel_size, so that st_dev_p_hat is computed separately for each shovel size.
... |> summarize(st_dev_p_hat = sd(...), .by = ...)
expand_grid(trial_ID = c(1:100), shovel_size = c(1:100)) |> mutate(shovel = map(shovel_size, ~ slice_sample(urn, n = .))) |> mutate(numb_red = map_int(shovel, ~ sum(.$color == "red"))) |> mutate(prop_red = numb_red / shovel_size) |> summarize(st_dev_p_hat = sd(prop_red), .by = shovel_size)
This will return a tibble with 2 columns, shovel_size
and st_dev_p_hat
, with one row for each of the 100 shovel sizes. By looking at the output, we can tell that as the shovel size increases, the standard deviation decreases.
Behind the scenes, we have assigned the result of the pipe to shovels_100
. Type shovels_100
and hit "Run Code."
shovels_100
shovels_100
The Central Limit Theorem states, more or less, that when sample means are based on larger and larger sample sizes, the sampling distribution of these sample means becomes both narrower and more bell-shaped. In other words, the sampling distribution increasingly follows a normal distribution and the variation of this sampling distribution gets smaller, meaning smaller standard errors.
Start a new pipe from shovels_100
. Use ggplot()
to map shovel_size
to the x-axis and st_dev_p_hat
to the y axis. Also, add the layer geom_point()
to create a scatterplot.
shovels_100 |> ggplot(aes(x = ..., y = ...)) + geom_point()
shovels_100 |> ggplot(aes(x = shovel_size, y = st_dev_p_hat)) + geom_point()
Each point in the graph is the shovel size and its associated standard deviation of the proportion red. Note that when the shovel size is 1, the standard deviation is highest, and the standard deviation decreases as the shovel size increases, as we saw in the output above.
Now use labs()
to label the x-axis "Shovel size" and the y-axis
"Standard deviation of the proportion red". You should also provide a title and subtitle.
Reminder: This is what your plot should look like.
shovel_p
... + labs(title = "Sampling Variation", subtitle = "Larger samples have less variation", x = "Shovel size", y = "Standard deviation of the proportion red")
shovels_100 |> ggplot(aes(x = shovel_size, y = st_dev_p_hat)) + geom_point() + labs(title = "Sampling Variation", subtitle = "Larger samples have less variation", x = "Shovel size", y = "Standard deviation of the proportion red")
This is the power of running many analyses at once using map functions and list columns: before, we could tell that the standard deviation was decreasing as the shovel size increased, but when only looking at shovel sizes of 25, 50, and 100, it wasn’t clear how quickly it was decreasing.
In one sentence, describe the relationship between standard error and the sample size.
question_text(NULL, message = "The standard error of an estimate decreases at the rate of the square root of the sample size.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
All we can know is that we know nothing. And that’s the height of human wisdom. - Leo Tolstoy
We have an urn with 1,000 identically sized red and white beads. We do not know how many of the beads are red and how many are white.
We are interested in the number of white and red beads which will come out of the urn in the future.
We decide to make a guess by drawing a sample of 50 beads from the urn. It turns out that 17 of them were red.
What proportion of beads in the urn are red?
In your own words, describe the key components of Wisdom when working on a data science problem.
question_text(NULL, message = "Wisdom requires the creation of a Preceptor Table, an examination of our data, and a determination, using the concept of validity, as to whether or not we can (reasonably!) assume that the two come from the same population.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
There are many things about the population that are unknown, which we refer to as population parameters. Since we were interested in the proportion of the urn's beads that were red, the population parameter is the population proportion, denoted as $p$.
Define a Preceptor Table.
question_text(NULL, message = "A Preceptor Table is the smallest possible table of data with rows and columns such that, if there is no missing data, we can easily calculate the quantities of interest.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
To answer the question we have, we need a table that contains the color of every bead in the urn now. With that information, we can easily compute the total number of red beads, divided by the total beads, to get the proportion.
Describe the key components of Preceptor Tables in general, without worrying about this specific problem. Use words like "units," "outcomes," and "covariates."
question_text(NULL, message = "The rows of the Preceptor Table are the units. The outcome is at least one of the columns. If the problem is causal, there will be at least two (potential) outcome columns. The other columns are covariates. If the problem is causal, at least one of the covariates will be a treatment.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Although we have no reason to believe that this urn is anything other than what it appears to be, we should always worry about what might be happening, about how we might be fooled.
What are the units for this problem?
question_text(NULL, message = "The units are the individual beads in the urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Do the units differ other than in color? We don't know! Perhaps red beads are much heavier than white beads. Perhaps the surface texture of red beads is much rougher.
What is/are the outcome/outcomes for this problem?
question_text(NULL, message = "The outcome is the color for each individual bead in the urn. There are only 2 possible values for the outcome, which is red or white.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In a typical urn problem, we just assume, or are told, that the beads are identical. But, outside of a simple urn like this, units always differ in ways of which you are not aware.
What are some covariates which you think might be useful for this problem?
question_text(NULL, message = "Covariates is the general term for all the variables which are not the **outcome**. Some examples of covariates in this problem might be the size of the beads, or the weight of the beads.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If the beads in the urn are not equally sized, then smaller beads that fit in the shovel more easily will be more likely to show up in our sample. In that case, bead size is a covariate that might distort our estimate of the proportion of red beads in the urn.
What are the treatments, if any, for this problem?
question_text(NULL, message = "There is no treatment since the question we are interested in is predictive, not causal.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If there were a treatment, it would need to be something that is, or could be, changed about individual beads. That would only be relevant if the question concerned the effect of that treatment.
What moment in time does the Preceptor Table refer to?
question_text(NULL, message = "The moment we are considering is the time when the beads were drawn from the urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table refers to "now," the moment in time to which the question (implicitly) refers. Although now is the most common relevant time for a Preceptor Table, there are occasions in which the question refers to a specific time in the future or, less commonly, to the past.
Describe in words the Preceptor Table for this problem.
question_text(NULL, message = "The Preceptor Table has one row for every bead in the urn, and one (outcome) column for its color.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The Preceptor Table for this problem looks something like this:
#| echo: false tibble(bead_ID = c("1", "2", "...", "200", "201", "...", "2078", "2079", "..."), color = c("white", "white", "...", "red", "white", "...", "red", "white", "...")) |> gt() |> tab_header(title = "Preceptor Table") |> cols_label(bead_ID = "ID", color = "Color") |> tab_style(cell_borders(sides = "right"), location = cells_body(columns = c(bead_ID))) |> cols_align(align = "center", columns = everything())
Write one sentence describing the data you have to answer your question.
question_text(NULL, message = "We have a sample of 50 red and white beads drawn from an urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Any description of the data should only refer to the actual data you have. Assumptions about the underlying population should be stated elsewhere. In this case, the data only includes red and white beads. Does the urn only have red and white beads? Maybe! We have been told that this is the case, but we don't know for sure.
Load the tidyverse package.
library(...)
library(tidyverse)
How much do we know about the shovel? Not much! We have been told that the shovel is not biased, and that every bead has an equal chance of being selected by the shovel, but without closer examination/testing, we can't be sure that this is true. Perhaps the shovel is coated with a material which adheres more tightly to red beads.
In your own words, define "validity" as we use the term.
question_text(NULL, message = "Validity is the consistency, or lack thereof, in the columns of the data set and the corresponding columns in the Preceptor Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We are worried that the column labeled color
in the Preceptor Table does not correspond to the column labeled color
in the data.
What can't we do if the assumption of validity is not true?
question_text(NULL, message = "We can't combine the Preceptor Table and the data in order to construct the Population Table.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Because this is just a toy example, this isn't much of a concern, but the longer the time period between sampling the data and the Preceptor Table, the more problems we might have.
Provide one reason why the assumption of validity might not hold for this problem.
question_text(NULL, message = 'Color change over time would violate the assumption of validity. Imagine that we took the sample several years ago and that the urn is stored in the sun. In that case, the red color of the beads might fade to white over time. In that case, "color" in the Preceptor Table would not be the same thing as "color" in the data.', answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
That scenario is fairly fanciful. The assumption of validity holds extremely well in toy examples involving urns.
Summarize the state of your work so far in one sentence. Make reference to the data you have and to the specific question you are trying to answer.
question_text(NULL, message = "Using a sample of 50 beads drawn from an urn in which all beads are either red or white, we seek to determine the proportion of beads which are red.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
It is in justice that the ordering of society is centered. - Aristotle
In your own words, name the four key components of Justice for working on a data science problem.
question_text(NULL, message = "Justice concerns four topics: the Population Table, stability, representativeness, and unconfoundedness.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We first ask whether the model is stable across different time periods; then we look at the rows associated with the data and, separately, the rows associated with the Preceptor Table to check whether each is representative of all units in the population. Lastly, and only for causal models, we consider unconfoundedness.
In your own words, define a Population Table.
question_text(NULL, message = "The Population Table includes a row for each unit/time combination in the underlying population from which both the Preceptor Table and the data are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
We must acknowledge that the sample from the urn could have been taken at any time, so the contents of the urn in the past (our data) could be different from the contents of the urn when we want to answer our question now (the Preceptor Table). As such, there is a wider population from which we could have collected our data: any time before we actually collected the sample, or any time after.
In your own words, define the assumption of "stability" when employed in the context of data science.
question_text(NULL, message = "Stability means that the relationship between the columns in the Population Table is the same for three categories of rows: the data, the Preceptor Table, and the larger population from which both are drawn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability is all about time. Is the relationship among the columns in the Population Table stable over time? In particular, is the relationship --- which is another way of saying "mathematical formula" --- at the time the data was gathered the same as the relationship at the (generally later) time referenced by the Preceptor Table?
Provide one reason why the assumption of stability might not be true in this case.
question_text(NULL, message = "Consider the previous example of turning the urn in the morning versus in the afternoon. The person turning the urn might feel tired after doing this task many times in the day and thus did not turn it properly, which resulted in the fact that red beads appeared more frequently. ", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
The difference can be even larger if the person mixing the urn today is different from the person doing it tomorrow. The longer the time period covered by the Preceptor Table (and the data), the more suspect the assumption of stability becomes.
In your own words, define the assumption of "representativeness" when employed in the context of data science.
question_text(NULL, message = "Representativeness, or the lack thereof, concerns two relationship, among the rows in the Population Table. The first is between the Preceptor Table and the other rows. The second is between our data and the other rows.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Stability looks across time periods. Representativeness looks within time periods.
Provide one reason why the assumption of representativeness might not be true in this case.
question_text(NULL, message = "The person turning the urn to collect the sample might favor red beads over white and try to take more red beads within each draw, and thus our sample is not representative of the urn.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
In your own words, define the assumption of "unconfoundedness" when employed in the context of data science.
question_text(NULL, message = "Unconfoundedness means that the treatment assignment is independent of the potential outcomes, when we condition on pre-treatment covariates.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Since this is a predictive question, we do not worry about unconfoundedness.
Summarize the state of your work so far in two or three sentences. Make reference to the data you have and to the question you are trying to answer. Feel free to copy from your answer at the end of the Wisdom Section. Mention one specific problem which casts doubt on your approach.
question_text(NULL, message = "Using a sample of 50 beads drawn from an urn in which all beads are either red or white, we seek to determine the proportion of beads which are red. Our estimate would be flawed if someone tried to collect red beads as much as possible when making the draw. ", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Courage is the commitment to begin without any guarantee of success. - Johann Wolfgang von Goethe
In your own words, describe the components of the virtue of Courage for analyzing data.
question_text(NULL, message = "Courage begins with the exploration and testing of different models. It concludes with the creation of a data generating mechanism.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
If we could only know two pieces of information about our data, what would they be? First, we need a measure of the center of the distribution (e.g., the mean or median), which shows where our data points are centered. Second, we need a measure of the variability of the distribution.
With 17 red beads in the sample of 50 beads, calculate the proportion of red beads in the sample. Type the formula in the chunk, assign the result to an object called x_bar
, type x_bar
on a separate line, and hit Run Code.
x_bar <- .../50 x_bar
x_bar <- 17/50 x_bar
Common sense suggests that this is a good estimate of the center of the posterior. To understand our center, we must understand how different (or how spread out) our data points are from one another. Thus, we need a measure like sd()
.
Type the command sd()
with the argument c(rep(0, 17), rep(1, 33))
. This vector represents our sample of 50 beads coded as 0s and 1s, and the command will return the standard deviation of the sample.
...(c(...(0, 17), ...(1, 33)))
sd(c(rep(0, 17), rep(1, 33)))
The mean or median is a good estimate for the center of the posterior and the standard error is a good estimate for the variability of the posterior, with +/- 2 standard errors covering 95% of the outcomes.
Continuing with the code above, calculate the standard error. Recall that the difference between the two is that, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size.
... / sqrt(...)
sd(c(rep(0, 17), rep(1, 33))) / sqrt(50)
The standard error measures the accuracy of a sample distribution as compared to the population by using the standard deviation. According to the formula, notice that larger sample sizes = lower standard errors = more accurate and representative guesses.
With the standard error, calculate the 95% confidence interval for the proportion of red beads in the urn. Recall the formula for confidence interval is:
$$ \text{CI} = \bar{x} \pm 2 \cdot SE $$
question_text(NULL, message = "With 95% confidence, the proportion of red beads in the urn is between 21% and 47%.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Standard error is really just a measure for how uncertain we are about something we do not know, the thing we are estimating. When we recall SE, we should remember that, all in all, it’s a complicated concept that can be distilled into: the way old people talk about confidence intervals.
Temperance is a bridle of gold; he, who uses it rightly, is more like a god than a man. - Robert Burton
In your own words, describe the use of Temperance in finishing your data science project.
question_text(NULL, message = "Temperance uses the data generating mechanism to answer the questions with which we began. Humility reminds us that this answer is always a lie. We can also use the DGM to calculate many similar quantities of interest, displaying the results graphically.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Recall the question we are trying to answer:
An urn contains 1,000 identically sized red and white beads, mixed well together. What proportion, $p$, of the urn's beads are red?
We will be creating the following graph to answer the question:
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count/sum(count)))) + labs(title = expression(paste("Posterior Distribution for ", rho)), subtitle = "There is a 95% chance for a value between 21% and 47%.", x = expression(paste("Proportion, ", rho, ", of Red Beads in Urn")), y = "Probability") + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format()) + theme_classic()
Create a tibble()
command with the argument obs
equal to rnorm(10000, mean = 0.34, sd = 0.067
)
tibble(... = rnorm(10000, ... = 0.34, sd = ...))
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067))
Your posterior for (almost) any population parameter is normally distributed with a mean equal to the sample mean and a standard deviation equal to the standard error.
Continue the pipe with ggplot()
, mapping x
to obs
in the aes()
argument.
... |> ggplot(...(x = ...))
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs))
The normal distribution has a bell-shaped curve, which is defined by two parameters: (1) the mean $\mu$, which locates the center of the distribution, and (2) the standard deviation $\sigma$, which determines the variation of values around that center. These two parameters are all we need to construct the distribution.
Add geom_density()
, setting y
equal to after_stat(count/sum(count))
in the aes()
argument.
... |> ggplot(aes(x = ...)) + geom_density(...(y = after_stat(... / sum(count))))
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count / sum(count))))
Some common probability distributions have names, depending on the output variable. A continuous outcome suggests a Gaussian (Normal) distribution, an outcome with two possible values suggests a Bernoulli distribution, and an outcome with more than two possible values suggests a Categorical distribution.
Add scale_x_continuous()
with the argument labels
equal scales::percent_format()
.
... + scale_x_continuous(... = scales::percent_format())
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count / sum(count)))) + scale_x_continuous(labels = scales::percent_format())
With the outcome we are interested in, the color of each individual bead, and the probability distribution of its value, we can write a complete sentence about our model. One example would be: We model the color of an individual bead as a draw from a Bernoulli distribution with unknown parameter $p$.
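In mathematical notation, that sentence corresponds to:

$$ \text{color}_i \sim \text{Bernoulli}(p) $$

where $\text{color}_i$ equals 1 if bead $i$ is red and 0 if it is white.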
Add scale_y_continuous()
with the argument labels
equal scales::percent_format()
.
... + scale_y_continuous(... = scales::percent_format())
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count/sum(count)))) + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format())
The number 10,000 within the rnorm()
is simply the number of draws we take from our assumed posterior distribution; more draws produce a smoother curve. Recall, by contrast, that the shovels_100 simulation involved 100 trials for each of 100 shovel sizes, or 10,000 results.
Add theme_classic()
... + theme_classic()
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count/sum(count)))) + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format()) + theme_classic()
The percentage of red beads in the urn is about 34%, with a 95% credible interval from 21% to 47%.
Finally, add title
, subtitle
, labels for x
and y
axes.
Reminder: This is what your plot should look like.
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count/sum(count)))) + labs(title = expression(paste("Posterior Distribution for ", rho)), subtitle = "There is a 95% chance for a value between 21% and 47%.", x = expression(paste("Proportion, ", rho, ", of Red Beads in Urn")), y = "Probability") + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format()) + theme_classic()
... + labs(title = expression(paste("Posterior Distribution for ", rho)), subtitle = "...", x = expression(paste("Proportion, ", rho, ", of Red Beads in Urn")), y = "...")
tibble(obs = rnorm(10000, mean = 0.34, sd = 0.067)) |> ggplot(aes(x = obs)) + geom_density(aes(y = after_stat(count/sum(count)))) + labs(title = expression(paste("Posterior Distribution for ", rho)), subtitle = "There is a 95% chance for a value between 21% and 47%.", x = expression(paste("Proportion, ", rho, ", of Red Beads in Urn")), y = "Probability") + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format()) + theme_classic()
Note that this graph is not precisely accurate: we neither calculated our posterior distribution for $p$ by hand, as in Chapter 2, nor used an R package, as in Chapter 4. However, the purpose of this chapter is to develop your intuition. Generating a 95% confidence interval using two times the standard error, along with a normality assumption, is good enough.
Write a few sentences summarizing your work to answer the question with which we began.
question_text(NULL, message = "Using a sample of 50 beads drawn from an urn in which all beads are either red or white, we seek to determine the proportion of beads which are red. Our estimate would be flawed if someone has changed the beads in the urn since we drew our sample. We model the color of an individual bead as a draw from a Bernoulli distribution with unknown parameter. The percentage of red beads in the urn is about 34%, with a 95% credible interval from 21% to 47%.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
This tutorial covered Chapter 3: Sampling of Preceptor’s Primer for Bayesian Data Science: Using the Cardinal Virtues for Inference by David Kane.