library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
library(skimr)
library(infer)
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.storage="local")  

# Key Data

set.seed(9)
urn <- tibble(color = c(rep("red", 925), rep("white", 575))) %>%
  slice_sample(prop = 1)

use_shovel <- function(x, size, reps){
  x %>% rep_sample_n(size = size, reps = reps)
}

prop_red <- function(x, size, reps){
  use_shovel(x = x, size = size, reps = reps) %>% 
  group_by(replicate) %>% 
  summarize(red = sum(color == "red"), .groups = "drop_last") %>% 
  mutate(prop_red = red/size)
}

## Confirm Correct Package Version

Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."

packageVersion('PPBDS.data')

The answer should be ‘0.3.2.9008’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:

remove.packages('PPBDS.data')  
library(remotes)  
remotes::install_github('davidkane9/PPBDS.data')  

Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.

## Name

question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

## Email

question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

## The "virtual shovel"
###

We have created a virtual urn containing an unknown number of red beads and white beads. Our goal will eventually be to estimate the total number of red beads in the urn. 

For this portion of the tutorial, we will be replicating the graph below. The goal is to build this graph step-by-step, using the knowledge from The Primer's Chapter 6. 

```r
urn %>%
  rep_sample_n(size = 20, reps = 50, replace = TRUE) %>%
  mutate(is_red = (color == "red"), 
         is_white = (color == "white")) %>%
  summarize(num_red = sum(is_red), 
            num_white = sum(is_white),
            .groups = "drop") %>%
  mutate(prop_red = num_red / (num_red + num_white)) %>%
  mutate(estimated_red = prop_red * 1500) %>%
  ggplot(mapping = aes(x = estimated_red)) +
    geom_histogram(binwidth = 75, color = "white") +
    scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2)) +
    scale_x_continuous(breaks = seq(from = 0, to = 1500, by = 100)) +
    labs(title = "Range of Estimates Using Shovel of Size 20",
         x = "Estimated number of red beads")
```

Exercise 1

To view the urn, simply type urn in the window below and hit Run Code.


Exercise 2

It appears that there are 1,500 beads in the urn (i.e. 1,500 rows in our tibble). Run nrow() on urn to confirm the number of beads in the urn.


Exercise 3

To simulate the process of real-world sampling, where we generally won't know exactly how many items are in our population, pipe urn into rep_sample_n() from the infer package, with a shovel size of 20 (i.e. the sample should have 20 observations).


urn %>%
  rep_sample_n(size =...)

Exercise 4

Copy and paste your code from the previous exercise. Continuing your pipe, use mutate() to create two new variables: is_red should be a logical equal to the result of asking whether color is "red", and is_white should be defined the same way, but for the color "white".


is_red should be equal to (color == "red").

Exercise 5

Next, use summarize() to create two more variables. num_red should be equal to the sum() of the is_red column, and num_white should be equal to the sum() of the is_white column.


summarize(num_red = ..., num_white = ...)

Exercise 6

Call mutate() to create the variable prop_red, the proportion of the sampled beads that are red. Set prop_red equal to the number of red beads divided by the sum of the number of red beads and white beads.


The proportion of red beads is num_red / (num_red + num_white).

Exercise 7

Finally, use mutate() to create the estimated_red variable, our estimate of the number of red beads in the urn. estimated_red should be equal to prop_red multiplied by 1,500, the total number of beads in the urn.


The estimated_red variable should be equal to (prop_red * 1500).

Exercise 8

Nice work! Running the code below gives us a way to virtually sample our urn in 20-bead increments. Notice that running the code multiple times gives us different estimates for the number of red beads in the urn. Each time we re-run the code, rep_sample_n() grabs a different sample from the urn.

Our next step will be to use our "virtual shovel" 50 times and create a distribution of our results. Start by setting the reps argument of rep_sample_n() to 50. Up until now, the default value of reps has been 1. By setting reps to 50, we are telling our machine to sample 20 beads from the urn fifty times. Also set the replace argument to TRUE, so that we sample with replacement, which is what we want. Finally, set the .groups argument of summarize() to "drop" to silence the grouping message. A sketch of the two calls that change appears after the code below.

urn %>%
  rep_sample_n(size = 20) %>%
  mutate(is_red = (color == "red"), is_white = (color == "white")) %>%
  summarize(num_red = sum(is_red), 
            num_white = sum(is_white)) %>%
  mutate(prop_red = num_red / (num_red + num_white)) %>%
  mutate(estimated_red = prop_red * 1500)
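Just as a rough sketch (one way to write them, not the only way), the two calls that change could look like this:

```r
# Draw 20 beads at a time, 50 separate times, sampling with replacement.
urn %>% 
  rep_sample_n(size = 20, reps = 50, replace = TRUE)

# And, inside summarize(), adding .groups = "drop" silences the grouping message:
# summarize(num_red = sum(is_red), num_white = sum(is_white), .groups = "drop")
```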

Exercise 9

Use ggplot() and geom_histogram() to create a histogram that maps our 50 values of estimated_red to the x-axis. Set binwidth to 75 and color to "white".


First copy and paste your code from the previous exercises.
ggplot(mapping = aes(...)) + geom_histogram(binwidth = 75, color = "white")

Exercise 10

We can use the breaks argument of a scale_ function, alongside the seq() function, to reset the location of the tick marks on either axis. Run the code below to see how we can reset the tick marks on the y-axis to the even numbers from 0 to 20.

urn %>%
  rep_sample_n(size = 20, reps = 50, replace = TRUE) %>%
  mutate(is_red = (color == "red"), 
         is_white = (color == "white")) %>%
  summarize(num_red = sum(is_red), 
            num_white = sum(is_white),
            .groups = "drop") %>%
  mutate(prop_red = num_red / (num_red + num_white)) %>%
  mutate(estimated_red = prop_red * 1500) %>%
  ggplot(mapping = aes(x = estimated_red)) +
    geom_histogram(binwidth = 75, color = "white") +
    scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2))

Exercise 11

Use scale_x_continuous() to reset the tick marks on the x-axis to the multiples of 100 between 0 and 1,500. Copy and paste the code from the previous exercise and add this new layer, just as we adjusted the y-axis.


The "by" argument of your seq() function should be 100.

Exercise 12

Great job! By running the code below a few times, we see that the number of red beads in the urn is probably somewhere between 700 and 1,100.

To finish this graph, use labs() to set the x-axis label to "Estimated number of red beads". Make sure reps is still set to 50.
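A minimal sketch of the finishing touch:

```r
# Added to the end of the plot with +.
labs(x = "Estimated number of red beads")
```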


## Different Sample Sizes

Imagine we have three choices of shovels to extract a sample of beads with: shovels of size 30, 60, and 120.

Our eventual goal will be to create this graph, which shows the spread of results for our three different shovel sizes. Remember: the lower the spread (standard deviation), the more precise our estimate.

shovels <- tibble(shovel_size = c(30, 60, 120))

shovels <- shovels %>% 
  mutate(use_shovel_results = map(shovel_size,
                                  ~ use_shovel(x = urn,
                                               size = .x,
                                               reps = 1000))) %>% 
  mutate(prop_red_results = map(shovel_size,
                                ~ prop_red(x = urn,
                                            size = .x,
                                            reps = 1000)))



shovels_plot <- shovels %>%
  unnest(prop_red_results) %>%
  ggplot(aes(x = shovel_size, y = prop_red)) +
  geom_point() +
  labs(x = "Shovel size",
       y = "Proportion of red beads")

shovels_plot

Exercise 1

Create the object virtual_samples_30 by performing rep_sample_n on our object urn with a size of 30 and reps of 1000.


virtual_samples_30 <- urn %>%
  rep_sample_n(size = ..., reps = ...)

Exercise 2

To perform the above code three more times, let's create a function called use_shovel. This function draws a sample of a specified size with a specified number of reps. Remember that we do not want to hard-code size and reps, so that we can use the function for any shovel size and any number of reps.

virtual_samples_30 <- urn %>%
  rep_sample_n(size = 30, reps = 1000)

use_shovel <- function(x, size, reps){
  x %>% ...
}
use_shovel <- function(x, size, reps){
  x %>% 
    rep_sample_n(size = size, ...)
}

Exercise 3

To test our function, call use_shovel() with x = urn, size = 30, and reps = 2.


use_shovel(x = urn, ..., ...)

Exercise 4

Now we need a function that will calculate the proportion of beads that are red. To accomplish this, we will use use_shovel() inside another function, which we will call prop_red. The first line of prop_red should call use_shovel() as we created it earlier. The next parts should group the results by replicate, summarize to create red, the number of beads where color == "red", and use mutate to create prop_red, the number of red beads divided by size.


prop_red <- function(x, size, reps) {
  use_shovel(x = x, ..., ...) %>% 
  group_by(...) %>% 
  summarize(red = sum(...), .groups = ...) %>% 
  mutate(prop_red = .../size)
}
Read "Functions are your friend!". The answer can be found there. 

Exercise 5

Now, let's test our new function prop_red by drawing from our urn with a size of 30 and 2 repetitions.


prop_red(x = urn, size = ..., reps = ...)

Exercise 6

Now that we have confirmed that both the prop_red and use_shovel functions are working properly, let's begin making our graph.

To construct this graph, we will be using the map() function. This saves us the time of making three different objects (for shovels of size 30, 60, and 120). Start by creating the object shovels, a tibble with a shovel_size column containing 30, 60, and 120.


shovels <- tibble(shovel_size = c(...))

Exercise 7

We will now modify the object shovels by using mutate() to create the column use_shovel_results. use_shovel_results should map() over shovel_size, applying our function use_shovel with x = urn, size = .x, and reps = 1000. If this sounds confusing, make sure you've looked at "Functions are your friends" carefully.


shovels <- shovels %>% 
  mutate(use_shovel_results = map(shovel_size,
                                  ~ use_shovel(x = ...,
                                               size = ...,
                                               reps = ...)))

Exercise 8

To add another column, copy the above code and call another mutate() to create the column prop_red_results. Like our last exercise, we want to map() over shovel_size, but this time with our function prop_red to get the proportion of red beads. We will use the same arguments for prop_red as we did for use_shovel: x = urn, size = .x, and reps = 1000.


shovels <- shovels %>% 
  ... %>% 
  mutate(prop_red_results = map(shovel_size,
                                ~ prop_red(...)))

Exercise 9

Now that we have finished modifying our object shovels, let's construct our graph using ggplot! Recall that shovels, like any tibble with list columns, may require some manipulation before you can plot it. The key command is often unnest(), which splits up the specified column so that each element gets its own row.

Start by calling shovels and then calling unnest to split up the observations. Look at the resulting tibble so that you understand it. (Hint: There are some new columns.) Then, use ggplot with the shovel_size as the x-axis and prop_red as the y-axis. Add geom_point.


shovels %>% 
  unnest(...) %>% 
  ggplot(aes(x = ..., y = ...)) +
  geom_point()

Exercise 10

Fantastic work! To finish this graph, use labs() to label the x and y axes. Here is our graph for reference:

shovels_plot
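For reference, a sketch of the labels used in shovels_plot above:

```r
# Added to the plot with +.
labs(x = "Shovel size",
     y = "Proportion of red beads")
```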

## Standard deviation

In the last section, we figured out that an increase in sample size leads to greater precision in the sampling distribution. But how can we mathematically measure this increase in precision? The standard deviation of a set of values gives us a mathematical framework for measuring precision---reread chapter 2 in the textbook if you need clarification.

Exercise 1

In the code chunk below, run sd() (the function that returns standard deviation) on the vector c(1, 2, 3, 4, 5).

sd(c(1, 2, 3, 4, 5))
# Remember that the vector above is equal to 1:5.

Exercise 2

Now calculate the standard deviation of the vector c(143, 144, 145, 146, 147). See if you can predict what the result will be before you run the code.

sd(c(143, 144, 145, 146, 147))
sd(c(...))

Exercise 3

Calculate the standard deviation of the vector c(2, 3, 4) and compare the result to the results from the previous two exercises.

sd(c(2, 3, 4))

Exercise 4

Finally, calculate the standard deviation of c(1, 1, 1, 5, 5) and compare the result to the result of the first exercise.

sd(c(1, 1, 1, 5, 5))
sd(c(...))

Exercise 5

Let's recap the seq() function. Run the code below to see how we can return all multiples of 3 between 21 and 48.

seq(from = 21, to = 48, by = 3)

Exercise 6

The code below contains our final work from the previous section, excluding the ggplot().

Because we want to be more precise than in the previous exercises, let's look at every shovel size from 1 to 100, instead of just 30, 60, and 120. Instead of starting our pipe with shovels, let's start by creating a tibble that redefines shovel_size as all the values between 1 and 100.

shovels <- shovels %>% 
  mutate(use_shovel_results = map(shovel_size,
                                  ~ use_shovel(x = urn,
                                               size = .x,
                                               reps = 1000))) %>%
  mutate(prop_red_results = map(shovel_size,
                                ~ prop_red(x = urn, 
                                           size = .x, 
                                           reps = 1000)))
Replace "shovels" with tibble(shovel_size = 1:100)

Exercise 7

Now we need to use mutate() to create a column that takes the standard deviation of the proportion red found in prop_red_results. After copying the above work, add the column prop_red_sd using map_dbl() over prop_red_results. After the ~, you will want to use pull(.x, prop_red) to isolate the prop_red column. Finally, take the sd() of that pulled vector.

Save your new work as shovels_100 instead of the previous shovels.


shovels_100 <- tibble(shovel_size = 1:100) %>% 
  ... %>% 
  mutate(prop_red_sd = map_dbl(prop_red_results,
                               ~ pull(...) %>% ...))

Exercise 8

To visualize what we have created, perform glimpse() on shovels_100.
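The call is a single line, something like:

```r
glimpse(shovels_100)
```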


Exercise 9

We now have the standard deviation of the proportion red for every shovel size between 1 and 100. Let's use ggplot() to visualize this relationship with geom_point(). Assign shovel_size to the x-axis and prop_red_sd to the y-axis.


shovels_100 %>% 
  ggplot(aes(...)) +
  geom_point()

## Posterior distribution from sample

Finally, let's return to our question from the original section: how can we estimate the number of red beads in the urn based on the results of a sample? In the first section, we discovered that the number of red beads in the urn was probably somewhere between 700 and 1,100, but we could not come up with a better estimate. In this section, we will both improve our estimate and better simulate real-world sampling by estimating the number of red beads in the urn from the result of a single sample.

Suppose that in a single 45-bead sample from our urn, 28 of the 45 beads are red. Let's use this information to estimate the number of red beads in the urn.

Exercise 1

We're going to start by analyzing a single model: 800 of the 1,500 beads being red. Using tibble(), create a tibble with a single column, color. color should be equal to the combination (c()) of "red" 800 times and "white" 700 times.


tibble(color = c(rep(...), rep(...)))

Exercise 2

Let's now take some samples from our tibble using rep_sample_n(). The first argument of your rep_sample_n() call should be the tibble that you made in the previous exercise (remember to copy and paste!). Also set size to 45, reps to 1,000 (any reasonably large number works here), and replace to TRUE.


tibble(...) %>% 
  rep_sample_n(size = ..., reps = ..., replace = ...)

Exercise 3

Pipe what you have into group_by() to group the tibble by the replicate variable.
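A sketch of the pipe at this stage, reusing the 800-red-bead tibble and the sampling call from the previous exercises:

```r
tibble(color = c(rep("red", 800), rep("white", 700))) %>% 
  rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>% 
  group_by(replicate)
```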


Exercise 4

Use summarize() to create the variable red_in_sample, which should be equal to the sum() of the number of rows in each group with color equal to "red".


sample_given_urn <- function(...) {
  tibble(color = c(rep("red", ...), rep("white", ...))) %>% 
    rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>%
    group_by(replicate) %>%
    summarize(red_in_sample = sum(color == ...), .groups = ...)
}

Exercise 5

We now have code that makes 1,000 random samples from our 800-red-bead urn, but we want to generalize it to urns with other numbers of red beads. To do so, wrap the code above in a function named sample_given_urn that has one parameter, red_in_urn. Replace the 800 in the first rep() statement with red_in_urn and the 700 in the second rep() statement with 1,500 minus red_in_urn.


sample_given_urn <- function(...) {
  tibble(color = c(rep("red", ...), rep("white", ...))) %>% 
    rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>%
    group_by(replicate) %>%
    summarize(red_in_sample = sum(color == "red"), .groups = "drop")
}

Exercise 6

To get a sense of how our function works, call sample_given_urn() below the function definition to calculate 1,000 random samples from an urn with 650 red beads.

sample_given_urn <- function(red_in_urn){
  tibble(color = c(rep("red", red_in_urn), 
                   rep("white", 1500 - red_in_urn))) %>% 
    rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>%
    group_by(replicate) %>%
    summarize(red_in_sample = sum(color == "red"), .groups = "drop")
}
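Below the function definition, the call itself is a single line, something like:

```r
sample_given_urn(red_in_urn = 650)
```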

Exercise 7

Now, below the function definition, call tibble() to create a new tibble with one column, nums, equal to the vector containing every multiple of 25 from 650 to 1,150.

sample_given_urn <- function(red_in_urn){
  tibble(color = c(rep("red", red_in_urn), 
                   rep("white", 1500 - red_in_urn))) %>% 
    rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>%
    group_by(replicate) %>%
    summarize(red_in_sample = sum(color == "red"), .groups = "drop")
}
tibble(nums = seq(...))

Exercise 8

Pipe your tibble into a mutate() statement to create the new variable sample_results. sample_results should be equal to the result of a map() statement whose first parameter is nums and whose second parameter is sample_given_urn(nums). Don't worry about the error for the time being.


Exercise 9

To fix the error from our map() statement, remember the two rules of using map(): replace nums with .x in sample_given_urn(), and precede sample_given_urn() with the ~ symbol.
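A sketch of the corrected call, with .x inside sample_given_urn() and a ~ in front of it:

```r
tibble(nums = seq(from = 650, to = 1150, by = 25)) %>%
  mutate(sample_results = map(nums, ~ sample_given_urn(.x)))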


Exercise 10

Nice job! Unfortunately, we are now left with a list column. Just as we did in the previous sections, pipe the tibble into unnest() to convert the list column into a standard tibble. Set the cols argument of unnest() equal to sample_results.


Exercise 11

Continuing your pipe, group the tibble by nums. Then use summarize() to create the variable num_samples, which should be equal to the sum() of the number of rows in each group with red_in_sample equal to 28.

sample_given_urn <- function(red_in_urn){
  tibble(color = c(rep("red", red_in_urn), 
                   rep("white", 1500 - red_in_urn))) %>% 
    rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>%
    group_by(replicate) %>%
    summarize(red_in_sample = sum(color == "red"), .groups = "drop")
}

tibble(nums = seq(from = 650, to = 1150, by = 25)) %>%
  mutate(sample_results = map(nums, ~ sample_given_urn(.x))) %>%
  unnest(cols = sample_results)
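As a sketch, the pipe with the two new steps added at the end (grouping by nums, then counting the replicates with exactly 28 red beads) could look like this:

```r
tibble(nums = seq(from = 650, to = 1150, by = 25)) %>%
  mutate(sample_results = map(nums, ~ sample_given_urn(.x))) %>%
  unnest(cols = sample_results) %>%
  group_by(nums) %>%
  summarize(num_samples = sum(red_in_sample == 28), .groups = "drop")
```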

Exercise 12

Pipe the tibble into ggplot() to create a bar graph that maps nums to the x-axis and num_samples to the y-axis.


# Remember to use geom_col().
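As a sketch, the plotting layers added to the end of the pipe might look like this:

```r
# Added to the end of the pipe from the previous exercise.
ggplot(mapping = aes(x = nums, y = num_samples)) +
  geom_col()
```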

Exercise 13

Normalize the distribution by changing the y-axis mapping from num_samples to num_samples divided by the sum() of all num_samples. In addition, because the resulting proportions are so small, multiply num_samples by 100 before you divide by the sum() so that we get percentages instead of decimal probabilities.


ggplot(mapping = aes(x = nums, y = ... * ... / sum(...)))

Exercise 14

Fantastic job! From this graph, we can see that the most likely outcome for the number of red beads in the urn is between 900 and 975. Notably, though, there is a ton of variance in the outcome: the probability that the number of red beads is between 900 and 950 is only about 30%. The major lesson here is that a single small sample is not very predictive---you need either many samples or a very large sample size to be confident in your estimate.

To finish off this tutorial, you probably want to know how many red beads are actually in the urn. Below the ggplot(), pipe urn into a count() statement that takes color as its argument.

sample_given_urn <- function(red_in_urn){
  tibble(color = c(rep("red", red_in_urn), 
                   rep("white", 1500 - red_in_urn))) %>% 
    rep_sample_n(size = 45, reps = 1000, replace = TRUE) %>%
    group_by(replicate) %>%
    summarize(red_in_sample = sum(color == "red"), .groups = "drop")
}

tibble(nums = seq(from = 650, to = 1150, by = 25)) %>%
  mutate(sample_results = map(nums, ~ sample_given_urn(.x))) %>%
  unnest(cols = sample_results) %>%
  group_by(nums) %>%
  summarize(num_samples = sum(red_in_sample == 28), .groups = "drop") %>%
  ggplot(mapping = aes(x = nums, y = 100 * num_samples / sum(num_samples))) +
    geom_col() +
    scale_x_continuous(breaks = seq(from = 650, to = 1150, by = 50)) +
    labs(title = "Posterior Probability Distribution of Red Beads in Urn",
         subtitle = "Based on a 45-bead sample with 28 red beads",
         x = "Number of red beads",
         y = "Probability (percent)")
urn %>%
  count(...)

## Submit

submission_ui
submission_server()

