library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.storage="local")  

Confirm Correct Package Version

Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."

packageVersion('PPBDS.data')

The answer should be ‘0.3.2.9007’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:

remove.packages('PPBDS.data')  
library(remotes)  
remotes::install_github('davidkane9/PPBDS.data')  

Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.

Name

question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Email

question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Flipping a coin

Exercise 1

Let's make sure we understand the important vocabulary.

question_text("In a sentence, explain the difference between an empirical distrubtion and a posterior distribution.",
              answer(NULL, correct = TRUE), 
              incorrect = "Submitted.",
              message = "A posterior distribution is based on what you believe the distribution of some data will be, whereas an empiral distribution is the distribution of some actual data.")

Exercise 2

question_text("How many times must you conduct the experiment to create a mathemetical distribution?",
              answer(NULL, correct = TRUE), 
              incorrect = "Submitted.",
              message = "A mathematical distribution requires no experimental data because it relies solely on mathematical formulas. Thus, you don't need to conduct the experiment at all before creating a mathematical distribution.")

Exercise 3

Let's start by examining the standard, unweighted coin: i.e. $p_h = p_t = 0.5$.

The function rbinom() is used to calculate a binomial distribution (when there are only two possible outcomes) Look at the following code to see how we can simulate a single coin flip. Try running the code multiple times---see if you can estimate approximately how many times you get each outcome.

rbinom(n = 1, size = 1, prob = 0.5)

Exercise 4

In the code chunk below, change our default rbinom() statement such that the number of observations (the n parameter) is equal to 100.

rbinom(n = 1, size = 1, prob = 0.5)

Exercise 5

This time, change the number of trials (the size parameter) to 100. See if you can figure out the difference between observations and trials in the rbinom() function.

rbinom(n = 1, size = 1, prob = 0.5)

Exercise 6

Use tibble() to create a new tibble. The tibble should have one column, head, equal to a binomial distribution with 100 observations of a fair coin (prob = 0.5) and 1 trial (i.e. 1 flip) per observation.


tibble(head = rbinom(...))

Exercise 7

Pipe your tibble into ggplot() to make a histogram using geom_histrogram(). Map head to the x-axis and set binwidth to .5.


tibble(head = rbinom(n = 100, size = 1, prob = 0.5)) %>%
  ggplot(mapping = aes(x = head)) +
    geom_histogram(binwidth = ...) +

Exercise 8

We can use a scale_x_continuous function, along with the breaks and labels parameters, to reset the tick marks on the x-axis. Run the following code to see how to reset our graph's tick mark labels at x = 0 and x = 0.5.

tibble(head = rbinom(n = 100, size = 1, prob = 0.5)) %>%
  ggplot(mapping = aes(x = head)) +
    geom_histogram(binwidth = 0.5) +
    scale_x_continuous(breaks = c(0, 0.5), labels = c("Heads", "Random Label With No Bar"))

Exercise 9

Use scale_x_continuous() like we did in the previous code chunk to reset the tick mark labels at x = 0 and x = 1 to "Heads" and "Tails."

tibble(head = rbinom(n = 100, size = 1, prob = 0.5)) %>%
  ggplot(mapping = aes(x = head)) +
    geom_histogram(binwidth = 0.5)
tibble(head = rbinom(n = 100, size = 1, prob = 0.5)) %>%
  ggplot(mapping = aes(x = head)) +
    geom_histogram(binwidth = 0.5) +
    scale_x_continuous(breaks = ..., labels = ...)

If we ran the code above 1,000 times, we'd probably get a result fairly close to 50 heads and 50 tails. However, sometimes (just from random chance), there would be significantly more heads than tails, or vice versa. But exactly how often would each unique distribution occur?

Exercise 10

We're now going to make a different kind of empirical distribution. Start by using tibble() again to make a tibble. The sole column of the tibble should be heads, equal to the result of an rbinom() statement with 1,000 observations of a fair coin, 100 trials (flips) per observation, and 0.5 probability.


tibble(heads = ...)

Exercise 11

Pipe your tibble into ggplot() to make a histogram with heads mapped to the x-axis. Set the binwidth` of the histogram to 1.


Great work! This empirical distribution shows that while there is a significant amount of variance in the number of heads in 100 coin flips, nearly all of the observations have greater than 40 heads but less than 60 heads.

Rolling two dice

Let's now look at the probability distributions of one or more dice rolls. We no longer can use rbinom() to make a binomial distribution because there are six outcomes, not two.

Exercise 1

Recall from chapter 2 that we use the sample() function to generate random values. Run the code chunk below to see how we can generate ten random coin tosses using sample(). Remember to set replace equal to TRUE when you want a certain outcome (i.e. "heads" or "tails" in this case) to be achievable more than once.

sample(x = c("heads", "tails"), size = 10, replace = TRUE)

Exercise 2

Using sample(), simulate 10 dice rolls by setting the x argument to the vector of integers from 1 through 6. Note that you don't actually need to use the c() function in this case.


sample(x = c(..., ...), size = 10, replace = TRUE)
# 1:6 is equivalent to c(1, 2, 3, 4, 5, 6).

Exercise 3

Next, to make 10 simulations of the sum of two dice rolls, simply add together two of the sample() statements that you made in the last exercise.


# Just add the sample statement from the previous exercise to itself.

Exercise 4

Use tibble() to create a new tibble with one variable, sum, equal to the sum of the two sample() statements that you made in the last exercise. Remember to copy and paste your code from the previous exercise!


tibble(sum = ...)

Exercise 5

Add a ggplot() call, using a pipe, to create a histogram with sum mapped to the x-axis. Change geom_histogram()'s binwidth argument to 1 and its color argument to "white."


Normalized distributions

The y-axes of the graphs in this tutorial are slightly different than the y-axes of the graphs in the textbook. This is because the graphs in the textbook are normalized.

question_text("In a sentence, explain the difference between normalized and unnormalized distributions.",
              answer(NULL, correct = TRUE), 
              incorrect = "Submitted.",
              message = "A normalized distribution will display the probability of each outcome on the y-axis, whereas an unnormalized distribution will display the count of the number of times that outcome occurs.")

To make a normalized distribution, we'll need to manipulate the count variable of our histograms. Unfortunately, the count variable is created by ggplot2...not in our data set!

Exercise 1

To see why this is a problem, try running the following code, which attempts to make a histogram of dice rolls and then see how then see how the count for each roll differs from the average count. Then, continue onto the next exercise to see how we can fix the error.

tibble(roll = sample(x = 1:6, size = 20, replace = TRUE)) %>%
  ggplot(mapping = aes(x = roll, y = count / mean(count))) +
    geom_histogram(binwidth = 1, color = "white")

Exercise 2

To tell R that count is not actually in our data set but is created by geom_histogram(), we need to surround the variable by two dots on each side (..count..). Use this technique to adjust the code below so that there is no error.

tibble(roll = sample(x = 1:6, size = 20, replace = TRUE)) %>%
  ggplot(mapping = aes(x = roll, y = count / mean(count))) +
    geom_histogram(binwidth = 1, color = "white")

Exercise 3

Finally, copy and paste your code from the previous exercise. This time...divide each count by the sum() of the counts instead of the mean() of the counts.


dice_roll <- tibble(roll = sample(x = 1:6, size = 20, replace = TRUE)) %>%
  ggplot(mapping = aes(x = roll, y = ...)) +
    geom_histogram(binwidth = 1, color = "white")

Exercise 4

The code below makes the graph of the sum from rolling two dice 100 times. Notice how the graph currently shows counts, not probabilities, on the y-axis.

Use the technique from the previous exercise to normalize the distribution.

tibble(sum = sample(x = 1:6, size = 100, replace = TRUE) +
                           sample(x = 1:6, size = 100, replace = TRUE)) %>%
  ggplot(mapping = aes(x = sum)) +
    geom_histogram(binwidth = 1, color = "white") +
    scale_x_continuous(breaks = 2:12, labels = 2:12) +
    theme_classic() +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from Rolling Two Dice One Hundred Times",
         x = "Sum of Two Dice")
tibble(sum = sample(x = 1:6, size = 100, replace = TRUE) +
                           sample(x = 1:6, size = 100, replace = TRUE)) %>%
  ggplot(mapping = aes(x = sum, y = ...)) +
    geom_histogram(binwidth = 1, color = "white") +
    scale_x_continuous(breaks = 2:12, labels = 2:12) +
    theme_classic() +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from Rolling Two Dice One Hundred Times",
         x = "Sum of Two Dice")

Two models

Exercise 1

The classic example of two models is for disease testing. Suppose that 1% of a city with population 10,000 has a deadly disease. If you have the disease, the test will correctly diagnose you 99% of the time. If you don't have the disease, the test will correctly diagnose you 95% of the time. The question we want to answer is: What is the probability that you have the disease, given that you test positive?

Call an rbinom() statement with 10,000 observations, 1 trial per observation, and prob equal to 0.01. This will make a vector of length 10,000 such that every person infected by the disease (probability 0.01) is represented by the number 1, and every person not infected is represented by the number 0.


# Remember that n is the number of observations and size is the number of trials per observation.
rbinom(n = ..., size = ..., prob = ...)

Exercise 2

Using tibble(), create a tibble with one column, has_disease. has_disease should be equal to the result of the rbinom() statement from the previous exercise.


tibble(...)

Exercise 3

Copy and paste your answer from above. Start a pipe and mutate() a new variable, test_positive. test_positive should be equal the result of an if_else() statement with the condition being that has_disease is equal to 1. For now, if the condition is true, set test_positive to 1, but if the condition is false, set test_positive to 0.


tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(...,
                                 true = 1,
                                 false = 0))

Exercise 4

According to our current model, every person who has the disease automatically tests positive, and vice versa. To add the possibility of a false positive or false negative, we need to add two more rbinom() statements inside our if_else() statement.

Use the model I gave you in the true case to add an rbinom() statement to the false case. The rbinom() statement should have 10,000 observations, 1 trial per observation, and a probability of testing positive equal to 0.05 (since the test correctly diagnoses a negative result for an uninfected person 95% of the time).

tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(condition = has_disease == 1,
                                 true = rbinom(n = 10000, size = 1, prob = 0.99),
                                 false = 0L))
tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(condition = has_disease == 1,
                                 true = rbinom(n = 10000, size = 1, prob = 0.99),
                                 false = rbinom(...)))

Exercise 5

Awesome work! We now have a tibble that uses randomness to create a model of our disease.

Use the assignment operator (<-) to save the tibble from the previous exercise to disease_testing.


Exercise 6

Add a ggplot() call to make a jittered scatterplot that takes in disease_testing as its data, maps test_positive to the x-axis, and maps has_disease to the y-axis.

disease_testing <- tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(condition = has_disease == 1,
                                 true = rbinom(n = 10000, size = 1, prob = 0.99),
                                 false = rbinom(n = 10000, size = 1, prob = 0.05)))
disease_testing <- tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(condition = has_disease == 1,
                                 true = rbinom(n = 10000, size = 1, prob = 0.99),
                                 false = rbinom(n = 10000, size = 1, prob = 0.05)))

ggplot(...) +
  ...

Exercise 7

This empirical distribution graph makes it pretty clear that you are more likely not to have the disease even if you test positive. However, it's hard to tell exactly how much likelier. For that, we're going to need a histogram.

Start by making a new pipe with disease_testing and filter() the pipe to just the rows such that test_positive is equal to 1.

disease_testing <- tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(condition = has_disease == 1,
                                 true = rbinom(n = 10000, size = 1, prob = 0.99),
                                 false = rbinom(n = 10000, size = 1, prob = 0.05)))

disease_testing
disease_testing <- tibble(has_disease = rbinom(n = 10000, size = 1, prob = 0.01)) %>%
  mutate(test_positive = if_else(condition = has_disease == 1,
                                 true = rbinom(n = 10000, size = 1, prob = 0.99),
                                 false = rbinom(n = 10000, size = 1, prob = 0.05)))

disease_testing %>%
  ...

Exercise 8

Add a ggplot() call to the pipe to make a histogram that maps the has_disease variable to the x-axis. Also, set the binwidth of the histogram to 0.5.


Exercise 9

Use scale_x_continuous() just like we did in the joint distribution graph to reset the tick mark labels on the x-axis. Again, we want the breaks to be 0 and 1 and the labels to be "Negative" and "Positive."


Exercise 10

Next, use the same two-dot technique that we used in the previous section to normalize the distribution (we want to divide the count of each bar by the sum() of the counts of all the bars).


# Remember to replace "count" with "..count.." whenever you use it.
disease_testing %>%
  filter(test_positive == 1) %>%
  ggplot(mapping = aes(x = has_disease, y = ...)) +
    geom_histogram(binwidth = 0.5) +
    scale_x_continuous(breaks = 0:1, labels = c("Negative", "Positive"))

Great work! From this graph, we can see that you have between an 80% and 85% chance of not having the disease, even after you test positive.

Submit

submission_ui
submission_server()


davidkane9/PPBDS.data documentation built on Nov. 18, 2020, 1:17 p.m.