library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
library(rstanarm)
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.storage="local")  


# This is for the tibble `our_data`

ID <- c(1:50)
Sex <- rep(c("male", "female"), 25) 
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33, 
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36, 
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25, 
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25, 
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21) 
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173, 
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482, 
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752, 
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567, 
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
our_data <- tibble(ID, Sex, Age, Score)

ts <- tibble(id = 1:4, 
             test1 = 10:13, 
             test2 = 20:23,
             test3 = 30:33) 

Confirm Correct Package Version

Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."

packageVersion('PPBDS.data')

The answer should be ‘0.3.2.9009’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:

remove.packages('PPBDS.data')  
library(remotes)  
remotes::install_github('davidkane9/PPBDS.data')  

Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.

Name

question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Email

``` {r email, echo=FALSE} question_text( "Email:", answer(NULL, correct = TRUE), incorrect = "Ok", try_again_button = "Modify your answer", allow_retry = TRUE )

## Preliminaries 

In this chapter, there were several functions thrown out there that might need a refresher or a proper introduction. Let's quickly familiarize ourselves with them so we are ready to take on the exercises that follow!

###

### Exercise 1

`drop_na()`removes rows with any missing values from a data set. Run `glimpse ()`to look at the `nhanes` data set. You will see scattered throughout the data set there are NA values.

```r

Exercise 2

Remove these NA's by creating a pipe and using drop_na().


nhanes %>% 
  drop_na()

Exercise 3

matrix() creates a matrix from the given set of values. A matrix is a rectangular array of data, shaped like a data frame or tibble, but containing only one type of data.

The following code below assigns the crated matrix to object m. Type "m" below the creation step, and run the code to examine the object.

m <- matrix(c(3, 4, 8, 9, 12, 13), ncol = 2)

Exercise 4

Consider this tibble, containing the test scores for three students on four tests.

ts <- tibble(id = 1:4, 
             test1 = 10:13, 
             test2 = 20:23,
             test3 = 30:33) 

We want to create a new column, called avg which is the average score, for the four students, on each exam. Run this code which tries to do that:

ts %>% 
  mutate(avg = mean(c(test1, test2, test3)))

Argg! That fails! We have just calculated a single number --- the mean for all exams for all students --- and then assigned it to avg. We need something which allows us to calculate something across the rows of the tibble.

Exercise 5

rowwise(), and helper functions like c_across(), allow you to compute on a tibble row by row. Run the code chunk below to seerowwise() do its job!

ts %>% 
  rowwise() %>% 
  mutate(avg = mean(c_across(test1:test3)))

Exercise 6

c_across() is designed to work alongside rowwise(). It allows you to easily select multiple variables. Calculate the sum and standard deviation of the test scores for each student.


ts %>%
  rowwise() %>%
  mutate(sum = ...,
         sd = ...)
ts %>%
  rowwise() %>%
  mutate(sum = sum(c_across(...)),
         sd = sd(c_across(...)))

Exercise 7

The function ungroup() removes grouping done by the group_by() function. Run the code below to see the tibble our_data, which has been grouped by sex.

Remember: group_by() does not change how the data looks apart from listing how the tibble is grouped!

# This is for the tibble `our_data`

ID <- c(1:50)
Sex <- rep(c("male", "female"), 25) 
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33, 
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36, 
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25, 
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25, 
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21) 
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173, 
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482, 
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752, 
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567, 
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)
our_data<- tibble(ID, Sex, Age, Score)
our_data  %>%
  group_by(Sex)

Exercise 8

Great! Now we are going to show why using ungroup() is important. Let's say we wanted to calculate the average age and score of males and females.

Notice that ungroup() has been placed at the end of both calculations. Run the following code.

our_data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>% 
  mutate(x = mean(Score)) %>%
  ungroup()

This code shows that the mean age (m), is 29.2 for males and 28.96 for females. It also shows us that the mean score (x) is 0.487 for males and 0.437 for females. For both calculations, the data is grouped by Sex and separate from one another.

Exercise 9

Instead of placing ungroup() at the end, we are going to place it in between the two mutate() functions.

our_data %>%
  group_by(Sex) %>%
  mutate(m = mean(Age)) %>% 
  ungroup()%>% 
  mutate(x = mean(Score)) 

Here our Mean age (m), is still 29.2 for males and 28.96 for females. However, our mean score (x) is .462 for every row. Because group_by(Sex) is removed by ungroup() after the first mutate() function, the mean score is calculated for all participants together ,which is what we do not want! Therefore, our first placement of ungroup() is correct.

Exercise 10

The function after_stat()replaces the old approaches of using either stat() or surrounding the variable names with ...

Run the following code chunk.

ggplot(mpg, aes(displ)) +
  geom_histogram(aes(y = after_stat(count/sum(count))))

Exercise 11

Now we are going to explore different ways of referring to columns. Lets refresh our memory of the function select(). Create a pipe andselect() test1 from the tibble ts.

ts <- tibble(id = 1:4, 
             test1 = 10:13, 
             test2 = 20:23,
             test3 = 30:33) 
ts 
ts %>%
  select("..")

Exercise 12

We can also select() multiple variables using different methods. Let's say we wanted to select test1, test2, and test3. One way would be to type out all the names. Run the code to see this approach in action.

ts %>%
  select(test1, test2, test3)

Exercise 13

However, being the tech wizards that we are, we want to make things as simple as possible. Recall the : operator, which creates a shortcut! Run the following code.

ts %>%
  select(test1:test3)

Exercise 14

Awesome! We can also use[ ] to extract columns. Extract the first column of the trains data set using this method.

Note: When you use [ ] , place a comma before the number column you want to extract.


trains [, ...]

NHANES

Let's create this graph.

ch7 <- nhanes %>% 
  select(age, gender, weight, survey) %>%
  filter(age >= 18) %>% 
  drop_na()

ch7 %>%
  ggplot(aes(x = weight, color = gender)) + 
  geom_density() + 
  labs(x = "Weight",
       title = "Weight by Gender in Nhanes Dataset")

Exercise 1

Create a pipe using the nhanes dataset and select the variables age, gender, weight, survey.


Exercise 2

Great! Now continue your pipe. filter age so it refers to only adults (age equal to or above 18) and apply drop_na().


nhanes %>% 
  select(age, gender, height, survey) %>%
  filter(... >= ...) %>% 
  ...

Exercise 3

Excellent. Save the work you have done so far to an object called ch7.


Exercise 4

Now let's pipe in code to create a graph using ggplot()and geom_density. Map the x-axis to weight and the color aesthetic to gender. Label your x-axis "Weight" and title your graph "Weight by Gender in NHANES Dataset".


ch7 %>%
  ggplot(aes(x = ..., color = ....)) + 
  geom_density() + 
  labs(x = ...,
       title = ...)

Wisdom

Consider the graph below.

ch7 <- nhanes %>% 
  select(age, gender, weight, survey) %>%
  filter(age >= 18, gender == "Female") %>% 
  drop_na()

ch7 %>%
  ggplot(aes(x = weight, color = gender)) + 
  geom_density() + 
  labs(x = "Weight",
       title = "Female Weight in NHANES Data")

Exercise 1

As you can tell, the graph above displays "Female Weight by Gender in NHANES Data".

Let's say our motive behind generating this graph was to answer the question: What is the probability that the next female adult we meet will weigh more than 100 pounds? Using the concept of Wisdom, write a paragraph about whether or not this data is relevant for the problem we face. See The Primer for guidance.

question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Justice

Exercise 1

$$ y_i = \mu + \epsilon_i $$

Recall the mathematical relationship above. What are the two unknown parameters here? You do not need to figure out how to display the symbols in your answer, just write their names, like "phi," mu," "sigma," "epsilon," "delta," and so on.

question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Exercise 2

Great! Now, in your own words, write two sentences describing what the parameters mean.

question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Exercise 3

Now write two sentences about which parameter we care most about and why.

question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Courage

Let's create a Bayesian model of height.

# Creation of objects needed for this group of questions.

ch7_female <- nhanes %>%
  filter(survey == 2009, gender == "Female", age >= 18) %>%
  select(height) %>%
  drop_na()

fit_obj <- stan_glm(data = ch7_female,
                    height ~ 1, 
                    family = gaussian(), 
                    refresh = 0)

pp <- posterior_predict(fit_obj)

Exercise 1

Run the following code.

ch7_female <- nhanes %>%
  filter(survey == 2009, gender == "Female", age >= 18) %>%
  select(height) %>%
  drop_na()

Exercise 2

Use stan_glm() to create a simple Bayesian model. Set data to ch7_female, family to gaussian(), and refresh to 0. The formula argument should be `height ~ 1.

library(rstanarm)
stan_glm(data = ...,
         formula = ..., 
         family = ..., 
         refresh = ...)

Now we have full posteriors for $\sigma$ and $\mu$! However, we can't show the entire distribution because this is a simple printed summary.

Exercise 3

Copy and paste your work from the last question, and assign it to an object named fit_obj.


Use the assignment operator: <- 

Exercise 4

Recall that the most important single number related to a distribution is some measure of its location. The two common measures to do such are mean and median.

In a single sentence, using your own words, describe why we use the median here.

question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Exercise 5

Now, in two sentences, use your own words to define MAD_SD.

question_text(
  "Answer:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Exercise 6

Instead of printing out the whole model like before, let's just use print to print out the parameters of the model. Set the argument detail to FALSE.


print(fit_obj, detail = ...)

Exercise 7

Awesome! Now let's create the following posterior distribution.

p_plot <- fit_obj %>% 
  as_tibble() %>% 
  rename(mu = `(Intercept)`) %>% 
  ggplot(aes(x = mu)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), 
                   binwidth = 0.01, 
                   color = "white") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "Average height among American adult women in 2009",
         x = "Height in Centimeters",
         y = "Probability") +
    theme_classic()

p_plot

First, create a pipe starting with fit_obj and then continuing with as_tibble().


fit_obj %>% 
  ...

Exercise 8

Cool. Now copy and paste your work from the previous question and continue the pipe. Use rename() and rename (Intercept) as mu. Recall that column names can be anything you want. But, if you use weird things --- like a paranthesis --- then you need to put backticks around the column names. Since that is a bother, we prefer column names like mu to column names like (Intercept).


fit_obj %>% 
  as_tibble() %>%
  rename(mu = `(Intercept)`) 

Exercise 9

Continue the pipe and use ggplot(). Map x to mu in the aes argument. Use geom_histogram() and map y to theafter_stat() function. Set its argument to count divided by the sum of count.


after_stat(count/sum(count)) accomplished the same normalization step 
which we did by hand in previous chapters using dplyr.
fit_obj %>% 
  as_tibble() %>% 
  rename(mu = `(Intercept)`) %>% 
  ggplot(aes(x = mu)) +
    geom_histogram(afterstat(.../sum(...)))

Exercise 10

Also, set binwidth to 0.01 and color to white within geom_histogram().


Exercise 11

Title your histogram "Posterior Probability Distribution" with the subtitle, "Average height among American adult women in 2009". Also name the x-axis "height" and the y-axis "probability". Add a layer and use theme_classic().


Temperance

Exercise 1

We have a model of female height in 2009. What can we do with it? Let's use it to answer the question: What is the probability that the next adult female we meet will be taller than 180 centimeters?

The below code chunk is from earlier. Add posterior_predict() and have its argument be fit_obj.

fit_obj <-stan_glm(data = ch7_female, 
                    height ~ 1, 
                    family = gaussian(), 
                    refresh = 0)
posterior_predict(...)

Exercise 2

Now save your work to an object called pp.


pp <- posterior_predict(...)

Exercise 3

pp has 4,000 rows. This is because stan_glm() by default gives us 4,000. Let's put the posterior predictions from the first column of pp into a tibble for convenience. Note that all the other columns in pp are also draws from the same posterior.

Create a tibble with one variable, pred, which is set equal to the first column of pp.


tibble(pred = pp[...])
Remember you need to type a comma and then type the number 
of the column you want inside the brackets of `pp`. 

Exercise 4

Continue the pipe. Use mutate and create a new variable gt_180. gt_180 should use an ifelse statement. Within ifelse, set prediction to a value greater than 180. Also, include both TRUE and FALSE arguments.


tibble(pred = pp[, 1]) %>% 
  mutate(... = ifelse(pred > ..., TRUE, FALSE))

Exercise 5

Lets continue the pipe even more. Use summarize to make the answer equal to the sum of gt_180 divided by the the function n().


tibble(pred = pp[, 1]) %>% 
  mutate(gt_180 = ifelse(pred > 180, TRUE, FALSE)) %>% 
  summarize(... = sum(...) / ...)

Submit

submission_ui
submission_server()


davidkane9/PPBDS.data documentation built on Nov. 18, 2020, 1:17 p.m.