In davidkane9/PPBDS.data: Data for Preceptor's Primer for Bayesian Data Science

library(tidyverse)
library(lubridate)
library(stringr)
library(learnr)
library(skimr)
library(shiny)
library(PPBDS.data)
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage="local") 

# Needed for later sections of the tutorial 

library(fivethirtyeight)
library(nycflights13)
library(ggthemes)

Confirm Correct Package Version

Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."

packageVersion('PPBDS.data')

The answer should be ‘0.3.2.9004’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:

remove.packages('PPBDS.data')  
library(remotes)  
remotes::install_github('davidkane9/PPBDS.data')

Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.

Name

``` {r name, echo=FALSE} question_text( "Student Name:", answer(NULL, correct = TRUE), incorrect = "Ok", try_again_button = "Modify your answer", allow_retry = TRUE )

## Email
###

``` {r email, echo=FALSE}
question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)

Introduction

Purpose of this tutorial is to provide several full scale exercises, each using with a real data set --- one which requires some cleaning and organizing --- and finishing with a professional quality plot. We start each group by showing you the plot which we will replicate by the end.

National Health and Nutrition Survey

nhanes is a data set included in the PPBDS.data package. It includes data from the "National Health and Nutrition Examination Survey," which contains the personal and physical information of 10,000 Americans from two surveys in 2009 and 2011. The goal of this series of exercises is to build, line-by-line, the code needed to recreate this plot.

nhanes_plot <- nhanes %>%
  mutate(weight = weight * 2.2, height = height / 30.48) %>%
  select(gender, weight, height, bmi) %>%
  drop_na(weight, height) %>% 
  ggplot(mapping = aes(x = weight, y = height, color = bmi)) +
    geom_jitter() +
    geom_smooth(se = FALSE, color = "dodgerblue") +
    facet_wrap(~ gender) + 
    theme_clean() + 
    labs(y = "Height (feet)",
         x = "Weight (pounds)",
         title = "Height Versus Weight in the US",
         subtitle = "Relation between weight and height more positive for heavy men",
         caption = "Source: NHANES")

nhanes_plot

Exercise 1

Begin by running summary() on nhanes. Based on the results, try to guess what the units of measurement are for the weight and height variables.

Exercise 2

In the original tibble, weight is measured in kg and height in cm. Using mutate(), multiply weight by 2.2 and divide height by 30.48 to convert units into pounds and feet.

nhanes %>%
  mutate(weight = ..., height = ...)

Exercise 3

select() the gender, weight, height, and bmi columns.

... %>%
  select(..., ..., ..., ...)

Exercise 4

Use tidyr's drop_na() to remove all rows with a value of NA in either the weight or height columns.

... %>%
  drop_na(..., ...)

Exercise 5

Call ggplot() to make a jittered scatterplot that maps weight to the x-axis, height to the y-axis, and bmi to the color aesthetic.

# Use geom_jitter().

... %>% 
  gplot(data = ..., aes(x = ..., y = ..., color = ...))

Exercise 6

Add a trend line layer to your plot with geom_smooth(). Set the se argument of geom_smooth() to FALSE in order to remove the confidence interval, and change the color of the line to "dodgerblue."

# The se argument should be equal to FALSE.

... + 
  geom_smooth(se = ..., color = ...)

Exercise 7

Use facet_wrap() to facet the graph by gender.

... +
  facet_wrap(...)

Exercise 8

Finally, adjust the feel of the graph with theme_clean() from the ggthemes package.

Exercise 9

Great work! From this graph, we can see that children of both genders tend to grow very quickly without gaining much weight (this is the very steep slope at the beginning of the two graphs). However, after people reach between 100-120 pounds, weight gain becomes coupled with a substantially smaller increase in height. Additionally, both weight and height seem to be higher for males than for females on average. Finally, we can infer from looking at the color gradient that a high bmi (i.e. brighter colors) corresponds with higher weight and lower height.

To finish your plot, use labs() to give the graph a title and subtitle of your choice. Here, again, is our version.

nhanes_plot

Kenya Voting

The kenya data set from the PPBDS.data package records the data from a study in which poll stations in Kenya were assigned to either the control group or a group in which one or more methods --- an SMS reminder, canvassing, etc. --- were used to encourage voter registration. We will recreate this plot:

kenya_plot <- kenya %>%
  filter(treatment %in% c("control", "local", 
                          "SMS", "canvass")) %>%
  droplevels() %>%
  select(treatment, mean_age, reg_byrv13) %>%
  drop_na(mean_age) %>%
  mutate(age_half = ntile(mean_age, 2)) %>%
  group_by(treatment, age_half) %>%
  summarize(mean_turnout = mean(reg_byrv13, na.rm = TRUE)) %>% 
  ggplot(mapping = aes(x = fct_reorder(treatment, mean_turnout), 
                       y = mean_turnout)) +
    geom_col() +
    theme_bw() +
    labs(title = "Turnout Changes in a Kenyan Election",
         subtitle = "Local canvassing has the largest effect",
         x = "Treatment",
         y = "Change in Mean Turnout")

kenya_plot

Exercise 1

Start by glimpse()'ing the kenya data set.

glimpse(kenya)

Exercise 2

Pipe kenya into the count() function to count the number of poll stations in each treatment group.

kenya %>%
  count(...)

Exercise 3

Using a pipe and the %in% operator, filter() kenya to only the rows with a treatment of "control," "local," "SMS," or "canvass."

kenya %>%
  filter(treatment %in% ...)

kenya %>%
  filter(treatment %in% c(...))

Exercise 4

Because treatment is a factor, call droplevels() immediately after the filter() call to avoid future complications.

Exercise 5

select() the treatment, mean_age, and reg_byrv13 columns.

Exercise 6

Use tidyr's drop_na() to remove all rows with a value of NA in the mean_age column.

kenya %>%
  filter(treatment %in% c("control", "local", "SMS", "canvass")) %>%
  droplevels() %>%
  select(treatment, mean_age, reg_byrv13)
  drop_na(...)

Exercise 7

The dplyr ntile() function divides a continuous numerical value into categories depending on its size. Try running the following code to see how we can categorize the polling stations into 4 equally-sized groups based on the distance to the polling station.

kenya %>%
  mutate(distance_quartile = ntile(distance, 4))

Exercise 8

Use mutate() and ntile() to create the variable age_half, which categorizes the mean_age variable into 2 groups: the younger half, and the older half. Add this to the end of our full pipe.

# ntile(mean_age, 2) will categorize age into the younger half and the older
# half.

Exercise 9

Group the data by treatment and age_half.

... %>%
  group_by(...)

Exercise 10

Using summarize(), calculate the variable mean_turnout as the average value of reg_byrv13 in each group. Remember to set the na.rm argument of mean() to TRUE, as there are some NA values in the reg_byrv13 column.

# Don't forget the .groups argument to summarize(). Never ignore warning
# messages.

Exercise 11

Use ggplot() at the end of the pipe to make a bar graph with treatment on the x-axis and mean_turnout on the y-axis.

# Remember to use geom_col() instead of geom_bar() when you map something to the
# y-axis.

Exercise 12

Already, we can see that one of the treatments is significantly more effective at increasing turnout than others. Continue by using facet_wrap() to facet the data by age_half.

Exercise 13

To make the graph easier to read, use fct_reorder() to reorder the treatment variable by mean_turnout. Note, you will no longer use the facet_wrap(), as this will cause issues with the ordering of the treatment categories.

...  %>% 
  ggplot(mapping = aes(x = fct_reorder(..., ...), y = ...)) +
    geom_col()

Exercise 14

Improve the aesthetics of the graph by changing the theme to theme_bw().

Exercise 15

Great job! This graph allows us to see that the presence of a local administrator at the polling location is by far the most effective strategy for increasing voter turnout. In addition, older voters (category 2) are not only more likely to vote than younger voters, but they are also influenced to a greater extent by the presence of a local administrator.

To finish your plot, use labs() to change the x-axis label to "Treatment." In addition, give the graph a title, subtitle and y-axis label of your choosing. Your plot should look something like this:

kenya_plot

Seguro Popular

The sps data set from the PPBDS.data package contains information about a study done on a popular Mexican health insurance program, Seguro Popular. In the study, some Mexican health clusters were randomly "treated." The treatment consisted of encouragement for people in that health cluster to enroll in the health insurance program, as well as funds to improve health facilities in that cluster. We will create this plot:

sps_plot <- sps %>%
  filter(education %in% c("preschool", "secondary", 
                          "high school", "college")) %>%
  select(-c(health_exp_1m, t2_health_exp_1m)) %>%
  mutate(change_in_expenses = t2_health_exp_3m - health_exp_3m) %>%
  group_by(treatment, education) %>%
  summarize(mean_change_in_expenses = mean(change_in_expenses)) %>%
  mutate(treatment = as.factor(treatment)) %>% 
  ggplot(mapping = aes(x = fct_reorder(education, mean_change_in_expenses), 
                       y = mean_change_in_expenses, fill = treatment)) +
    geom_col(position = position_dodge(preserve = "single")) + 
    scale_fill_brewer(palette = "Paired", labels = c("No", "Yes")) +
    theme_minimal() +
    labs(x = "Education",
         y = "Average Change in Expenses",
         title = "Spending Changes and Seguro Popular",
         subtitle = "Honestly, not sure what is going on here . . .",
         caption = "Source: King et al. (2009)")

sps_plot

Exercise 1

Start by calling glimpse() on sps.

glimpse(...)

Exercise 2

Using a pipe and the %in% operator, filter() sps to the observations with an education of "preschool," "secondary," "high school," or "college."

sps %>%
  filter(education %in% ...)

sps %>% 
  filter(education %in% c(...))

Exercise 3

Use select() to remove the columns health_exp_1m and t2_health_exp_1m (these measure health expenses over the past month, whereas health_exp_3m and t2_health_exp_3m measure health expenses over the past 3 months).

# Remember that select(-column_name) returns all columns except for that column.

# Consider using c()

Exercise 4

Use mutate() to create a new variable, change_in_expenses, equal to health_exp_3m subtracted from t2_health_exp_3m (this measures the change in expenses after the treatment period).

Exercise 5

Group the data by treatment and education.

# Use the group_by() function.

Exercise 6

Using summarize(), calculate mean_change_in_expenses, the average of the change_in_expenses for each group.

# Use the summarize() function and the helper function mean(). Your pipe should
# look like:

sps %>%
  filter(education %in% c("preschool", "secondary", "high school", "college")) %>%
  select(-c(health_exp_1m, t2_health_exp_1m)) %>%
  mutate(change_in_expenses = t2_health_exp_3m - health_exp_3m) %>%
  group_by(treatment, education) %>%
  summarize(mean_change_in_expenses = mean(change_in_expenses))

Exercise 7

use ggplot() to make a bar graph that maps education to the x-axis, mean_change_in_expenses to the y-axis, and treatment to the fill aesthetic.

# Remember to use geom_col() instead of geom_bar() when mapping a variable to
# the y-axis.

Exercise 8

Use the position_dodge function with preserve equal to "single" to create a dodged barplot. It does not work well! Don't worry, we will fix in in the next exercise.

# You can use the position argument in the geom_col() layer

Exercise 9

Because treatment column is an integer and not a factor, we can't group our data by it. To fix this problem, use mutate() and as.factor() to change treatment to a factor.

Exercise 10

Use fct_reorder() to reorder the four educations by mean_change_in_expenses.

Exercise 11

Change the aesthetics of the plot by using scale_fill_brewer() with the "Paired" palette. In addition, adjust the labels argument of scale_fill_brewer() so that the legend's labels are equal to the vector ("No", "Yes").

...  + 
    scale_fill_brewer(palette = ..., labels = ...)

Exercise 12

Add theme_minimal() to change the style of the graph.

Exercise 13

Great work! The first thing to notice from our graph is that all of the bars are positive, implying that all medical costs (regardless of treatment) went up over time. However, it is also clear that the treatment caused costs to rise less for all education groups. Finally, it is worth noting that the education category preschool is the only category that the treatment did not significantly affect costs. Since the treatment was designed to help less educated people the most, this is definitely a fact worth investigating.

To finish your plot, use labs() to change the x-axis label to "education." Also, give the graph a title and subtitle of your choice. Here, again, is our version:

sps_plot

Shaming

The shaming data set chronicles a study that attempted to measure the impact of social pressure on voting. Nearly 350,000 people in Michigan were randomly assigned to 1 of 5 treatment groups before the 2006 Michigan primary. All 5 groups were sent mail before the primary: the "Civic Duty" group had an extra reminder that voting was a civic responsibility, the "Hawthorne" group was told that whether or not they voted would be in the public record, the "Self" group was actually sent the public record of whether or not they voted in 2004, and the "Neighbors" group was sent both their voting record and their neighbors' voting record from 2004. We will reproduce this plot.

shaming_plot <- shaming %>%
  select(primary_04, primary_06, treatment) %>%
  mutate(primary_04 = if_else(condition = 
                                str_detect(primary_04, "Yes"),
                                true = 1L,
                                false = 0L)) %>%
  pivot_longer(cols = c(primary_04, primary_06), 
               names_to = "year", values_to = "voted") %>%
  group_by(treatment, year, voted) %>%
  summarize(num_voters = n(), .groups = "drop") %>%
  mutate(voted = if_else(condition = voted == 1,
                         true = "did_vote",
                         false = "did_not_vote")) %>%
  pivot_wider(names_from = voted, values_from = num_voters)  %>%
  mutate(pct_voted = did_vote / (did_vote + did_not_vote)) %>% 
  ggplot(mapping = aes(x = fct_reorder(treatment, pct_voted), 
                       y = pct_voted, fill = year)) +
    geom_col(position = position_dodge(preserve = "single")) +
    coord_cartesian(ylim = c(0.2, 0.45)) +
    labs(x = "Treatment",
         y = "Voting Percentage",
         title = "Voting Rates Based on Mailings",
         subtitle = "Shaming people makes them more likely to vote",
         caption = "Source: Gerber, Green, and Larimer (2008).") +
    theme_fivethirtyeight()

shaming_plot

Exercise 1

Start by running skim() on shaming. See if you can find out how many of the participants voted in general_04. Also, pay close attention to the data types of each of the variables.

Exercise 2

Make the data set easier to visualize by limiting the columns to treatment, primary_04, and primary_06.

# Use the select() function.

Exercise 3

Use pivot_longer() to map the names of the primary_04 and primary_06 columns to a new column, "year," and the values of those two columns to a new column, "voted." What goes wrong?

...  %>%
  pivot_longer(cols = ..., names_to = ..., values_to = ...)

Exercise 4

pivot_longer() won't let us combine one column of <chr> data with one column of <int> data. Therefore, we're going to have to change the data type of one of the columns.

Before we do that, though, we need to understand the dplyr if_else() function. if_else() is used alongside mutate() to create 2 different values depending on whether a condition is TRUE or FALSE. if_else() is a more modern, and somewhat safer, version of ifelse().

Run the following code to see how we can categorize the birth_year column into 2 groups based on whether the person was born before 1950.

shaming %>%
  select(birth_year) %>%
  mutate(born_before_1950 = if_else(condition = birth_year < 1950,
                                 true = "yes",
                                 false = "no"))

Exercise 5

Now, use if_else() before the pivot_longer() statement to mutate() the primary_04 column. The condition should should be a str_detect() of the string "Yes" in primary_04. If the condition is true, return 1 as an integer, and if the condition is false, return 0 as an integer. Don't forget to pivot_longer() once again.

# Remember to append the letter "L" to make something an integer; e.g. 22L.

# The condition of the if_else() should be str_detect(primary_04, "Yes").

Exercise 6

Great job! To continue, group the data by treatment, year, and voted.

# Use the group_by() function.

Exercise 7

Using summarize(), create the variable num_voters that simply counts the number of rows in each group.

# Remember that n() counts the number of rows in each group.

Exercise 8

The voted column is a bit difficult to understand, as a numerical value (0 or 1) is used to represent an idea (the person did or did not vote). Use another if_else() statement to mutate() the voted column. Condition your if_else() on whether voted is equal to 1: true should return "did_vote," and false should return "did_not_vote."

# The condition of the if_else() should be voted == 1.

# Your pipe should look something like this:

shaming %>%
  select(primary_04, primary_06, treatment) %>%
  mutate(primary_04 = if_else(condition = 
                                str_detect(primary_04, "Yes"),
                                true = 1L,
                                false = 0L)) %>%
  pivot_longer(cols = c(primary_04, primary_06), 
               names_to = "year", values_to = "voted") %>%
  group_by(treatment, year, voted) %>%
  summarize(num_voters = n(), .groups = "drop") %>%
  mutate(voted = if_else(condition = voted == 1,
                         true = "did_vote",
                         false = "did_not_vote"))

Exercise 9

Now, we want did_vote and did_not_vote to be their own columns. Call a pivot_wider() function that gets its column names from voted and its values from num_voters.

...  %>%
  pivot_wider(names_from = ..., values_from = ...)

Exercise 10

mutate() a new variable, pct_voted, equal to the number of people who did_vote divided by the total number of people.

# Note that the total number of people is equal to (did_vote + did_not_vote).

Exercise 11

Call ggplot() to make a bar chart that maps treatment to the x-axis, pct_voted to the y-axis, and year to the fill aesthetic.

# Remember to use geom_col() instead of geom_bar() when you map something to the
# y-axis.

Exercise 12

A stacked barplot is very hard to analyze here. Use the position_dodge() function with preserve set to "single" to make the graph a dodged barplot.

# Set the position argument of geom_col()

Exercise 13

Use fct_reorder() to reorder the five different treatments by pct_voted. In other words, you need to change the value of x, probably using fct_reorder(), within the call to aes().

Exercise 14

Finally, use the coord_cartesian function to adjust the graph's zoom by setting ylim to the vector (0.2, 0.45).

Exercise 15

From this graph, the first thing to notice is that all five of the treatment categories had approximately the same average voter turnout in 2004. We can also see that turnout decreased across the board in 2006. However, each additional level of social shaming led to a substantial increase in voter turnout, with the "Neighbors" treatment particularly effective.

Clean up your graph with some sensible labels. Again, here is our version.

shaming_plot

National Election Studies

nes, short for "National Election Studies," contains the personal and political information of almost 40,000 American voters, as well as whether or not they voted in that year's presidential election. The goal of this exercise is to replicate this plot:

nes_plot <- nes %>%
  select(year, education, pres_appr, ideology, voted) %>%
  filter(education %in% c("Some Highschool", "Highschool", 
                          "Some College", "Adv. Degree")) %>%
  droplevels() %>%
  mutate(opinion = as.factor(if_else(
    condition = str_detect(pres_appr, "prove"),
    true = "has_opinion",
    false = "no_opinion"))) %>%
  drop_na() %>%
  group_by(year, voted, opinion, education) %>%
  summarize(count = n()) %>%
  pivot_wider(names_from = voted, values_from = count) %>%
  mutate(turnout = Yes / (Yes + No)) %>%
  drop_na() %>%
  ggplot(aes(x = year,
             y = turnout,
             color = fct_reorder2(education, year,
                                  turnout, .fun = first2),
             linetype = fct_reorder2(education, year, 
                                     turnout, .fun = first2))) +
  geom_line() +
  facet_wrap(~ opinion) +
  scale_color_brewer(palette = "Spectral", name = "education") +
  scale_linetype(name = "education") +
  labs(title = "Turnout and Opinions Over Time",
       subtitle = "People with opinions are more likely to vote",
       x = "Turnout Percentage",
       y = "Year")

nes_plot

Good luck!

Exercise 1

Start by running skim() on nes. See if you can figure out the first and last years in the data set. You may need to load the skimr package first.

Exercise 2

Use levels() and the $ operator to explore the levels of the education variable.

# The factor that we'd like to investigate is nes$education.

Exercise 3

Select the year, education, pres_appr, ideology, and voted columns from nes.

Exercise 4

Continue by filtering the tibble such that education is either "Some Highschool," "Highschool," "Some College," or "Adv. Degree."

# Remember to use the %in% operator along with the c() function.

Exercise 5

After filtering a factor, always remember that the levels of the factor have not been removed. To fix this, add a call in the pipe that drops (i.e. permanently removes) all deleted levels.

# Use the droplevels() function.

Exercise 6

The dplyr if_else() function can be used alongside mutate() and as.factor() to create a factor with two levels depending on whether a certain condition is met. Run the code below to see how we can use if_else() to separate the nes data set into 2 groups based on whether the value of ideology is positive.

nes %>%
  select(year, education, pres_appr, ideology, voted) %>%
  filter(education %in% c("Some Highschool", "Highschool", 
                          "Some College", "Adv. Degree")) %>%
  droplevels() %>%
  mutate(pos_ideology = as.factor(if_else(condition = ideology > 0,
                                          true = "ideology_is_positive",
                                          false = "ideology_not_positive")))

Exercise 7

Now, use if_else() and mutate() tto create a new variable, opinion. opinion should be a factor that equals "has_opinion" if that person either approves or disapproves of the president, but "no_opinion" otherwise. To do this, use the if_else() function with the condition being detection of the pattern "prove" in the string pres_appr. You can remove from your pipe the creation of pos_ideology since we won't be using that variable.

# Remember to use as.factor() before if_else().

# The function we want to use for the condition is str_detect().

nes %>%
  select(year, education, pres_appr, ideology, voted) %>%
  filter(education %in% c("Some Highschool", "Highschool", 
                          "Some College", "Adv. Degree")) %>%
  droplevels() %>%
  mutate(opinion = as.factor(if_else(condition = str_detect(...),
                                     true = ...,
                                     false = ...)))

Exercise 8

Our goal for the next few exercises will be to calculate voter turnout based on education and whether or not the voter has an opinion of the president at that time.

To start, group the tibble by year, voted, opinion, and education. Then, use summarize() to make a new variable, count, that simply counts the number of people (i.e. rows) in each group. What goes wrong?

# Remember that the n() function counts the number of rows in each group.

Exercise 9

If we look closely, many of our rows have a value of NA for either voted or opinion. Furthermore, n() doesn't have an na.rm argument. To remove the NAs, use tidyr's drop_na() to drop any rows with missing values for any variable before the call to group_by() and summarize().

nes %>%
  select(year, education, pres_appr, ideology, voted) %>%
  filter(education %in% c("Some Highschool", "Highschool", 
                          "Some College", "Adv. Degree")) %>%
  droplevels() %>%
  mutate(opinion = as.factor(if_else(condition = str_detect(pres_appr, "prove"),
                                     true = "has_opinion",
                                     false = "no_opinion"))) %>%
  drop_na() %>%
  group_by(year, voted, opinion, education) %>%
  summarize(count = ...)

Exercise 10

Our goal is to calculate voter turnout within each category, the number of people who voted divided by the total number of people. Therefore, we want Yes and No (currently in the voted column) to be their own columns with the values that are currently in the count column.

Use one of the pivot_ functions to tidy our data in this way.

# Because we're making new columns from values, we want the pivot_wider()
# function.

... %>%
  pivot_wider(names_from = ..., values_from = ...)

Exercise 11

Great job! Now, to calculate turnout, use mutate() to create another variable, turnout, equal to the number of people who voted (the number of Yes) divided by the total number of people in each category. To make sure you are caught up, we provide the pipe as it should appear by this stage. You need to add a line to create turnout. But look at the data when you do! See the rows with are NA for turnout? That will mess up our graphics. So, use drop_na() again to remove them.

nes %>%
  select(year, education, pres_appr, ideology, voted) %>%
  filter(education %in% c("Some Highschool", "Highschool", 
                          "Some College", "Adv. Degree")) %>%
  droplevels() %>%
  mutate(opinion = as.factor(if_else(condition = str_detect(pres_appr, "prove"),
                                     true = "has_opinion",
                                    false = "no_opinion"))) %>%
  drop_na() %>%
  group_by(year, voted, opinion, education) %>%
  summarize(count = n()) %>%
  pivot_wider(names_from = voted, values_from = count)

# Remember that the total number of people in each group is (Yes + No).

Exercise 12

Next, use ggplot() to make a line graph that maps year to the x-axis, turnout to the y-axis, and education to both the color and linetype aesthetics.

Exercise 13

Use facet_wrap to facet the graph by opinion.

Exercise 14

This graph is starting to look pretty good, but the ordering of the legend is a bit confusing. Try using an fct_ function to reorder the legend (both the color and linetype aesthetics) so that Adv. Degree is first, Some College is second, and so on. Something goes wrong with this first attempt, but we will fix it in the next exercise.

# Because this is a line graph, we want to use fct_reorder2().

Exercise 15

Although 3 of the 4 education statuses are reordered successfully, Adv. Degree is left on the bottom because it doesn't appear in the second graph. To fix this problem, change the .fun argument of both fct_reorder2() calls to first2 (the default is last2).

... %>%
  ggplot(aes(x = year, 
             y = turnout,
             color = fct_reorder2(education, year, 
                                  turnout, .fun = ...),
             linetype = fct_reorder2(education, year,
                                     turnout, .fun = ...))) +
  geom_line() +
  facet_wrap(~ opinion)

Exercise 16

Our graph is looking good, but the very long legend title takes up a lot of space. To fix this problem (and to improve the colors of the graph), add a scale_color_brewer() function with the "Spectral" palette as well as a scale_linetype function. Use both scale functions to set the name of the legend to "education."

# The scale_color_brewer() function should take 2 arguments, but the
# scale_linetype() function should only take 1.

Exercise 17

Amazing work! From this graph, we see that Americans with a higher level of education tend to vote more often. Furthermore, Americans with an opinion of the president generally have higher voter turnout than Americans without an opinion of the president. Finally, pretty much 100% of Americans with an advanced degree are able to form an opinion of the president (this is why there is no Advanced Degree line in the no_opinion graph: there are too few data points in the table).

To finish your plot, use labs() to give the graph a title and subtitle of your choosing, along with some axis labels. Here is our answer. It is OK if your answer looks somewhat different from ours. Something sure seems fishy about that perfectly flat red line on the right . . . Indeed, there are a couple of things about our plot which could be improved . . . Why not improve them?

nes_plot

Airline Safety

We will be focusing on the data set airline_saftey from the fivethirtyeight package. We will recreate this plot step-by-step.

as_plot <- airline_safety %>%
  pivot_longer(c(incidents_85_99, incidents_00_14, fatal_accidents_00_14,
                 fatal_accidents_85_99, fatalities_00_14, fatalities_85_99),
               names_to = "type_date",
               values_to = "count") %>%
  separate(type_date, c("type", "date"), sep = "s_") %>%
  filter(date == "00_14", count > 0, type == "incident") %>%
  mutate(safety_value = avail_seat_km_per_week / count) %>%
  arrange(safety_value) %>%
  ggplot(aes(x = airline, y = safety_value)) +
    geom_col() +
    theme(axis.text = element_text(size = 5, angle = 90)) +
    labs(title = "Airline Safety",
         subtitle = "Airlines with more long-haul routes are safer",
         x = "Airline",
         y = "Distance per Incident")

as_plot

Exercise 1

Start by taking a glimpse of airline_safety:

Exercise 2

Do you notice anything off about the airline_saftey data? It's untidy! Over the next few questions, we are going to fix this. Let's start by making the table "longer." Use pivot_longer() to pivot the columns incidents_85_99, incidents_00_14, fatal_accidents_00_14,fatal_accidents_85_99, fatalities_00_14 and fatalities_85_99 into the two new columns type_date and count.

airline_safety %>%
  pivot_longer(c(..., ..., ...,..., ..., ...),
               names_to = "...", values_to = "...")

Exercise 3

Nice! Do you see what our next step has to be? In our type_date column, we have data on both the type of incident and the date range in which it occurred. This should really be two columns. Let's use the separate() to separate type_date into the two columns type and date. Separate on the s_ characters by setting the sep argument inside of separate() to "s_".

... %>%
   separate(type_date, c("...", "..."), sep = "...")

Exercise 4

Fantastic! Now that our data is tidy, we can get to work. Add a filter() to the pipe so that we narrow down our data set to data where date is "00_14", type is "incident", and count is greater than 0.

Exercise 5

Nice! Now use mutate() to create the new variable safety_value and set it equal to avail_seat_km_per_week divided by count.

Exercise 6

Nice! Now arrange the data by safety_value.

Exercise 7

Great! Let's now use our wrangled data to create a bar plot using geom_col() with airline on the x axis and safety_value on the y axis.

Exercise 8

Awesome! As you can see, the airline names on the y axis are crowded and basically impossible to read. Let's change this by using theme() and setting the x axis text size to 5 and angle to 90.

... +
  theme(... = element_text(size = ..., angle = ...))

Exercise 9

Great! Now add some labels to finish our plot off! Here is our version. As always, it is OK if yours looks different. Try to make it better!

as_plot

Submit

submission_ui

submission_server()

davidkane9/PPBDS.data documentation built on Nov. 18, 2020, 1:17 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

davidkane9/PPBDS.data Data for Preceptor's Primer for Bayesian Data Science

In davidkane9/PPBDS.data: Data for Preceptor's Primer for Bayesian Data Science

Confirm Correct Package Version

Name

Introduction

National Health and Nutrition Survey

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Kenya Voting

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Exercise 11

Exercise 12

Exercise 13

Exercise 14

Exercise 15

Seguro Popular

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Exercise 11

Exercise 12

Exercise 13

Shaming

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Exercise 11

Exercise 12

Exercise 13

Exercise 14

Exercise 15

National Election Studies

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Exercise 11

Exercise 12

Exercise 13

Exercise 14

Exercise 15

Exercise 16

Exercise 17

Airline Safety

davidkane9/PPBDS.data
Data for Preceptor's Primer for Bayesian Data Science