library(tidyverse) library(lubridate) library(stringr) library(learnr) library(skimr) library(shiny) library(PPBDS.data) knitr::opts_chunk$set(echo = FALSE, message = FALSE) options(tutorial.exercise.timelimit = 60, tutorial.storage="local") # Needed for later sections of the tutorial library(fivethirtyeight) library(nycflights13) library(ggthemes)
Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."
packageVersion('PPBDS.data')
The answer should be ‘0.3.2.9004’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:
remove.packages('PPBDS.data') library(remotes) remotes::install_github('davidkane9/PPBDS.data')
Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.
``` {r name, echo=FALSE} question_text( "Student Name:", answer(NULL, correct = TRUE), incorrect = "Ok", try_again_button = "Modify your answer", allow_retry = TRUE )
## Email ### ``` {r email, echo=FALSE} question_text( "Email:", answer(NULL, correct = TRUE), incorrect = "Ok", try_again_button = "Modify your answer", allow_retry = TRUE )
Purpose of this tutorial is to provide several full scale exercises, each using with a real data set --- one which requires some cleaning and organizing --- and finishing with a professional quality plot. We start each group by showing you the plot which we will replicate by the end.
nhanes
is a data set included in the PPBDS.data package. It includes data from the "National Health and Nutrition Examination Survey," which contains the personal and physical information of 10,000 Americans from two surveys in 2009 and 2011. The goal of this series of exercises is to build, line-by-line, the code needed to recreate this plot.
nhanes_plot <- nhanes %>% mutate(weight = weight * 2.2, height = height / 30.48) %>% select(gender, weight, height, bmi) %>% drop_na(weight, height) %>% ggplot(mapping = aes(x = weight, y = height, color = bmi)) + geom_jitter() + geom_smooth(se = FALSE, color = "dodgerblue") + facet_wrap(~ gender) + theme_clean() + labs(y = "Height (feet)", x = "Weight (pounds)", title = "Height Versus Weight in the US", subtitle = "Relation between weight and height more positive for heavy men", caption = "Source: NHANES") nhanes_plot
Begin by running summary()
on nhanes
. Based on the results, try to guess what the units of measurement are for the weight
and height
variables.
In the original tibble, weight
is measured in kg and height
in cm. Using mutate()
, multiply weight
by 2.2 and divide height
by 30.48 to convert units into pounds and feet.
nhanes %>% mutate(weight = ..., height = ...)
select()
the gender
, weight
, height
, and bmi
columns.
... %>% select(..., ..., ..., ...)
Use tidyr's drop_na()
to remove all rows with a value of NA
in either the weight
or height
columns.
... %>% drop_na(..., ...)
Call ggplot()
to make a jittered scatterplot that maps weight
to the x-axis, height
to the y-axis, and bmi
to the color aesthetic.
# Use geom_jitter().
... %>% gplot(data = ..., aes(x = ..., y = ..., color = ...))
Add a trend line layer to your plot with geom_smooth()
. Set the se
argument of geom_smooth()
to FALSE in order to remove the confidence interval, and change the color
of the line to "dodgerblue."
# The se argument should be equal to FALSE.
... + geom_smooth(se = ..., color = ...)
Use facet_wrap()
to facet the graph by gender
.
... + facet_wrap(...)
Finally, adjust the feel of the graph with theme_clean()
from the ggthemes package.
Great work! From this graph, we can see that children of both genders tend to grow very quickly without gaining much weight (this is the very steep slope at the beginning of the two graphs). However, after people reach between 100-120 pounds, weight gain becomes coupled with a substantially smaller increase in height. Additionally, both weight and height seem to be higher for males than for females on average. Finally, we can infer from looking at the color gradient that a high bmi
(i.e. brighter colors) corresponds with higher weight and lower height.
To finish your plot, use labs()
to give the graph a title and subtitle of your choice. Here, again, is our version.
nhanes_plot
The kenya
data set from the PPBDS.data package records the data from a study in which poll stations in Kenya were assigned to either the control group or a group in which one or more methods --- an SMS reminder, canvassing, etc. --- were used to encourage voter registration. We will recreate this plot:
kenya_plot <- kenya %>% filter(treatment %in% c("control", "local", "SMS", "canvass")) %>% droplevels() %>% select(treatment, mean_age, reg_byrv13) %>% drop_na(mean_age) %>% mutate(age_half = ntile(mean_age, 2)) %>% group_by(treatment, age_half) %>% summarize(mean_turnout = mean(reg_byrv13, na.rm = TRUE)) %>% ggplot(mapping = aes(x = fct_reorder(treatment, mean_turnout), y = mean_turnout)) + geom_col() + theme_bw() + labs(title = "Turnout Changes in a Kenyan Election", subtitle = "Local canvassing has the largest effect", x = "Treatment", y = "Change in Mean Turnout") kenya_plot
Start by glimpse()
'ing the kenya
data set.
glimpse(kenya)
Pipe kenya
into the count()
function to count the number of poll stations in each treatment
group.
kenya %>% count(...)
Using a pipe and the %in%
operator, filter()
kenya
to only the rows with a treatment
of "control," "local," "SMS," or "canvass."
kenya %>% filter(treatment %in% ...)
kenya %>% filter(treatment %in% c(...))
Because treatment
is a factor, call droplevels()
immediately after the filter()
call to avoid future complications.
select()
the treatment
, mean_age
, and reg_byrv13
columns.
Use tidyr's drop_na()
to remove all rows with a value of NA
in the mean_age
column.
kenya %>% filter(treatment %in% c("control", "local", "SMS", "canvass")) %>% droplevels() %>% select(treatment, mean_age, reg_byrv13) drop_na(...)
The dplyr ntile()
function divides a continuous numerical value into categories depending on its size. Try running the following code to see how we can categorize the polling stations into 4 equally-sized groups based on the distance
to the polling station.
kenya %>% mutate(distance_quartile = ntile(distance, 4))
Use mutate()
and ntile()
to create the variable age_half
, which categorizes the mean_age
variable into 2 groups: the younger half, and the older half. Add this to the end of our full pipe.
# ntile(mean_age, 2) will categorize age into the younger half and the older # half.
Group the data by treatment
and age_half
.
... %>% group_by(...)
Using summarize()
, calculate the variable mean_turnout
as the average value of reg_byrv13
in each group. Remember to set the na.rm
argument of mean()
to TRUE
, as there are some NA
values in the reg_byrv13
column.
# Don't forget the .groups argument to summarize(). Never ignore warning # messages.
Use ggplot()
at the end of the pipe to make a bar graph with treatment
on the x-axis and mean_turnout
on the y-axis.
# Remember to use geom_col() instead of geom_bar() when you map something to the # y-axis.
Already, we can see that one of the treatments is significantly more effective at increasing turnout than others. Continue by using facet_wrap()
to facet the data by age_half
.
To make the graph easier to read, use fct_reorder()
to reorder the treatment
variable by mean_turnout
. Note, you will no longer use the facet_wrap()
, as this will cause issues with the ordering of the treatment
categories.
... %>% ggplot(mapping = aes(x = fct_reorder(..., ...), y = ...)) + geom_col()
Improve the aesthetics of the graph by changing the theme to theme_bw()
.
Great job! This graph allows us to see that the presence of a local administrator at the polling location is by far the most effective strategy for increasing voter turnout. In addition, older voters (category 2) are not only more likely to vote than younger voters, but they are also influenced to a greater extent by the presence of a local administrator.
To finish your plot, use labs()
to change the x-axis label to "Treatment." In addition, give the graph a title, subtitle and y-axis label of your choosing. Your plot should look something like this:
kenya_plot
The sps
data set from the PPBDS.data package contains information about a study done on a popular Mexican health insurance program, Seguro Popular. In the study, some Mexican health clusters were randomly "treated." The treatment consisted of encouragement for people in that health cluster to enroll in the health insurance program, as well as funds to improve health facilities in that cluster. We will create this plot:
sps_plot <- sps %>% filter(education %in% c("preschool", "secondary", "high school", "college")) %>% select(-c(health_exp_1m, t2_health_exp_1m)) %>% mutate(change_in_expenses = t2_health_exp_3m - health_exp_3m) %>% group_by(treatment, education) %>% summarize(mean_change_in_expenses = mean(change_in_expenses)) %>% mutate(treatment = as.factor(treatment)) %>% ggplot(mapping = aes(x = fct_reorder(education, mean_change_in_expenses), y = mean_change_in_expenses, fill = treatment)) + geom_col(position = position_dodge(preserve = "single")) + scale_fill_brewer(palette = "Paired", labels = c("No", "Yes")) + theme_minimal() + labs(x = "Education", y = "Average Change in Expenses", title = "Spending Changes and Seguro Popular", subtitle = "Honestly, not sure what is going on here . . .", caption = "Source: King et al. (2009)") sps_plot
Start by calling glimpse()
on sps
.
glimpse(...)
Using a pipe and the %in%
operator, filter()
sps
to the observations with an education
of "preschool," "secondary," "high school," or "college."
sps %>% filter(education %in% ...)
sps %>% filter(education %in% c(...))
Use select()
to remove the columns health_exp_1m
and t2_health_exp_1m
(these measure health expenses over the past month, whereas health_exp_3m
and t2_health_exp_3m
measure health expenses over the past 3 months).
# Remember that select(-column_name) returns all columns except for that column.
# Consider using c()
Use mutate()
to create a new variable, change_in_expenses
, equal to health_exp_3m
subtracted from t2_health_exp_3m
(this measures the change in expenses after the treatment period).
Group the data by treatment
and education
.
# Use the group_by() function.
Using summarize()
, calculate mean_change_in_expenses
, the average of the change_in_expenses
for each group.
# Use the summarize() function and the helper function mean(). Your pipe should # look like: sps %>% filter(education %in% c("preschool", "secondary", "high school", "college")) %>% select(-c(health_exp_1m, t2_health_exp_1m)) %>% mutate(change_in_expenses = t2_health_exp_3m - health_exp_3m) %>% group_by(treatment, education) %>% summarize(mean_change_in_expenses = mean(change_in_expenses))
use ggplot()
to make a bar graph that maps education
to the x-axis, mean_change_in_expenses
to the y-axis, and treatment
to the fill aesthetic.
# Remember to use geom_col() instead of geom_bar() when mapping a variable to # the y-axis.
Use the position_dodge
function with preserve
equal to "single" to create a dodged barplot. It does not work well! Don't worry, we will fix in in the next exercise.
# You can use the position argument in the geom_col() layer
Because treatment
column is an integer and not a factor, we can't group our data by it. To fix this problem, use mutate()
and as.factor()
to change treatment
to a factor.
Use fct_reorder()
to reorder the four educations by mean_change_in_expenses
.
Change the aesthetics of the plot by using scale_fill_brewer()
with the "Paired" palette. In addition, adjust the labels
argument of scale_fill_brewer()
so that the legend's labels are equal to the vector ("No", "Yes").
... + scale_fill_brewer(palette = ..., labels = ...)
Add theme_minimal()
to change the style of the graph.
Great work! The first thing to notice from our graph is that all of the bars are positive, implying that all medical costs (regardless of treatment) went up over time. However, it is also clear that the treatment caused costs to rise less for all education groups. Finally, it is worth noting that the education category preschool
is the only category that the treatment did not significantly affect costs. Since the treatment was designed to help less educated people the most, this is definitely a fact worth investigating.
To finish your plot, use labs()
to change the x-axis label to "education." Also, give the graph a title and subtitle of your choice. Here, again, is our version:
sps_plot
The shaming
data set chronicles a study that attempted to measure the impact of social pressure on voting. Nearly 350,000 people in Michigan were randomly assigned to 1 of 5 treatment groups before the 2006 Michigan primary. All 5 groups were sent mail before the primary: the "Civic Duty" group had an extra reminder that voting was a civic responsibility, the "Hawthorne" group was told that whether or not they voted would be in the public record, the "Self" group was actually sent the public record of whether or not they voted in 2004, and the "Neighbors" group was sent both their voting record and their neighbors' voting record from 2004. We will reproduce this plot.
shaming_plot <- shaming %>% select(primary_04, primary_06, treatment) %>% mutate(primary_04 = if_else(condition = str_detect(primary_04, "Yes"), true = 1L, false = 0L)) %>% pivot_longer(cols = c(primary_04, primary_06), names_to = "year", values_to = "voted") %>% group_by(treatment, year, voted) %>% summarize(num_voters = n(), .groups = "drop") %>% mutate(voted = if_else(condition = voted == 1, true = "did_vote", false = "did_not_vote")) %>% pivot_wider(names_from = voted, values_from = num_voters) %>% mutate(pct_voted = did_vote / (did_vote + did_not_vote)) %>% ggplot(mapping = aes(x = fct_reorder(treatment, pct_voted), y = pct_voted, fill = year)) + geom_col(position = position_dodge(preserve = "single")) + coord_cartesian(ylim = c(0.2, 0.45)) + labs(x = "Treatment", y = "Voting Percentage", title = "Voting Rates Based on Mailings", subtitle = "Shaming people makes them more likely to vote", caption = "Source: Gerber, Green, and Larimer (2008).") + theme_fivethirtyeight() shaming_plot
Start by running skim()
on shaming
. See if you can find out how many of the participants voted in general_04
. Also, pay close attention to the data types of each of the variables.
Make the data set easier to visualize by limiting the columns to treatment
, primary_04
, and primary_06
.
# Use the select() function.
Use pivot_longer()
to map the names of the primary_04
and primary_06
columns to a new column, "year," and the values of those two columns to a new column, "voted." What goes wrong?
... %>% pivot_longer(cols = ..., names_to = ..., values_to = ...)
pivot_longer()
won't let us combine one column of <chr>
data with one column of <int>
data. Therefore, we're going to have to change the data type of one of the columns.
Before we do that, though, we need to understand the dplyr if_else()
function. if_else()
is used alongside mutate()
to create 2 different values depending on whether a condition
is TRUE
or FALSE
. if_else()
is a more modern, and somewhat safer, version of ifelse()
.
Run the following code to see how we can categorize the birth_year
column into 2 groups based on whether the person was born before 1950.
shaming %>% select(birth_year) %>% mutate(born_before_1950 = if_else(condition = birth_year < 1950, true = "yes", false = "no"))
Now, use if_else()
before the pivot_longer()
statement to mutate()
the primary_04
column. The condition should should be a str_detect()
of the string "Yes" in primary_04
. If the condition is true, return 1 as an integer, and if the condition is false, return 0 as an integer. Don't forget to pivot_longer()
once again.
# Remember to append the letter "L" to make something an integer; e.g. 22L.
# The condition of the if_else() should be str_detect(primary_04, "Yes").
Great job! To continue, group the data by treatment
, year
, and voted
.
# Use the group_by() function.
Using summarize()
, create the variable num_voters
that simply counts the number of rows in each group.
# Remember that n() counts the number of rows in each group.
The voted
column is a bit difficult to understand, as a numerical value (0 or 1) is used to represent an idea (the person did or did not vote). Use another if_else()
statement to mutate()
the voted
column. Condition your if_else()
on whether voted
is equal to 1: true should return "did_vote," and false should return "did_not_vote."
# The condition of the if_else() should be voted == 1.
# Your pipe should look something like this: shaming %>% select(primary_04, primary_06, treatment) %>% mutate(primary_04 = if_else(condition = str_detect(primary_04, "Yes"), true = 1L, false = 0L)) %>% pivot_longer(cols = c(primary_04, primary_06), names_to = "year", values_to = "voted") %>% group_by(treatment, year, voted) %>% summarize(num_voters = n(), .groups = "drop") %>% mutate(voted = if_else(condition = voted == 1, true = "did_vote", false = "did_not_vote"))
Now, we want did_vote
and did_not_vote
to be their own columns. Call a pivot_wider()
function that gets its column names from voted
and its values from num_voters
.
... %>% pivot_wider(names_from = ..., values_from = ...)
mutate()
a new variable, pct_voted
, equal to the number of people who did_vote
divided by the total number of people.
# Note that the total number of people is equal to (did_vote + did_not_vote).
Call ggplot()
to make a bar chart that maps treatment
to the x-axis, pct_voted
to the y-axis, and year
to the fill aesthetic.
# Remember to use geom_col() instead of geom_bar() when you map something to the # y-axis.
A stacked barplot is very hard to analyze here. Use the position_dodge()
function with preserve
set to "single" to make the graph a dodged barplot.
# Set the position argument of geom_col()
Use fct_reorder()
to reorder the five different treatment
s by pct_voted
. In other words, you need to change the value of x
, probably using fct_reorder()
, within the call to aes()
.
Finally, use the coord_cartesian
function to adjust the graph's zoom by setting ylim
to the vector (0.2, 0.45)
.
From this graph, the first thing to notice is that all five of the treatment categories had approximately the same average voter turnout in 2004. We can also see that turnout decreased across the board in 2006. However, each additional level of social shaming led to a substantial increase in voter turnout, with the "Neighbors" treatment particularly effective.
Clean up your graph with some sensible labels. Again, here is our version.
shaming_plot
nes
, short for "National Election Studies," contains the personal and political
information of almost 40,000 American voters, as well as whether or not they voted in that year's presidential election. The goal of this exercise is to replicate this plot:
nes_plot <- nes %>% select(year, education, pres_appr, ideology, voted) %>% filter(education %in% c("Some Highschool", "Highschool", "Some College", "Adv. Degree")) %>% droplevels() %>% mutate(opinion = as.factor(if_else( condition = str_detect(pres_appr, "prove"), true = "has_opinion", false = "no_opinion"))) %>% drop_na() %>% group_by(year, voted, opinion, education) %>% summarize(count = n()) %>% pivot_wider(names_from = voted, values_from = count) %>% mutate(turnout = Yes / (Yes + No)) %>% drop_na() %>% ggplot(aes(x = year, y = turnout, color = fct_reorder2(education, year, turnout, .fun = first2), linetype = fct_reorder2(education, year, turnout, .fun = first2))) + geom_line() + facet_wrap(~ opinion) + scale_color_brewer(palette = "Spectral", name = "education") + scale_linetype(name = "education") + labs(title = "Turnout and Opinions Over Time", subtitle = "People with opinions are more likely to vote", x = "Turnout Percentage", y = "Year") nes_plot
Good luck!
Start by running skim()
on nes
. See if you can figure out the first and last
years in the data set. You may need to load the skimr package first.
Use levels()
and the $
operator to explore the levels of the education
variable.
# The factor that we'd like to investigate is nes$education.
Select the year
, education
, pres_appr
, ideology
, and voted
columns from nes
.
Continue by filtering the tibble such that education
is either "Some Highschool," "Highschool," "Some College," or "Adv. Degree."
# Remember to use the %in% operator along with the c() function.
After filtering a factor, always remember that the levels of the factor have not been removed. To fix this, add a call in the pipe that drops (i.e. permanently removes) all deleted levels.
# Use the droplevels() function.
The dplyr if_else()
function can be used alongside mutate()
and as.factor()
to create a factor with two levels depending on whether a certain condition
is met. Run the code below to see how we can use if_else()
to separate the nes
data set into 2 groups based on whether the value of ideology
is positive.
nes %>% select(year, education, pres_appr, ideology, voted) %>% filter(education %in% c("Some Highschool", "Highschool", "Some College", "Adv. Degree")) %>% droplevels() %>% mutate(pos_ideology = as.factor(if_else(condition = ideology > 0, true = "ideology_is_positive", false = "ideology_not_positive")))
Now, use if_else()
and mutate()
tto create a new variable, opinion
. opinion
should be a factor that equals "has_opinion" if that person either approves or disapproves of the president, but "no_opinion" otherwise. To do this, use the if_else()
function with the condition being detection of the pattern "prove" in the string pres_appr
. You can remove from your pipe the creation of pos_ideology
since we won't be using that variable.
# Remember to use as.factor() before if_else().
# The function we want to use for the condition is str_detect().
nes %>% select(year, education, pres_appr, ideology, voted) %>% filter(education %in% c("Some Highschool", "Highschool", "Some College", "Adv. Degree")) %>% droplevels() %>% mutate(opinion = as.factor(if_else(condition = str_detect(...), true = ..., false = ...)))
Our goal for the next few exercises will be to calculate voter turnout based on education and whether or not the voter has an opinion of the president at that time.
To start, group the tibble by year
, voted
, opinion
, and education
. Then, use summarize()
to make a new variable, count
, that simply counts the number of people (i.e. rows) in each group. What goes wrong?
# Remember that the n() function counts the number of rows in each group.
If we look closely, many of our rows have a value of NA
for either voted
or opinion
. Furthermore, n()
doesn't have an na.rm
argument. To remove the NA
s, use tidyr's drop_na()
to drop any rows with missing values for any variable before the call to group_by()
and summarize()
.
nes %>% select(year, education, pres_appr, ideology, voted) %>% filter(education %in% c("Some Highschool", "Highschool", "Some College", "Adv. Degree")) %>% droplevels() %>% mutate(opinion = as.factor(if_else(condition = str_detect(pres_appr, "prove"), true = "has_opinion", false = "no_opinion"))) %>% drop_na() %>% group_by(year, voted, opinion, education) %>% summarize(count = ...)
Our goal is to calculate voter turnout within each category, the number of people who voted divided by the total number of people. Therefore, we want Yes
and No
(currently in the voted
column) to be their own columns with the values that are currently in the count
column.
Use one of the pivot_
functions to tidy our data in this way.
# Because we're making new columns from values, we want the pivot_wider() # function.
... %>% pivot_wider(names_from = ..., values_from = ...)
Great job! Now, to calculate turnout, use mutate()
to create another variable, turnout
, equal to the number of people who voted (the number of Yes
) divided by the total number of people in each category. To make sure you are caught up, we provide the pipe as it should appear by this stage. You need to add a line to create turnout
. But look at the data when you do! See the rows with are NA for turnout? That will mess up our graphics. So, use drop_na()
again to remove them.
nes %>% select(year, education, pres_appr, ideology, voted) %>% filter(education %in% c("Some Highschool", "Highschool", "Some College", "Adv. Degree")) %>% droplevels() %>% mutate(opinion = as.factor(if_else(condition = str_detect(pres_appr, "prove"), true = "has_opinion", false = "no_opinion"))) %>% drop_na() %>% group_by(year, voted, opinion, education) %>% summarize(count = n()) %>% pivot_wider(names_from = voted, values_from = count)
# Remember that the total number of people in each group is (Yes + No).
Next, use ggplot()
to make a line graph that maps year
to the x-axis, turnout
to the y-axis, and education
to both the color and linetype aesthetics.
Use facet_wrap
to facet the graph by opinion.
This graph is starting to look pretty good, but the ordering of the legend is a bit confusing. Try using an fct_
function to reorder the legend (both the color and linetype aesthetics) so that Adv. Degree is first, Some College is second, and so on. Something goes wrong with this first attempt, but we will fix it in the next exercise.
# Because this is a line graph, we want to use fct_reorder2().
Although 3 of the 4 education
statuses are reordered successfully, Adv. Degree is left on the bottom because it doesn't appear in the second graph. To fix this problem, change the .fun
argument of both fct_reorder2()
calls to first2
(the default is last2
).
... %>% ggplot(aes(x = year, y = turnout, color = fct_reorder2(education, year, turnout, .fun = ...), linetype = fct_reorder2(education, year, turnout, .fun = ...))) + geom_line() + facet_wrap(~ opinion)
Our graph is looking good, but the very long legend title takes up a lot of space. To fix this problem (and to improve the colors of the graph), add a scale_color_brewer()
function with the "Spectral" palette as well as a scale_linetype
function. Use both scale functions to set the name of the legend to "education."
# The scale_color_brewer() function should take 2 arguments, but the # scale_linetype() function should only take 1.
Amazing work! From this graph, we see that Americans with a higher level of education tend to vote more often. Furthermore, Americans with an opinion of the president generally have higher voter turnout than Americans without an opinion of the president. Finally, pretty much 100% of Americans with an advanced degree are able to form an opinion of the president (this is why there is no Advanced Degree line in the no_opinion graph: there are too few data points in the table).
To finish your plot, use labs()
to give the graph a title and subtitle of your choosing, along with some axis labels. Here is our answer. It is OK if your answer looks somewhat different from ours. Something sure seems fishy about that perfectly flat red line on the right . . . Indeed, there are a couple of things about our plot which could be improved . . . Why not improve them?
nes_plot
We will be focusing on the data set airline_saftey
from the fivethirtyeight package. We will recreate this plot step-by-step.
as_plot <- airline_safety %>% pivot_longer(c(incidents_85_99, incidents_00_14, fatal_accidents_00_14, fatal_accidents_85_99, fatalities_00_14, fatalities_85_99), names_to = "type_date", values_to = "count") %>% separate(type_date, c("type", "date"), sep = "s_") %>% filter(date == "00_14", count > 0, type == "incident") %>% mutate(safety_value = avail_seat_km_per_week / count) %>% arrange(safety_value) %>% ggplot(aes(x = airline, y = safety_value)) + geom_col() + theme(axis.text = element_text(size = 5, angle = 90)) + labs(title = "Airline Safety", subtitle = "Airlines with more long-haul routes are safer", x = "Airline", y = "Distance per Incident") as_plot
Start by taking a glimpse
of airline_safety
:
Do you notice anything off about the airline_saftey
data? It's untidy! Over the next few questions, we are going to fix this. Let's start by making the table "longer." Use pivot_longer()
to pivot the columns incidents_85_99
, incidents_00_14
, fatal_accidents_00_14
,fatal_accidents_85_99
, fatalities_00_14
and fatalities_85_99
into the two new columns type_date
and count
.
airline_safety %>% pivot_longer(c(..., ..., ...,..., ..., ...), names_to = "...", values_to = "...")
Nice! Do you see what our next step has to be? In our type_date
column, we have data on both the type of incident and the date range in which it occurred. This should really be two columns. Let's use the separate()
to separate type_date
into the two columns type
and date
. Separate on the s_ characters by setting the sep
argument inside of separate()
to "s_".
... %>% separate(type_date, c("...", "..."), sep = "...")
Fantastic! Now that our data is tidy, we can get to work. Add a filter()
to the pipe so that we narrow down our data set to data where date is "00_14", type is "incident", and count is greater than 0.
Nice! Now use mutate()
to create the new variable safety_value
and set it equal to avail_seat_km_per_week
divided by count
.
Nice! Now arrange the data by safety_value
.
Great! Let's now use our wrangled data to create a bar plot using geom_col()
with airline
on the x axis and safety_value
on the y axis.
Awesome! As you can see, the airline names on the y axis are crowded and basically impossible to read. Let's change this by using theme()
and setting the x axis text size
to 5 and angle
to 90.
... + theme(... = element_text(size = ..., angle = ...))
Great! Now add some labels to finish our plot off! Here is our version. As always, it is OK if yours looks different. Try to make it better!
as_plot
submission_ui
submission_server()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.