library(learnr)
library(tidyverse)
library(tutorial.helpers)
library(babynames)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

# Needed for lengthening data plot
billboard_longer <- billboard |>
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE) |>
  mutate(week = parse_number(week))

# Needed for widening data plot
cms_pivoted <- cms_patient_experience |>
  pivot_wider(id_cols = starts_with("org"), names_from = measure_cd, values_from = prf_rate)

# For plots
car_names <- c(`f` = "Front Wheel Drive", `4` = "4 Wheel Drive", `r` = "Rear Wheel Drive")
This tutorial covers Chapter 5: Data tidying from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We use the tidyr package to create "tidy" data, defined as:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Key functions include pivot_longer()
and pivot_wider()
.
You can represent the same underlying data in multiple ways by organizing values in a given dataset in different orders. But not all data is equally easy to use.
In the console, load in the tidyverse package. Copy and paste the loading message (the thing with the conflicts and check marks) into the box below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
dplyr has all your basic data transformation functions like mutate()
, filter()
, and summarise()
. readr allows you to read rectangular data from delimited text files such as csv and tsv.
Copy and paste the third line of the loading message into the box below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
It should have ggplot2 and tibble.
ggplot2 is how we make plots and maps in R. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. The tibble package allows you to create tibbles, a kind of data frame that is lazy and surly: it does less (i.e. it doesn't change variable names or types, and doesn't do partial matching) and complains more (e.g. when a variable does not exist).
Copy and paste the fourth line of the loading message into the box below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
It should show lubridate and tidyr.
lubridate offers a more intuitive way to create and manipulate dates in R, and it accounts for the quirks of our calendar (such as leap days and daylight saving time) that are easy to get wrong. The tidyr package is the most important package for this tutorial. It helps you create tidy data. The three main rules for having tidy data are:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Run the code provided in the box below.
table1
table2
table3
These tables show Tuberculosis cases in three countries: China, Brazil, and Afghanistan.
They all show the same data (country, year, population, cases), but one of them is going to be way easier to work with. Let's see if we can identify the "tidy" table.
Let's take a look at table2
first. Type and run table2
in the box below to see it.
table2
table2
is a tibble with 4 columns and 12 rows. Let's see if the rules apply.
These are the rules for tidy data:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
In table2, the type column alternates between cases and population, and the count column alternates between the number of cases and the population.
Using the three rules, answer the question below.
question_text("Are all the possible variables columns?", message = "No", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 3)
All the possible variables are country
, year
, cases
, and population
. The type column stores the names of two variables, cases and population, rather than values, so each observation is spread across two rows, breaking the first two rules. The count column mixes case counts and population counts because there is no separate cases or population column.
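As a rough sketch of how table2 could be tidied (this step is not one of the exercises above), pivot_wider() can spread the values of type into their own columns. The name table2_tidy is only an illustrative choice.

```r
library(tidyr)

# table2 ships with tidyr. Spread the two variable names stored in `type`
# into their own columns, filled with the matching values from `count`.
table2_tidy <- table2 |>
  pivot_wider(names_from = type, values_from = count)

table2_tidy
```

The result should have one row per country and year, with cases and population as separate columns.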
Now let's look at table3
. Type and run table3
in the box below to see it.
table3
Now, this looks like a very nice dataset with 3 columns and 6 rows, but it breaks one of the rules. These are the rules for tidy data:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Using the three rules, answer the question below.
question_text("Are all the cells a single value?", message = "No", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 3)
The cells in the rate column do not each hold a single value. Each one holds cases divided by population, written as a string such as 745/19987071, so for the data to be considered tidy we would need to split rate into two columns.
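One possible way to split the rate column, sketched here under the assumption that tidyr 1.3.0 or later is installed (separate_wider_delim() was added in that release). The name table3_tidy is hypothetical.

```r
library(tidyr)
library(dplyr)

# Split the "cases/population" strings in `rate` at the slash, then convert
# the two resulting character columns to numbers.
table3_tidy <- table3 |>
  separate_wider_delim(rate, delim = "/", names = c("cases", "population")) |>
  mutate(across(c(cases, population), as.numeric))

table3_tidy
```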
Let's look at table1
. Type and run table1
in the box below to see it.
table1
table1
is a tibble with 4 columns and 6 rows. Let's see if the rules apply.
As a refresher, these are the rules for tidy data:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Using the three rules, answer the question below.
question_text("Are all the rules true in this data?", message = "Yes", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 3)
This is a very good example of a tidy dataset. All the variables are columns. All the columns are variables. All the observations are rows, all the rows are observations, all the values are cells, and all the cells are single values.
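To illustrate why the tidy layout is convenient, here is a small sketch that computes a rate per 10,000 people and a yearly total directly from table1. The names table1_rate and yearly_cases are illustrative, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(tidyr)  # provides table1
library(dplyr)

# Because cases and population are separate columns, a rate is one mutate() away.
table1_rate <- table1 |>
  mutate(rate = cases / population * 10000)

# Totals per year are just as direct.
yearly_cases <- table1 |>
  summarise(total_cases = sum(cases), .by = year)

table1_rate
yearly_cases
```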
Using the tidy data from table1
, we are going to make this plot:
plot1 <- ggplot(data = table1, mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country)) +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Tuberculosis cases in three countries", subtitle = "Between 1999 and 2000", x = "Year", y = "Number of cases", color = "Country", shape = "Country")
plot1
Start making a plot with ggplot
. Use table1
as your data.
ggplot(data = ...)
ggplot(data = table1)
As usual, a call to ggplot()
without an aesthetic or geom results in a blank plotting square.
Add aes()
with the x
argument set to year
and y
set to cases
.
ggplot(... = ..., mapping = aes(x = ..., y = ...))
ggplot(data = table1, mapping = aes(x = year, y = cases))
Providing the aes()
generates axis titles and axis labels.
Add geom_line()
to your plot too.
ggplot(...(...)) + geom_...()
ggplot(data = table1, mapping = aes(x = year, y = cases)) + geom_line()
We clearly don't want to connect the data points in this way.
“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham
In geom_line()
, we need to adjust a couple of things in order to get three separate lines instead of one big "N" on our plot. We need to add aes() inside geom_line(), with group set to country.
... + geom_line(aes(... = country))
ggplot(data = table1, mapping = aes(x = year, y = cases)) + geom_line(aes(group = country))
The group
aesthetic causes ggplot2 to create separate lines for each value of country
, which is what we want.
Now, we need to add a second geom
, geom_point()
. Inside geom_point()
we need another aesthetic, this time with color and shape both equaling country
. This gives each country's points a distinct color and shape, so you can clearly tell the three lines apart.
... + geom_point(aes(color = ..., shape = ...))
ggplot(data = table1, mapping = aes(x = year, y = cases)) + geom_line(aes(group = country)) + geom_point(aes(color = country, shape = country))
Technically you could tell before, but this will make it easier to see which one is which. When a categorical variable is mapped to an aesthetic, ggplot2
will automatically assign a unique value of the aesthetic to each unique level of the variable, a process known as scaling.
Let's fix the x and y axes.
For the x axis, we need to make just two breaks, 1999 and 2000. To do this, we will need to add a layer (using +
). The layer we are going to add is scale_x_continuous()
. We need to change what the breaks are, using breaks = c(1999, 2000)
.
... + scale_x_continuous(breaks = c(..., ...))
ggplot(data = table1, mapping = aes(x = year, y = cases)) + geom_line(aes(group = country)) + geom_point(aes(color = country, shape = country)) + scale_x_continuous(breaks = c(1999, 2000))
The family of scale_*
functions allows for fine-grained control of our plot axes.
To add commas to the y axis, we will use scale_y_continuous()
on another layer. Inside the parentheses, we will use labels = scales::comma
.
... + scale_y_continuous(... = scales::comma)
ggplot(data = table1, mapping = aes(x = year, y = cases)) + geom_line(aes(group = country)) + geom_point(aes(color = country, shape = country)) + scale_x_continuous(breaks = c(1999, 2000)) + scale_y_continuous(labels = scales::comma)
breaks
determines which values along the axis get tick marks and labels. labels
specifies how those labels are displayed.
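As a small illustration of what a labelling function does, the sketch below calls scales::comma() directly on two made-up break values. The name tick_text is only for illustration.

```r
library(scales)

# A labelling function takes the numeric break values and returns the
# text that is drawn next to each tick mark.
tick_text <- comma(c(1000, 250000))
tick_text
# → "1,000" "250,000"
```

Passing labels = scales::comma to scale_y_continuous() asks ggplot2 to apply this same function to whatever breaks it chose.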
The last step is to add labels to our plot. Reminder: This is what your plot should look like.
plot1
... + labs(title = "...", subtitle = "...", x = "...", y = "...", color = "Country", shape = "Country")
ggplot(data = table1, mapping = aes(x = year, y = cases)) + geom_line(aes(group = country)) + geom_point(aes(color = country, shape = country)) + scale_x_continuous(breaks = c(1999, 2000)) + scale_y_continuous(labels = scales::comma) + labs(title = "Tuberculosis cases in three countries", subtitle = "Between 1999 and 2000", x = "Year", y = "Number of cases", color = "Country", shape = "Country")
To add a title to the key on the side, you need to label both the shape and the color "Country" (as that is what the key is telling us).
The rules of tidy data might seem so obvious that you wonder if you’ll ever encounter data that isn’t tidy. Unfortunately, almost all real data is untidy, and you will need to clean it.
For this section, we will need to use the billboard
dataset. Type and run billboard
in the box below.
billboard
The billboard
dataset consists of song rankings from the Billboard top 100 songs in the year 2000. This data is going to be extremely difficult to work with. It has 79 columns! Start a pipe with billboard
.
Pipe billboard
to pivot_longer()
. We need to grab all the columns that start with wk
, because that is something that all 76 of the "extra" columns have in common. We can do this by using cols = starts_with("wk")
within the call to pivot_longer()
.
billboard |> pivot_longer(... = starts_with("..."))
billboard |> pivot_longer(cols = starts_with("wk"))
cols
specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select()
so here we could use !c(artist, track, date.entered)
or contains("wk").
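The three ways of writing cols mentioned above should select the same 76 columns. A quick sketch to check this (the object names are illustrative):

```r
library(tidyr)

# Three selections that should pick out the same wk1 ... wk76 columns.
by_prefix   <- billboard |> pivot_longer(cols = starts_with("wk"))
by_negation <- billboard |> pivot_longer(cols = !c(artist, track, date.entered))
by_contains <- billboard |> pivot_longer(cols = contains("wk"))

# Compare the results.
all.equal(by_prefix, by_negation)
all.equal(by_prefix, by_contains)
```

starts_with("wk") is arguably the safest of the three here, since it states the intent most directly.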
What should we do with the new name
and value
columns?
We need to turn these 76 columns and all these observations into two columns: rank
and week
. We can do this by adding names_to = "week"
, and values_to = "rank"
to pivot_longer()
. (Make sure you have commas separating each argument.)
billboard |> pivot_longer(cols = starts_with("wk"), names_to = "...", values_to = "...")
billboard |> pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank")
names_to
names the variable stored in the column names; we named that variable week
. values_to
names the variable stored in the cell values; we named that variable rank
.
Now we need to get rid of all the NAs in the dataset. We can do this by using values_drop_na = TRUE
inside pivot_longer().
billboard |> pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank", ... = TRUE)
billboard |> pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE)
These NAs don’t really represent unknown observations; they were forced to exist by the structure of the dataset. So we can ask pivot_longer()
to get rid of them by setting values_drop_na = TRUE
.
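To see how much of the pivoted data consists of these structural NAs, the sketch below compares row counts with and without values_drop_na = TRUE. The object names are illustrative.

```r
library(tidyr)

# Pivot once keeping every cell, and once dropping the missing ranks.
with_na <- billboard |>
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank")

without_na <- billboard |>
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank",
               values_drop_na = TRUE)

nrow(with_na)              # one row per song per week column
sum(is.na(with_na$rank))   # rows that exist only because of the wide layout
nrow(without_na)           # rows that hold an actual ranking
```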
Although this data is considered tidy, we can still clean it up to make it more understandable. We can get rid of all the extra wk
's in the week
column to facilitate our plots.
We do this by continuing our pipe with mutate()
. Inside of mutate()
, we can set week
equal to parse_number(week)
. This will get rid of all the wk
's, leaving just the numbers.
... |> mutate(... = parse_number(week))
billboard |> pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE) |> mutate(week = parse_number(week))
With mutate()
we can convert character strings to numbers. This way of using mutate()
doesn't create a new column, but changes an existing column. parse_number()
is a handy function that will extract the first number from a string, ignoring all other text.
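A minimal sketch of parse_number() on a few week labels like the ones in this dataset (the vector names are illustrative):

```r
library(readr)

# Week labels as they appear after pivoting.
week_labels <- c("wk1", "wk10", "wk76")

# parse_number() drops the non-numeric prefix and returns doubles.
week_numbers <- parse_number(week_labels)
week_numbers
# → 1 10 76
```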
Assign this mutated dataset to billboard_longer
, using <-
billboard_longer <- billboard |> pivot_longer(...) |> ...
billboard_longer <- billboard |> pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank", values_drop_na = TRUE) |> mutate(week = parse_number(week))
This will make it easier to make this dataset into a plot. With billboard_longer
, we don't have to pipe the entire process into ggplot()
.
We are going to be making this plot:
plot2 <- billboard_longer |>
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) +
  scale_y_reverse() +
  labs(title = "Billboard top 100 rankings in the year 2000", subtitle = "Over the course of 76 Weeks", x = "Week", y = "Rank")
plot2
Pipe billboard_longer to ggplot(). Add aes(), with x equal to week, y equal to rank, and group equal to track. Add geom_line().
billboard_longer |> ggplot(aes(x = ..., y = ..., group = ...)) + ...
billboard_longer |> ggplot(aes(x = week, y = rank, group = track)) + geom_line()
Interesting. Spend time looking at your data, at every stage of plot creation.
Inside geom_line()
we need to change the opacity of the lines using alpha
. We will set it to 0.25.
... + geom_line(alpha = ...)
billboard_longer |> ggplot(aes(x = week, y = rank, group = track)) + geom_line(alpha = 0.25)
Using alpha
to increase transparency makes seeing patterns easier.
This plot is misleading. It seems to tell us that many songs reach the top spot, because so many lines are concentrated there. But rankings go from 100 (lowest) to 1 (highest), so we should flip this plot upside down.
We do this with scale_y_reverse()
. That will put 1 at the top and 100 at the bottom.
... + scale_y_reverse()
billboard_longer |> ggplot(aes(x = week, y = rank, group = track)) + geom_line(alpha = 0.25) + scale_y_reverse()
Now, we need to add labels. Reminder: This is what your plot should look like.
plot2
... + labs(title = ..., subtitle = ..., x = ..., y = ...)
billboard_longer |> ggplot(aes(x = week, y = rank, group = track)) + geom_line(alpha = 0.25) + scale_y_reverse() + labs(title = "Billboard top 100 rankings in the year 2000", subtitle = "Over the course of 76 Weeks", x = "Week", y = "Rank")
Now you know how to pivot wide datasets, and when to pivot. Next, we need to learn what to do with datasets that are too long.
pivot_wider()
makes datasets wider by increasing columns and reducing rows. It helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data, such as the Census.
For this section, we'll be using the dataset cms_patient_experience
, a patient experiences dataset from the Centers for Medicare & Medicaid Services.
In the Console, look up the help page for tidyr using ?
before the package name. Copy and paste the Description from the help page into the box below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 8)
Putting a question mark before a package name is a shortcut to looking up the help page. You could always go to the Help pane (in the bottom right corner of the default RStudio layout) and search for "tidyr", but ?tidyr
is quicker.
Look up the help page for cms_patient_experience
using a question mark in front of it, just as you did with tidyr
. Copy and paste the "Usage" section.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
This gives you all the information for this data: what package it comes from, where the data is sourced from, and how to see and use the data.
This data comes from the tidyr package, so we won't need any libraries other than the tidyverse.
Let's take a look at our data.
Type cms_patient_experience
and hit "Run Code".
cms_patient_experience
cms_patient_experience
is a tibble with 5 columns and 500 observations.
We will want to see 12 rows so we can see how this data is not tidy. You can do that with the function print()
. Inside print()
we need to specify x
and n
. x
will be the data, and n
will be the number of rows we see. Run print()
with x
equal to cms_patient_experience
and n
equal to 12
.
print(x = ..., ... = 12)
print(x = cms_patient_experience, n = 12)
This data is not tidy! In the column measure_cd
, we have six values that repeat for every organization and would work much better as separate columns.
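One way to list the repeating measures is to pull out the distinct combinations of measure_cd and measure_title. This is only a sketch, and the name measures is illustrative.

```r
library(tidyr)  # provides cms_patient_experience
library(dplyr)

# Each distinct measure code appears once here, alongside its description.
measures <- cms_patient_experience |>
  distinct(measure_cd, measure_title)

measures
```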
pivot_wider() takes arguments that mirror those of pivot_longer(). Remember how we used names_to and values_to to collapse all the extra columns into rows? For pivot_wider(), we do the opposite. We will use names_from and values_from to turn the extra rows into more columns.
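A small made-up example may help show what pivot_wider() does with names_from and values_from. The tibble readings and its values are invented purely for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical measurements: patient "B" has no third reading.
readings <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1",        100,
  "B", "bp1",        140,
  "B", "bp2",        115,
  "A", "bp2",        120,
  "A", "bp3",        105
)

# Column names come from `measurement`, cell values from `value`.
# Combinations that never occur in the long data are filled with NA.
readings_wide <- readings |>
  pivot_wider(names_from = measurement, values_from = value)

readings_wide
```

The result should have one row per id and one column for each of bp1, bp2, and bp3.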
Pipe cms_patient_experience
to pivot_wider()
, setting names_from
equal to measure_cd
and values_from
equal to prf_rate
.
cms_patient_experience |> pivot_wider(... = measure_cd, values_from = ...)
cms_patient_experience |> pivot_wider(names_from = measure_cd, values_from = prf_rate)
This data looks a little bit tidier, but there are still repeated rows for each organization in org_nm. We need to tell pivot_wider() which columns uniquely identify each row. The columns we need are org_pac_id and org_nm. We can select these columns by adding id_cols = starts_with("org") to pivot_wider().
cms_patient_experience |> pivot_wider(names_from = measure_cd, values_from = prf_rate, id_cols = ...)
cms_patient_experience |> pivot_wider(names_from = measure_cd, values_from = prf_rate, id_cols = starts_with("org"))
This leaves one row for each unique combination of values in the columns that start with "org". And as we can see, there were many duplicates. id_cols
cut the row count from 500 to 95.
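As a sanity check, the number of rows after pivoting should match the number of distinct organizations. A sketch, with illustrative object names:

```r
library(tidyr)
library(dplyr)

# Count the distinct organizations in the long data.
n_orgs <- cms_patient_experience |>
  distinct(org_pac_id, org_nm) |>
  nrow()

# Pivot with the organization columns as identifiers.
cms_wide <- cms_patient_experience |>
  pivot_wider(id_cols = starts_with("org"),
              names_from = measure_cd,
              values_from = prf_rate)

# One row per organization is expected.
nrow(cms_wide) == n_orgs
```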
The babynames
data set comes from the SSA (Social Security Administration). After 1986, all babies were required to be given a social security number. This data contains all the names, year born, and sex of the babies. The number of babies with each name is counted and converted into a proportion. We will be using this data to analyze the popularity of the top 10 names each year.
Let's create this graph:
baby_p <- babynames |>
  group_by(year, sex) |>
  top_n(n = 10) |>
  ungroup() |>
  summarise(total = sum(prop), .by = c(sex, year)) |>
  ggplot(mapping = aes(x = year, y = total, color = sex)) +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Total Popularity of Top 10 Names", subtitle = "The most popular names are becoming much less popular", x = "Year", y = "Top 10 Names as Percentage of All Names")
baby_p
Use library()
to load the babynames package.
library(...)
library(babynames)
Learn more about the package from its website.
Recall that your Console and the exercise code chunks for this tutorial use different R sessions. The fact that you loaded babynames in an exercise code chunk does not mean that it is loaded in your Console.
In your Console, after loading the babynames package, look up the help page for the babynames
tibble with ?babynames
. Copy/paste the "Format" paragraph.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
The variable prop
is more subtle than it might, at first, appear. It is the proportion, within a given year
and sex
, of the total number of babies who have that given name.
Run summary()
on the babynames
tibble.
summary(...)
Note that we have no missing values.
Print out babynames
babynames
There are almost 2 million observations.
Pipe babynames
to group_by(year, sex)
. This will lead future commands in the pipe to do all calculations on a per year/sex basis, at least until we ungroup()
the data.
babynames |> group_by(year, sex)
Modern R code rarely uses group_by()
anymore, preferring the .by
/by
arguments to Tidyverse functions like summarise()
.
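To compare the two styles, here is a minimal sketch on a made-up tibble. The data and object names are invented, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical scores for two teams.
scores <- tibble(team = c("a", "a", "b", "b"), points = c(1, 2, 3, 4))

# Persistent grouping: group first, then summarise.
grouped <- scores |>
  group_by(team) |>
  summarise(total = sum(points))

# Per-operation grouping: the grouping applies to this call only.
per_call <- scores |>
  summarise(total = sum(points), .by = team)

grouped
per_call
```

Both give one total per team. One difference worth knowing is that group_by() sorts the groups, while .by keeps them in order of first appearance; the two happen to coincide in this example.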
Continue the pipe with top_n(n = 10)
.
... |> top_n(n = 10)
This pulls out the 10 most popular names, for each year and, within each year, for each sex.
Note that there should be 2,760 rows in the resulting tibble: 138 years times 10 names times 2 sexes. Instead we have 2,761 rows. Can you figure out why? Does it mess up our plot?
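One possible explanation worth checking is ties: top_n() keeps every row tied for the last place, so it can return more than n rows. The toy data below is invented to show that behavior, alongside slice_max() with with_ties = FALSE as an alternative that returns exactly n rows.

```r
library(dplyr)

# Hypothetical names where two entries share the second-highest proportion.
toy <- tibble(name = c("w", "x", "y", "z"), prop = c(0.1, 0.2, 0.2, 0.3))

# top_n() includes both tied rows, so asking for 2 returns 3.
with_ties <- toy |> top_n(n = 2, wt = prop)

# slice_max() can be told to break ties and return exactly 2.
no_ties <- toy |> slice_max(prop, n = 2, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```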
Continue the pipe with ungroup()
.
... |> ungroup()
One reason that group_by()
is being phased out is that, unless you remember to ungroup()
in the pipe, you will get weird results and/or errors. ungroup()
removes the grouping that you previously placed on the tibble.
Continue the pipe with summarise()
, using the argument total = sum(prop)
.
... |> summarise(total = sum(prop))
This produces a somewhat nonsense result. The sum of all the values of prop
for all these years and both sexes does not really mean anything.
Edit the call to summarise() by adding the .by argument with the value c(sex, year).
... |> summarise(total = sum(prop), .by = c(...))
This generates the data which we want to plot.
Continue your pipe with ggplot()
. Within aes()
, map year
to the x-axis, total
to the y-axis, and sex
to color. Also add geom_point()
.
... |> ggplot(mapping = aes(x = ..., y = ..., color = ...)) + geom_point()
To finish your plot, use labs()
to give the graph a title, subtitle, and axis labels.
... + labs(...)
Reminder: This is what your plot should look like.
baby_p
Can you see where that extra row we ignored earlier in the analysis messes things up for one point in the male set in the 1880s?
This tutorial covered Chapter 5: Data tidying from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We used the tidyr package to create "tidy" data, defined as:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Key functions included pivot_longer()
and pivot_wider()
.
Read the Pivot vignette for more details.