Data tidying
In r4ds.tutorials: Tutorials for "R for Data Science"

library(learnr)
library(tidyverse)
library(tutorial.helpers)
library(babynames)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

# Needed for lengthening data plot

billboard_longer <- billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank",
               values_drop_na = TRUE) |>
  mutate(week = parse_number(week))

# Needed for widening data plot

cms_pivoted <- cms_patient_experience |>
  pivot_wider(id_cols = starts_with("org"), 
            names_from = measure_cd,
            values_from = prf_rate)

# For plots

car_names <- c(`f` = "Front Wheel Drive",
               `4` = "4 Wheel Drive",
               `r` = "Rear Wheel Drive")

Introduction

This tutorial covers Chapter 5: Data tidying from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We use the tidyr package to create "tidy" data, defined as:

Every column is a variable.

Every row is an observation.

Every cell is a single value.

Key functions include pivot_longer() and pivot_wider().

Tidy Data

You can represent the same underlying data in multiple ways by organizing values in a given dataset in different orders. But not all data is equally easy to use.

Exercise 1

In the console, load in the tidyverse package. Copy and paste the loading message (the thing with the conflicts and check marks) into the box below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

dplyr has all your basic data transformation functions like mutate(), filter(), and summarise(). readr allows you to read data that is in the form of a csv, tsv, or spreadsheet.

Exercise 2

Copy and paste the third line of the loading message into the box below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

It should have ggplot2 and tibble.

ggplot2 is how we make plots and maps in R. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. The tibble package allows you to create tibbles, a kind of data frame that is lazy and surly: it does less (i.e. it doesn't change variable names or types, and doesn't do partial matching) and complains more (e.g. when a variable does not exist).

Exercise 3

Copy and paste the 4th line of the loading message into the box below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

It should show lubridate and tidyr

lubridate is a more intuitive way to change and make dates in R, and allows for all of the quirks that our time has (such as leap days and daylight savings) that computers and programs don't understand. The tidyr package is the most important package for this tutorial. It makes sure that you have Tidy Data. The three main rules for having tidy data are:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Exercise 4

Run the code provided in the box below.

table1
table2
table3

table1
table2
table3

These tables show Tuberculosis cases in three countries: China, Brazil, and Afghanistan.

They all show the same data (country, year, population, cases), but one of them is going to be way easier to work with. Let's see if we can identify the "tidy" table.

Exercise 5

Let's take a look at table2 first. Type and run table2 in the box below to see it.

table2

table2

table2 is a tibble with 4 columns and 12 rows. Let's see if the rules apply.

Exercise 6

These are the rules for tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

In the column type, it switches in between cases and population. In the column count it also switches in between the number of cases and the population.

Using the three rules, answer the questions below.

Exercise 7

question_text("Are all the possible variables columns?",
    message = "No",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

All the possible variables are country, year, cases, and population. The type column has two of the variables that are not observations, breaking the first two rules. The count column has observations that switch between cases and population, because there is no cases or population column.

Exercise 8

Now let's look at table3. Type and run table3 in the box below to see it.

table3

table3

Now, this looks like a very nice dataset with 3 columns and 6 rows, but it breaks one of the rules. These are the rules for tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Exercise 9

Using the three rules, answer the question below.

question_text("Are all the cells a single value?",
    message = "No",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

Each of the cells in the rate column are not a single observation. They are population divided by cases, so in order for the data to be considered tidy, we would need to separate those columns.

Exercise 10

Let's look at table1. Type and run table1 in the box below to see it.

table1

table1

table1 is a tibble with 4 columns and 6 rows. Let's see if the rules apply.

Exercise 11

As a refresher, these are the rules for tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

Using the three rules, answer the questions below.

question_text("Are all the rules true in this data?",
    message = "Yes",
    answer(NULL, correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 3)

This is a very good example of a tidy dataset. All the variables are columns. All the columns are variables. All the observations are rows, all the rows are observations, all the values are cells, and all the cells are single values.

Exercise 12

Using the tidy data from table1, we are going to make this plot:

plot1 <- ggplot(data = table1, mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country)) +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Tuberculosis cases in three countries",
       subtitle = "Between 1999 and 2000",
       x = "Year",
       y = "Number of cases",
       color = "Country",
       shape = "Country")

plot1

Start making a plot with ggplot. Use table1 as your data.

ggplot(data = ...)

ggplot(data = table1)

As usual, a call to ggplot() with an aesthetic or geom results in a blank plotting square.

Exercise 13

Add aes() with the x argument set to year and y set to cases.

ggplot(... = ..., 
       mapping = aes(x = ..., y = ...))

ggplot(data = table1,
       mapping = aes(x = year, y = cases))

Providing the aes() generates axis titles and and axis labels.

Exercise 14

Add geom_line() to your plot too.

ggplot(...(...)) + 
  geom_...()

ggplot(data = table1,
       mapping = aes(x = year, y = cases)) +
  geom_line()

We clearly don't want to connect the data points in this way.

“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham

Exercise 15

In geom_line(), we need to mess around with a couple things in order to get three separate lines instead of one big "N" on our plot. We need to add an aes inside of `geom_line(), with group equaling country (inside parenthesis).

... + 
  geom_line(aes(... = country))

ggplot(data = table1,
       mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country))

The group aesthetic causes ggplot2 to create separate lines for each value of country, which is what we want.

Exercise 16

Now, we need to add a second geom, geom_point(). Inside geom_point() we need another aesthetic, this time with color and shape both equaling country. This will make three distinctive lines that you can clearly see which one is which.

... +
  geom_point(aes(color = ..., shape = ...))

ggplot(data = table1,
       mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country)) +
  geom_point(aes(color = country, shape = country))

Technically you could tell before, but this will make it easier to see which one is which. When a categorical variable is mapped to an aesthetic, ggplot() will automatically assign a unique value of the aesthetic to each unique level of the variable, a process known as scaling.

Exercise 17

Let's fix the x and y axes

For the x axis, we need make just two breaks, 1999 and 2000. To do this, we will need to add a layer (using +). The layer we are going to add is scale_x_continuous(). We need to change what the breaks are, using breaks = c(1999, 2000).

... +
  scale_x_continuous(breaks = c(..., ....))

ggplot(data = table1,
       mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country)) +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000))

The family of scale_* functions allows for fine-grained control of our plot axes.

Exercise 18

To add commas to the y axis, we will use scale_y_continuous() on another layer. Inside of the parenthesis, we will use labels = scales::comma.

... +
  scale_y_continuous(... = scales::comma)

ggplot(data = table1,
       mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country)) +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) +
  scale_y_continuous(labels = scales::comma)

breaks determines which labels are used. labels specifies how those labels are displayed.

Exercise 19

The last step is to add labels to our plot. Reminder: This is what your plot should look like.

plot1

... +
  labs(title = "...",
       subtitle = "...",
       x = "...",
       y = "..."
       color = "Country",
       shape = "Country")

ggplot(data = table1,
       mapping = aes(x = year, y = cases)) +
  geom_line(aes(group = country)) +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Tuberculosis cases in three countries",
       subtitle = "Between 1999 and 2000",
       x = "Year",
       y = "Number of cases",
       color = "Country",
       shape = "Country")

To add labels to the key on the side, you will need to label both the shape and the color "Country" (as that is what the key is telling us)

Lengthening Data

The rules of tidy data might seem so obvious that you wonder if you’ll ever encounter data that isn’t tidy. Unfortunately, almost all real data is untidy, and you will need to clean it.

Exercise 1

For this section, we will need to use the billboard dataset. Type and run billboard in the box below

billboard

billboard

The billboard dataset consists of song rankings from the Billboard top 100 songs in the year 2000. This data is going to be extremely difficult to work with. It has 79 columns! Start a pipe with billboard.

Exercise 2

Pipe billboard to pivot_longer(). We need to grab all the columns that start with wk, because that is something that all 76 of the "extra" columns have in common. We can do this by using cols = starts_with("wk") within the call to pivot_longer().

billboard |>
  pivot_longer(... = starts_with("..."))

billboard |>
  pivot_longer(cols = starts_with("wk"))

cols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select() so here we could use !c(artist, track, date.entered) or contains("wk")

Exercise 3

What should we do with the new name and value columns?

We need to turn these 76 columns and all these observations into two columns: rank and week. We can do this by adding names_to = "week", and values_to = "rank to pivot_longer(). (Make sure you have commas separating each argument.)

billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_to = "...",
               values_to = "...")

billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank")

names_to names the variable stored in the column names, we named that variable week. values_to names the variable stored in the cell values, we named that variable rank.

Exercise 4

Now we need to get rid of all the NA's in the dataset. We can do this by using values_drop_na = TRUE inside pivot_longer()

billboard |>
  pivot_longer(cols = starts_with("wk"),
             names_to = "week",
             values_to = "rank",
             ... = TRUE)

billboard |>
  pivot_longer(cols = starts_with("wk"),
             names_to = "week",
             values_to = "rank",
             values_drop_na = TRUE)

These NAs don’t really represent unknown observations; they were forced to exist by the structure of the dataset. So we can ask pivot_longer() to get rid of them by setting values_drop_na = TRUE.

Exercise 5

Although this data is considered tidy, we can still clean it up to make it more understandable. We can get rid of all the extra wk's in the week column to facilitate our plots.

We do this by continuing our pipe with mutate(). Inside of mutate(), we can set week equal to parse_number(week). This will get rid of all the wk's, leaving just with numbers.

... |>
  mutate(... = parse_number(week))

billboard |>
  pivot_longer(cols = starts_with("wk"),
             names_to = "week",
             values_to = "rank",
             values_drop_na = TRUE) |> 
  mutate(week = parse_number(week))

With mutate() we can convert character strings to numbers. This way of using mutate() doesn't create a new column, but changes an existing column. parse_number() is a handy function that will extract the first number from a string, ignoring all other text.

Exercise 6

Assign this mutated dataset to billboard_longer, using <-

billboard_longer <- billboard |>
  pivot_longer(...) +
  ...

billboard_longer <- billboard |>
  pivot_longer(cols = starts_with("wk"),
             names_to = "week",
             values_to = "rank",
             values_drop_na = TRUE) |> 
  mutate(week = parse_number(week))

This will make it easier to make this dataset into a plot. With billboard_longer, we don't have to pipe the entire process into ggplot().

We are going to be making this plot:

plot2 <- billboard_longer |> 
  ggplot(aes(x = week, y = rank, group = track)) + 
  geom_line(alpha = 0.25) + 
  scale_y_reverse()+
  labs(title = "Billboard top 100 rankings in the year 2000",
       subtitle = "Over the course of 76 Weeks",
       x = "Week",
       y = "Rank")

plot2

Exercise 7

Pipe bilboard_longer to ggplot(). Add aes(), with x equal week, y equal rank, and group equal track. Add geom_line().

billboard_longer |> 
  ggplot(aes(x = ..., 
             y = ..., 
             group = ...)) +
    ...

billboard_longer |> 
  ggplot(aes(x = week, 
             y = rank, 
             group = track)) +
    geom_line()

Interesting. Spend time looking at your data, at every stage of plot creation.

Exercise 8

Inside geom_line() we need to change the opacity of the lines using alpha. We will set it to 0.25.

... +
  geom_line(alpha = ...)

billboard_longer |> 
  ggplot(aes(x = week, 
             y = rank, 
             group = track)) +
    geom_line(alpha = 0.25)

Using alpha to increase transparency makes seeing patterns easier.

Exercise 9

This plot is deceiving. This plot seems to tell us that so many songs get the top place, because so many lines are concentrated there. But rankings go from 100 (lowest) to 1 (highest), so we should probably flip this plot upside down.

We do this with scale_y_reverse(). That will put 1 at the top and 100 at the bottom.

... +
  scale_y_reverse()

billboard_longer |> 
  ggplot(aes(x = week, 
             y = rank, 
             group = track)) +
    geom_line(alpha = 0.25) +
    scale_y_reverse()

Exercise 10

Now, we need to add labels. Reminder: This is what your plot should look like

plot2

... + 
  labs(title = ...,
       subtitle = ...,
       x = ...,
       y = ...)

billboard_longer |> 
  ggplot(aes(x = week, 
             y = rank, 
             group = track)) +
    geom_line(alpha = 0.25) +
    scale_y_reverse() +
    labs(title = "Billboard top 100 rankings in the year 2000",
         subtitle = "Over the course of 76 Weeks",
         x = "Week",
         y = "Rank")

Now you know how to pivot wide datasets, and when to pivot. Now, we need to learn about what we need to do with the really datasets.

Widening Data

pivot_wider() makes datasets wider by increasing columns and reducing rows. It helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data, such as the Census.

For this section, we'll be using the dataset cms_patient_experience, a patient experiences dataset from the Centers of Medicare and Medicaid Services.

Exercise 1

In the Console, look up the help page for tidyr using ? before the package name. Copy and paste the Description from the help page into the box below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)

Putting a question mark before a package name is a shortcut to looking up the help page. You could always go to the help pane in the bottom left corner, and search for "tidyr," but ?tidyr is quicker.

Exercise 2

Look up the help page for cms_patient_experience using a question mark in front of it, just as you did with tidyr. Copy and paste the "Usage" section.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

This gives you all the information for this data. What package it comes from, where the data is sourced from, and how to see and use the data.

This data comes from the package tidyr, so we won't need any other libraries other than the tidyverse.

Exercise 3

Let's take a look at our data.

Type cms_patient_experience and hit "Run Code".

cms_patient_experience

cms_patient_experience

cms_patient_experience is a tibble with 5 columns and 500 observations.

Exercise 4

We will want to see 12 rows so we can see how this data is not tidy. You can do that with the function print(). Inside print() we need to specify x and n. x will be the data, and n will be the number of rows we see. Run print() with x equal to cms_patient_experience and n equal to 12.

print(x = ..., 
      ... = 12)

print(x = cms_patient_experience, 
      n = 12)

This data is not tidy! In the column measure_cd, we have 5 repeating rows that would work much better as separate columns.

Exercise 5

pivot_wider() includes many of the same arguments as pivot_longer(). Remember how we used names_to and values_to to make more columns from all the extra rows? Well, for pivot_wider(), we just have to do the opposite. We will use names_to and values_to to add more columns to this dataset.

Pipe cms_patient_experience to pivot_wider(), setting names_from equal to measure_cd and values_from equal to prf_rate.

cms_patient_experience |> 
  pivot_wider(... = measure_cd,
              values_from = ...)

cms_patient_experience |> 
  pivot_wider(names_from = measure_cd,
              values_from = prf_rate)

Exercise 6

This data looks a little bit tidier, but there are still repeated rows in org_nm. We need to tell pivot_wider to uniquely identify the rows in every column (getting rid of duplicates). The columns we need to target are org_pac_id and org_nm. We can tell pivot_wider() to target these rows by adding id_cols = starts_with("org") to pivot_wider().

cms_patient_experience |> 
  pivot_wider(names_from = measure_cd,
              values_from = prf_rate,
              id_cols = ...)

cms_patient_experience |> 
  pivot_wider(names_from = measure_cd,
              values_from = prf_rate,
              id_cols = starts_with("org"))

This will get rid of any duplicate rows in the columns that start with "org'. And as we can see, there are many duplicates. id_cols cut the row count from 500 to 95.

Babynames

The babynames data set comes from the SSA (Social Security Administration). After 1986, all baby's were required to be given a social security number. This data contains all the names, year born, and sex of the babies. The number of names is counted and is converted into a proportion. We will be using this data to analyze the popularity of top 10 names each year.

Let's create this graph:

baby_p <- babynames |>
  group_by(year, sex) |> 
  top_n(n = 10) |>
  ungroup() |> 
  summarise(total = sum(prop), .by = c(sex, year)) |>
  ggplot(mapping = aes(x = year, y = total, color = sex)) +
    geom_point() +
    scale_y_continuous(labels = scales::percent) +
    labs(title = "Total Popularity of Top 10 Names",
         subtitle = "The most popular names are beccoming much less popular",
         x = "Year",
         y = "Top 10 Names as Percentage of All Names") 

baby_p

Exercise 1

Use library() to load the babynames package.

library(...)

library(babynames)

Learn more about the package from its website.

Exercise 2

Recall that your Console and the exercise code chunks for this tutorial use different R sessions. The fact that you loaded babynames in an exercise code chunk does not mean that it is loaded in your Console.

In your Console, after loading the babynames package, look up the help page for the babynames tibble with ?babynames. Copy/paste the "Format" paragraph.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

The variable prop is more subtle than it might, at first, appear. It is the proportion, within a given year and sex, of the total number of babies who have that given name.

Exercise 3

Run summary() on the babynames tibble.

summary(...)

Note that we have no missing values.

Exercise 4

Print out babynames

babynames

There are almost 2 million observations.

Exercise 5

Pipe babynames to group_by(year, sex). This will lead future commands in the pipe to do all calculations on a per year/sex basis, at least until we ungroup() the data.

babynames |> 
  group_by(year, sex)

Modern R code rarely uses group_by() anymore, prefering to .by/by arguments to Tidyverse functions like summarise().

Exercise 6

Continue the pipe with top_n(n = 10).

... |> 
  top_n(n = 10)

This pulls out the 10 most popular names, for each year and, within each year, for each sex.

Note that there should be 2,760 rows in the resulting tibble: 138 years times 10 names times 2 sexes. Instead we have 2,761 rows. Can you figure out why? Does it mess up our plot?

Exercise 7

Continue the pipe with ungroup().

... |> 
  ungroup()

One reason that group_by() is being phased out is that, unless you remember to ungroup() in the pipe, you will get weird results and/or errors. ungroup() removes the grouping variables which you had placed on the tibble previously.

Exercise 8

Continue the pipe with summarise(), using the argument total = sum(prop).

... |>
  summarise(total = sum(prop))

This produces a somewhat nonsense result. The sum of all the values of prop for all these years and both sexes does not really mean anything.

Exercise 9

Edit the call to the summarise() by adding .by argument with value c(sex, year).

... |> 
  summarise(total = sum(prop), .by = c(...))

This generates the data which we want to plot.

Exercise 10

Continue your pipe with ggplot(). Within aes(), map year to the x-axis, total to the y-axis, and sex to color. Also add geom_point().

...|>
  ggplot(mapping = aes(x = ..., y = ..., color = ...)) +
    geom_point()

Exercise 11

To finish your plot, use labs() to give the graph a title, subtitle, and axis labels.

... +
  labs(...)

Reminder: This is what your plot should look like.

baby_p

Can you see where that extra row we ignored earlier in the analysis messes things up for one point in the male set in the 1880s?

Summary

This tutorial covered Chapter 5: Data tidying from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We used the tidyr package to create "tidy" data, defined as:

Every column is a variable.

Every row is an observation.

Every cell is a single value.

Key functions included pivot_longer() and pivot_wider().

Read the Pivot vignette for more details.

Any scripts or data that you put into this service are public.

r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

r4ds.tutorials Tutorials for "R for Data Science"

Data tidying In r4ds.tutorials: Tutorials for "R for Data Science"

Introduction

Tidy Data

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Exercise 11

Exercise 12

Exercise 13

Exercise 14

Exercise 15

Exercise 16

Exercise 17

Exercise 18

Exercise 19

Lengthening Data

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Widening Data

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Babynames

Exercise 1

Exercise 2

Exercise 3

Exercise 4

Exercise 5

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Exercise 10

Exercise 11

Summary

Try the r4ds.tutorials package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

r4ds.tutorials
Tutorials for "R for Data Science"

Data tidying
In r4ds.tutorials: Tutorials for "R for Data Science"