```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_max = 10)
```
This vignette describes the use of the new `pivot_longer()` and `pivot_wider()` functions. Their goal is to improve the usability of `gather()` and `spread()`, and to incorporate state-of-the-art features found in other packages.

For some time, it's been obvious that there is something fundamentally wrong with the design of `spread()` and `gather()`. Many people don't find the names intuitive, and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
There are two important new features inspired by other R packages that have been advancing reshaping in R:

* `pivot_longer()` can work with multiple value variables that may have different types, inspired by the enhanced `melt()` and `dcast()` functions provided by the data.table package by Matt Dowle and Arun Srinivasan.

* `pivot_longer()` and `pivot_wider()` can take a data frame that specifies precisely how metadata stored in column names becomes data variables (and vice versa), inspired by the cdata package by John Mount and Nina Zumel.
In this vignette, you'll learn the key ideas behind `pivot_longer()` and `pivot_wider()` as you see them used to solve a variety of data reshaping challenges ranging from simple to complex.

To begin we'll load some needed packages. In real analysis code, I'd imagine you'd do this with `library(tidyverse)`, but I can't do that here since this vignette is embedded in a package.

```r
library(tidyr)
library(dplyr)
library(readr)
```
`pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns. I don't believe it makes sense to describe a dataset as being in "long form". Length is a relative term, and you can only say (e.g.) that dataset A is longer than dataset B.
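As a minimal sketch of what "longer" means here (using a small made-up tibble, not one of the vignette's datasets):

```r
df <- tibble(id = c("a", "b"), x = 1:2, y = 3:4, z = 5:6)

df %>% pivot_longer(!id, names_to = "name", values_to = "value")
# the 2 x 4 input becomes a 6 x 3 output: more rows, fewer columns
```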
`pivot_longer()` is commonly needed to tidy wild-caught datasets as they often optimise for ease of data entry or ease of comparison rather than ease of analysis. The following sections show how to use `pivot_longer()` for a wide range of realistic datasets.
The `relig_income` dataset stores counts based on a survey which (among other things) asked people about their religion and annual income:

```r
relig_income
```

This dataset contains three variables: `religion`, stored in the rows; `income`, spread across the column names; and `count`, stored in the cell values. To tidy it we use `pivot_longer()`:

```r
relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")
```
* The first argument is the dataset to reshape, `relig_income`.

* The second argument describes which columns need to be reshaped. In this case, it's every column apart from `religion`.

* `names_to` gives the name of the variable that will be created from the data stored in the column names, i.e. `income`.

* `values_to` gives the name of the variable that will be created from the data stored in the cell values, i.e. `count`.

Neither the `names_to` nor the `values_to` column exists in `relig_income`, so we provide them as character strings surrounded by quotes.
The `billboard` dataset records the billboard rank of songs in the year 2000. It has a form similar to the `relig_income` data, but the data encoded in the column names is really a number, not a string.

```r
billboard
```

We can start with the same basic specification as for the `relig_income` dataset. Here we want the names to become a variable called `week`, and the values to become a variable called `rank`. I also use `values_drop_na` to drop rows that correspond to missing values. Not every song stays in the charts for all 76 weeks, so the structure of the input data forces the creation of unnecessary explicit `NA`s.

```r
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
```
It would be nice to easily determine how long each song stayed in the charts, but to do that, we'll need to convert the `week` variable to an integer. We can do that by using two additional arguments: `names_prefix` strips off the `wk` prefix, and `names_transform` converts `week` into an integer:

```r
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    names_transform = list(week = as.integer),
    values_to = "rank",
    values_drop_na = TRUE
  )
```

Alternatively, you could do this with a single argument by using `readr::parse_number()`, which automatically strips non-numeric components:

```r
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_transform = list(week = readr::parse_number),
    values_to = "rank",
    values_drop_na = TRUE
  )
```
A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the `who` dataset:

```r
who
```

`country`, `iso2`, `iso3`, and `year` are already variables, so they can be left as is. But the columns from `new_sp_m014` to `newrel_f65` encode four variables in their names:
* The `new_`/`new` prefix indicates these are counts of new cases. This dataset only contains new cases, so we'll ignore it here because it's constant.

* `sp`/`rel`/`ep` describe how the case was diagnosed.

* `m`/`f` gives the gender.

* `014`/`1524`/`2534`/`3544`/`4554`/`5564`/`65` supplies the age range.
We can break these variables up by specifying multiple column names in `names_to`, and then either providing `names_sep` or `names_pattern`. Here `names_pattern` is the most natural fit. It has a similar interface to `extract()`: you give it a regular expression containing groups (defined by `()`) and it puts each group in a column.

```r
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )
```
We could go one step further and use readr functions to convert the gender and age to factors. I think this is good practice when you have categorical variables with a known set of values.

```r
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    names_transform = list(
      gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
      age = ~ readr::parse_factor(
        .x,
        levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
        ordered = TRUE
      )
    ),
    values_to = "count"
  )
```
So far, we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row. You can usually recognise this case because the name of the column that you want to appear in the output is part of the column name in the input. In this section, you'll learn how to pivot this sort of data.

The following example is adapted from the data.table vignette, which served as the inspiration for tidyr's solution to this problem.

```r
family <- tribble(
  ~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
  1L, "1998-11-26", "2000-01-29", 1L, 2L,
  2L, "1996-06-22", NA, 2L, NA,
  3L, "2002-07-11", "2004-04-05", 2L, 2L,
  4L, "2004-10-10", "2009-08-27", 1L, 1L,
  5L, "2000-12-05", "2005-02-28", 2L, 1L
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)
family
```
Note that we have two pieces of information (or values) for each child: their `gender` and their `dob` (date of birth). These need to go into separate columns in the result. Again we supply multiple variables to `names_to`, using `names_sep` to split up each variable name. Note the special name `.value`: this tells `pivot_longer()` that that part of the column name specifies the "value" being measured (which will become a variable in the output).

```r
family %>%
  pivot_longer(
    !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```
Note the use of `values_drop_na = TRUE`: the input shape forces the creation of explicit missing values for observations that don't exist.
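For comparison, here is a sketch of the same pivot with `values_drop_na` left at its default, so those explicit missings remain visible in the output:

```r
family %>%
  pivot_longer(
    !family,
    names_to = c(".value", "child"),
    names_sep = "_"
  )
# family 2 keeps a child2 row whose dob and gender are both NA
```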
This problem also exists in the `anscombe` dataset built in to base R:

```r
anscombe
```

This dataset contains four pairs of variables (`x1` and `y1`, `x2` and `y2`, etc) that underlie Anscombe's quartet, a collection of four datasets that have the same summary statistics (mean, sd, correlation etc), but quite different data. We want to produce a dataset with columns `set`, `x` and `y`.

```r
anscombe %>%
  pivot_longer(
    everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  ) %>%
  arrange(set)
```
A similar situation can arise with panel data. For example, take this example dataset provided by Thomas Leeper. We can tidy it using the same approach as for `anscombe`:

```r
pnl <- tibble(
  x = 1:4,
  a = c(1, 1, 0, 0),
  b = c(0, 1, 1, 1),
  y1 = rnorm(4),
  y2 = rnorm(4),
  z1 = rep(3, 4),
  z2 = rep(-2, 4)
)

pnl %>%
  pivot_longer(
    !c(x, a, b),
    names_to = c(".value", "time"),
    names_pattern = "(.)(.)"
  )
```
Occasionally you will come across datasets that have duplicated column names. Generally, such datasets are hard to work with in R, because when you refer to a column by name it only finds the first match. To create a tibble with duplicated names, you have to explicitly opt out of the name repair that usually prevents you from creating such a dataset:

```r
df <- tibble(id = 1:3, y = 4:6, y = 5:7, y = 7:9, .name_repair = "minimal")
df
```

When `pivot_longer()` encounters such data, it automatically adds another column to the output:

```r
df %>% pivot_longer(!id, names_to = "name", values_to = "value")
```
A similar process is applied when multiple input columns are mapped to the same output column, as in the following example where we ignore the numeric suffix on each column name:
```r
df <- tibble(id = 1:3, x1 = 4:6, x2 = 5:7, y1 = 7:9, y2 = 10:12)
df %>% pivot_longer(!id, names_to = ".value", names_pattern = "(.).")
```
`pivot_wider()` is the opposite of `pivot_longer()`: it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It's relatively rare to need `pivot_wider()` to make tidy data, but it's often useful for creating summary tables for presentation, or data in a format needed by other tools.
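As a minimal sketch of the inverse operation (again with a small made-up tibble):

```r
df <- tibble(
  id = c("a", "a", "b", "b"),
  name = c("x", "y", "x", "y"),
  value = 1:4
)

df %>% pivot_wider(names_from = name, values_from = value)
# the 4-row input becomes a 2-row output, with one column per distinct name
```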
The `fish_encounters` dataset, contributed by Myfanwy Johnston, describes when fish swimming down a river are detected by automatic monitoring stations:

```r
fish_encounters
```

Many tools used to analyse this data need it in a form where each station is a column:

```r
fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
```

This dataset only records when a fish was detected by the station; it doesn't record when it wasn't detected (this is common with this type of data). That means the output data is filled with `NA`s. However, in this case we know that the absence of a record means that the fish was not `seen`, so we can ask `pivot_wider()` to fill these missing values in with zeros:

```r
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )
```
You can also use `pivot_wider()` to perform simple aggregation. For example, take the `warpbreaks` dataset built in to base R (converted to a tibble for the better print method):

```r
warpbreaks <- warpbreaks %>%
  as_tibble() %>%
  select(wool, tension, breaks)
warpbreaks
```

This is a designed experiment with nine replicates for every combination of `wool` (`A` and `B`) and `tension` (`L`, `M`, `H`):

```r
warpbreaks %>% count(wool, tension)
```
What happens if we attempt to pivot the levels of `wool` into the columns?

```r
warpbreaks %>% pivot_wider(names_from = wool, values_from = breaks)
```

We get a warning that each cell in the output corresponds to multiple cells in the input. The default behaviour produces list-columns, which contain all the individual values. A more useful output would be summary statistics, e.g. the mean breaks for each combination of wool and tension:

```r
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = list(breaks = mean)
  )
```
For more complex summary operations, I recommend summarising before reshaping, but for simple cases it's often convenient to summarise within `pivot_wider()`.
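For example, one sketch of the summarise-first approach, which should produce the same table as the `values_fn` version above:

```r
warpbreaks %>%
  group_by(wool, tension) %>%
  summarise(breaks = mean(breaks)) %>%
  ungroup() %>%
  pivot_wider(names_from = wool, values_from = breaks)
```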
Imagine, as in https://stackoverflow.com/questions/24929954, that we have information containing the combination of product, country, and year. In tidy form it might look like this:
```r
production <- expand_grid(
  product = c("A", "B"),
  country = c("AI", "EI"),
  year = 2000:2014
) %>%
  filter((product == "A" & country == "AI") | product == "B") %>%
  mutate(production = rnorm(nrow(.)))
production
```
We want to widen the data so we have one column for each combination of `product` and `country`. The key is to specify multiple variables for `names_from`:

```r
production %>%
  pivot_wider(
    names_from = c(product, country),
    values_from = production
  )
```
When either `names_from` or `values_from` selects multiple variables, you can control how the column names in the output are constructed with `names_sep` and `names_prefix`, or the workhorse `names_glue`:

```r
production %>%
  pivot_wider(
    names_from = c(product, country),
    values_from = production,
    names_sep = ".",
    names_prefix = "prod."
  )

production %>%
  pivot_wider(
    names_from = c(product, country),
    values_from = production,
    names_glue = "prod_{product}_{country}"
  )
```
The `us_rent_income` dataset contains information about median income and rent for each state in the US for 2017 (from the American Community Survey, retrieved with the tidycensus package).

```r
us_rent_income
```

Here both `estimate` and `moe` are values columns, so we can supply them to `values_from`:

```r
us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
```
Note that the name of the variable is automatically appended to the output columns.
A final challenge is inspired by Jiena Gu. Imagine you have a contact list that you've copied and pasted from a website:
```r
contacts <- tribble(
  ~field, ~value,
  "name", "Jiena McLellan",
  "company", "Toyota",
  "name", "John Smith",
  "company", "google",
  "email", "john@google.com",
  "name", "Huxley Ratcliffe"
)
```
This is challenging because there's no variable that identifies which observations belong together. We can fix this by noting that every contact starts with a name, so we can create a unique id by counting every time we see "name" as the `field`:

```r
contacts <- contacts %>%
  mutate(person_id = cumsum(field == "name"))
contacts
```
Now that we have a unique identifier for each person, we can pivot `field` and `value` into the columns:

```r
contacts %>% pivot_wider(names_from = field, values_from = value)
```
Some problems can't be solved by pivoting in a single direction. The examples in this section show how you might combine `pivot_longer()` and `pivot_wider()` to solve more complex problems.
`world_bank_pop` contains data from the World Bank about population per country from 2000 to 2017.

```r
world_bank_pop
```
My goal is to produce a tidy dataset where each variable is in a column. It's not obvious exactly what steps are needed yet, but I'll start with the most obvious problem: year is spread across multiple columns.
```r
pop2 <- world_bank_pop %>%
  pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value")
pop2
```

Next we need to consider the `indicator` variable:

```r
pop2 %>% count(indicator)
```
Here `SP.POP.GROW` is population growth, `SP.POP.TOTL` is total population, and `SP.URB.*` are the same but only for urban areas. Let's split this up into two variables: `area` (total or urban) and the actual variable (population or growth):

```r
pop3 <- pop2 %>%
  separate(indicator, c(NA, "area", "variable"))
pop3
```
Now we can complete the tidying by pivoting `variable` and `value` to make `TOTL` and `GROW` columns:

```r
pop3 %>% pivot_wider(names_from = variable, values_from = value)
```
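The three steps above can also be chained into a single pipeline:

```r
world_bank_pop %>%
  pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value") %>%
  separate(indicator, c(NA, "area", "variable")) %>%
  pivot_wider(names_from = variable, values_from = value)
```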
Based on a suggestion by Maxime Wack (https://github.com/tidyverse/tidyr/issues/384), the final example shows how to deal with a common way of recording multiple choice data. Often you will get such data as follows:

```r
multi <- tribble(
  ~id, ~choice1, ~choice2, ~choice3,
  1, "A", "B", "C",
  2, "C", "B", NA,
  3, "D", NA, NA,
  4, "B", "D", NA
)
```
But the actual order isn't important, and you'd prefer to have the individual questions in the columns. You can achieve the desired transformation in two steps. First, you make the data longer, eliminating the explicit `NA`s, and adding a column to indicate that this choice was chosen:

```r
multi2 <- multi %>%
  pivot_longer(!id, values_drop_na = TRUE) %>%
  mutate(checked = TRUE)
multi2
```
Then you make the data wider, filling in the missing observations with `FALSE`:

```r
multi2 %>%
  pivot_wider(
    id_cols = id,
    names_from = value,
    values_from = checked,
    values_fill = FALSE
  )
```
The arguments to `pivot_longer()` and `pivot_wider()` allow you to pivot a wide range of datasets. But the creativity that people apply to their data structures is seemingly endless, so it's quite possible that you will encounter a dataset that you can't immediately see how to reshape with `pivot_longer()` and `pivot_wider()`. To gain more control over pivoting, you can instead create a "spec" data frame that describes exactly how data stored in the column names becomes variables (and vice versa). This section introduces you to the spec data structure, and shows you how to use it when `pivot_longer()` and `pivot_wider()` are insufficient.
To see how this works, let's return to the simplest case of pivoting applied to the `relig_income` dataset. Now pivoting happens in two steps: we first create a spec object (using `build_longer_spec()`), then use that to describe the pivoting operation:

```r
spec <- relig_income %>%
  build_longer_spec(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )

pivot_longer_spec(relig_income, spec)
```
(This gives the same result as before, just with more code. There's no need to use a spec here; it is presented only as a simple example of the `spec` structure.)
What does `spec` look like? It's a data frame with one row for each column in the wide format version of the data that is not present in the long format, and two special columns that start with `.`:

* `.name` gives the name of the column.
* `.value` gives the name of the column that the values in the cells will go into.

There is also one column in `spec` for each column present in the long format of the data that is not present in the wide format of the data. This corresponds to the `names_to` argument in `pivot_longer()` and `build_longer_spec()` and the `names_from` argument in `pivot_wider()` and `build_wider_spec()`. In this example, the `income` column is a character vector of the names of columns being pivoted.

```r
spec
```
Below we widen `us_rent_income` with `pivot_wider()`. The result is ok, but I think it could be improved:

```r
us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
```
I think it would be better to have columns `income`, `rent`, `income_moe`, and `rent_moe`, which we can achieve with a manual spec. The current spec looks like this:

```r
spec1 <- us_rent_income %>%
  build_wider_spec(names_from = variable, values_from = c(estimate, moe))
spec1
```
For this case, we mutate `spec1` to carefully construct the column names:

```r
spec2 <- spec1 %>%
  mutate(.name = paste0(variable, ifelse(.value == "moe", "_moe", "")))
spec2
```
Supplying this spec to `pivot_wider_spec()` gives us the result we're looking for:

```r
pivot_wider_spec(us_rent_income, spec2)
```
Sometimes it's not possible (or not convenient) to compute the spec, and instead it's more convenient to construct the spec "by hand". For example, take this `construction` data, which is lightly modified from Table 5 "completions" found at https://www.census.gov/construction/nrc/index.html:

```r
construction
```

This sort of data is not uncommon from government agencies: the column names actually belong to different variables, and here we have summaries for the number of units (1, 2-4, 5+) and regions of the country (NE, MW, S, W). We can most easily describe that with a tibble:

```r
spec <- tribble(
  ~.name, ~.value, ~units, ~region,
  "1 unit", "n", "1", NA,
  "2 to 4 units", "n", "2-4", NA,
  "5 units or more", "n", "5+", NA,
  "Northeast", "n", NA, "Northeast",
  "Midwest", "n", NA, "Midwest",
  "South", "n", NA, "South",
  "West", "n", NA, "West"
)
```
Which yields the following longer form:

```r
pivot_longer_spec(construction, spec)
```
Note that there is no overlap between the `units` and `region` variables; here the data would really be most naturally described in two independent tables.
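One way to make that split concrete (a sketch, using the column names from the spec above):

```r
long <- pivot_longer_spec(construction, spec)

# Unit-size breakdown and regional breakdown as separate tables
by_units  <- long %>% filter(!is.na(units)) %>% select(-region)
by_region <- long %>% filter(!is.na(region)) %>% select(-units)
```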
One neat property of the spec is that you need the same spec for `pivot_longer()` and `pivot_wider()`. This makes it very clear that the two operations are symmetric:

```r
construction %>%
  pivot_longer_spec(spec) %>%
  pivot_wider_spec(spec)
```
The pivoting spec allows us to be more precise about exactly how `pivot_longer_spec(df, spec)` changes the shape of `df`. If `k` is the number of distinct values in `spec$.value` (and ignoring `values_drop_na`), the result has `nrow(df) * nrow(spec) / k` rows, and `ncol(df) - nrow(spec) + ncol(spec) - 2 + k` columns.
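As a sanity check of that arithmetic, the `relig_income` spec from the start of this section has a single distinct `.value` (`count`):

```r
spec <- relig_income %>%
  build_longer_spec(!religion, names_to = "income", values_to = "count")
long <- pivot_longer_spec(relig_income, spec)

nrow(long)  # nrow(relig_income) * nrow(spec) = 18 * 10 = 180
ncol(long)  # 3: religion, income, count
```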