library(learnr) library(tutorial.helpers) library(tidyverse) library(tidyverse) library(repurrrsive) library(jsonlite) knitr::opts_chunk$set(echo = FALSE) knitr::opts_chunk$set(out.width = '90%') options(tutorial.exercise.timelimit = 60, tutorial.storage = "local") x1 <- list(1:4, "a", TRUE) x2 <- list(a = 1:2, b = 1:3, c = 1:4) x5 <- list(1, list(2, list(3, list(4, list(5))))) df <- tibble( x = 1:2, y = c("a", "b"), z = list(list(1, 2), list(3, 4, 5)) ) df1 <- tribble( ~x, ~y, 1, list(a = 11, b = 12), 2, list(a = 21, b = 22), 3, list(a = 31, b = 32) ) df2 <- tribble( ~x, ~y, 1, list(11, 12, 13), 2, list(21), 3, list(31, 32) ) df6 <- tribble( ~x, ~y, "a", list(1, 2), "b", list(3), "c", list() ) gh_users2 <- read_json(gh_users_json()) json <- '[ {"name": "John", "age": 34}, {"name": "Susan", "age": 27} ]' json1 <- '{ "status": "OK", "results": [ {"name": "John", "age": 34}, {"name": "Susan", "age": 27} ] } ' df3 <- tibble(json = parse_json(json)) df4 <- tibble(json1 = list(parse_json(json1))) repos <- tibble(json = gh_repos) chars <- tibble(json = got_chars) locations <- gmaps_cities |> unnest_wider(json) |> select(-status) |> unnest_longer(results) |> unnest_wider(results)
This tutorial covers Chapter 23: Hierarchical data from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to work with non-rectanglar data using packages like jsonlite.
So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type.If you want to store elements of different types in the same vector, you’ll need a list, which you create with list()
.
To create our first list, let's create a variable x1
and assign the list()
function to it. We will pass in three different data types within list()
: the range of integers 1:4
to generate four integers, the string "a"
, and the Boolean value TRUE
. On a new line, we'll call the variable x1
.
x1 <- list(...,"a",...) x1
When you run x1
, you will get all the data types that were passed in, along with their corresponding indices. If you want to access a specific data type by its index, such as finding the element in the first index of x1
, you can use the square brackets notation: x1[1]
. This will only return the integers.
It's indeed convenient to name the components, or children, of a list. To accomplish this, let's create a new variable x2
and assign the list()
function to it. Within list()
, we'll define three columns named a
, b
, and c
. We'll assign 1:2
to a
, 1:3
to b
, and 1:4
to c
. On a new line call x2
.
... <- list(a = ..., b = ..., ... = 1:4) x2
When you run x2
, you will see that each column contains different values. To access individual columns of the list, you can use the $
operator since we have named the columns. For example, to access the column a from x2
, you can use x2$a
.
Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is str()
, which generates a compact display of the structure, de-emphasizing the contents.
Run str()
and pass in x1
and on a new line run str()
again and pass in x2
.
str(...) str(...)
As you can see, str()
displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.
Lists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures.
To create a representation of hierarchical structures, let's create a variable x3
and assign list()
to it. Within list()
, we will pass in list(1,2)
and list(3,4)
. This means we will have two lists nested inside a single list. Then on a new line run str()
and pass in x3
.
... <- list(list(...), ...(3, 4)) str(...)
This is notably different to c()
, which generates a flat vector like when we run c(c(1, 2), c(3, 4))
, we get #> [1] 1 2 3 4
which is flat.
Now what if we pass in two list in a vector, will we get a hierarchical vector or a flat vector?
To find out, create a new variable x4
and assign it to vector function c()
. Within c()
pass in list(1,2)
and pass in list(3,4)
. On a new line, run str()
and pass in x4
.
... <- c(list(...), ...(3, ...)) str(...)
Even when we pass in lists, we get no hierarchical data and just get the numbers in a row-like order.
As lists get more complex, str()
gets more useful, as it lets you see the hierarchy at a glance.
Run x5
which is very complex list.
x5
When we ran x5
we get some many duplicate numbers in brackets and it's all confusing and complex. This is where str()
comes to shine.
Now run str(x5)
.
str(x5)
Now this dosen't mean in you should always use str()
to visualize because as lists get even larger and more complex, str()
eventually starts to fail, and you’ll need to switch to View()
.
The image below show the result of calling view(x5)
.
knitr::include_graphics("images/View-1-min.png")
The RStudio view()
lets you interactively explore a complex list. The viewer opens showing only the top level of the list.
Clicking on the rightward facing triangle expands that component of the list so that you can also see its children like the image below.
knitr::include_graphics("images/View-2-min.png")
You can repeat this operation as many times as needed to get to the data you’re interested in. Note the bottom-left corner: if you click an element of the list, RStudio will give you the sub-setting code needed to access it, in this case x5[[2]][[2]][[2]]
Like said previously, the image below is what it looks like if we repeat the operation.
knitr::include_graphics("images/View-3-min.png")
Now that's we have knowledge of hierarchical data, let's move on to list-columns.
Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to place objects in a tibble that wouldn’t usually belong in there.
To demonstrate a simple example of list-column, create a variable df
and assign tibble()
to it. Within df
we will have three columns: x
, y
, z
. Pass in 1:2
for x
, c("a","b")
for y
, and list(list(1,2),list(3,4,5))
for z
. On a new line run df
.
... <- tibble( ... = 1:2, y = c("a", "b"), ... = list(list(1,...), ...(3, ..., 5)) ) df
In particular, list-columns are used a lot in the tidymodels ecosystem, because they allow you to store things like model outputs or resamples in a data frame.
There’s nothing special about lists in a tibble; they behave like any other column. To show this, start a pipe with df
to filter()
and we want all values where x == 1
only.
df |> filter(x == 1)
Computing with list-columns is harder, but that’s because computing with lists is harder in general.
Good work on finishing this section, next up, we'll be focusing on unnesting list-columns into regular variables.
Now that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.
To generate simple sample data for this section, let's create a variable named df1
and assign tribble()
to it. Within tribble()
, we will define the columns ~x
and ~y
. Now, to arrange the data in a specific order, follow these steps:
Set the value as 1
for ~x
and the list as list(a = 11, b = 12)
for ~y
.
Repeat the above steps as needed for the rest.
Set the value as 2
and the list as list(a = 21, b = 22)
.
Set the value as 3
and the list as list(a = 31, b = 32)
.
... <- tribble( ~x, ~y, 1, ..., ..., list(a = 21,...), 3, list(..., b = 32), )
List-columns tend to come in two basic forms: named and unnamed. When the children are named, they tend to have the same names in every row. For example, in df1
, every element of list-column y
has two elements named a
and b
. Named list-columns naturally unnest into columns: each named element becomes a new named column.
Let's now create our second simple dataset. Copy the previous code and change the variable name to df2
and the other thing we will change is ~y
. Remove all the old ones and insert the following values:
list(11, 12, 13)
,
list(21)
, and
list(31, 32)
.
df2 <- tribble( ~x, ..., ..., list(..., 12, 13), 2, list(...), ..., list(31, ...), )
When the children are unnamed, the number of elements tends to vary from row-to-row. For example, in df2
, the elements of list-column y
are unnamed and vary in length from one to three. Now tidyr provides two functions for these two cases: unnest_wider()
and unnest_longer()
, which will help us explore the two sample data sets which we created.
Let's explore what unnest_wider()
does, so start a pipe with df1
to unnest_wider()
and pass in the column y
as the argument.
df1 |> unnest_wider(...)
When each row has the same number of elements with the same names, like df1
, it’s natural to put each component into its own column with unnest_wider()
.
By default, the names of the new columns come exclusively from the names of the list elements, but you can use the names_sep
argument to request that they combine the column name and the element name. Copy the previous code and within unnest_wider()
add names_sep
and set it equal to "_"
... |> unnest_wider(y, ... = "_")
Compared to just passing in y
in unnest_wider()
, we can now see the column name (y
) along with the element name(a
or b
). Now we will move into unnest_longer()
which is more useful for data sample with different amount of data for record.
Start a pipe with df2
to unnest_longer()
and pass in y
as the argument.
... |> unnest_longer(y)
When each row contains an unnamed list, it’s most natural to put each element into its own row with unnest_longer()
which is what happens in the code above. Also note how x
is duplicated for each element inside of y
: we get one row of output for each element inside the list-column.
But what happens if one of the elements is empty? To see what happens, let's first create another sample of data, so create a new variable df6
and assign it to tribble()
. Within tribble()
, we will define the columns ~x
and ~y
. Now, to arrange the data in a specific order, follow these steps:
Set the value as "a"
for ~x
and the list as list(1, 2)
for ~y
.
Repeat the above steps as needed for the rest.
Set the value as "b"
and the list as list(3)
.
Set the value as "c"
and the list as list()
.
... <- tribble( ~x, ~y, "...", list(..., 2), "b", list(3), ..., ... )
As you can see we have a empty list for record "c"
, next up we use unnest_longer()
to test it out.
Start a pipe with df6
to unnest_longer()
and pass in the y
column as the argument for unnest_longer()
.
... |> unnest_longer(y)
We get zero rows in the output for "c"
, so the row effectively disappears. If you want to preserve that row, adding NA in y
, set keep_empty = TRUE
.
Good work finishing unnesting section.
The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to unnest_longer()
and/or unnest_wider()
. To show that in action, this section works through three real rectangling challenges using datasets from the repurrrsive package.
We’ll start with gh_repos
. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; we recommend exploring a little on your own with View(gh_repos)
before we continue.
gh_repos
is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We'll call this column json
for reasons we’ll get to later. To do so, create a variable named repos
and assign it to tibble()
. Set the column name as json
and assign gh_repos
to json
. On a new line call repos
.
repos <- tibble(...= ...) reposS
This tibble contains 6 rows, one row for each child of gh_repos
. Each row contains a unnamed list with either 26 or 30 rows.
Since the lists are unnamed, we’ll start with unnest_longer()
to put each child in its own row: Start a pipe with repos
to unnest_longer()
and pass in json
as the argument.
repos |> ...(json)
At first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6) each element of json
is still a list.
There’s an important difference in the result of the code: now each element is a named list so we can use unnest_wider()
to put each element into its own column: Continue the pipe from unnest_longer()
in previous exercise to unnest_wider()
and pass in json
as the arguement.
... |> ...(json)|> unnest_wider(...)
This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them!
To print all the column names, continue the pipe from unnest_wider()
to names()
. To obtain only the first 10 names, further extend the pipe from names()
to head()
and provide the argument 10
.
... |> ...()|> head(...)
We have so many columns but let’s pull out a few that look interesting.
Delete the names()
and head()
functions and instead extend the pipe to select()
where we will pass in : id
, full_name
, owner
, description
which are the columns we will explore.
... |> unnest_longer(...) |> ...(json) |> select(..., full_name, ..., ...)
You can use this to work back to understand how gh_repos
was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
owner
is another list-column, and since it contains a named list, we can use unnest_wider()
to get at the values: Extend pipe from select()
to unnest_wider()
and pass in owner
. Just as a reminder, it is possible to get errors so just move on.
... |> unnest_longer(...) |> ...(json) |> select(..., full_name, ..., ...) |> unnest_wider(...)
Uh oh, this list column also contains an id
column and we can’t have two id
columns in the same data frame.
As suggested by Rstudio, let's use names_sep
to resolve the problem, so within unnest_wider()
add names_sep = "_"
.
... |> unnest_longer(...) |> ...(json) |> select(..., full_name, ..., ...) |> unnest_wider(..., ... = "_")
This gives another wide dataset, but you can get the sense that owner
appears to contain a lot of additional data about the person who “owns” the repository. Next up, we will move onto relational data and unnesting it.
Nested data is sometimes used to represent data that we’d usually spread across multiple data frames. For example, take got_chars
which contains data about characters that appear in the Game of Thrones books and TV series. Like gh_repos
it’s a list, so we start by turning it into a list-column of a tibble:
Create variable chars
and assign it to tibble()
. Within tibble()
, set the column name to json
and set json
equal to got_chars
. On a new line call chars
.
... <- ...(json = ...) chars
The json
column contains named elements, so we’ll start by widening it.
Start a pipe with chars
to unnest_wider()
and pass in json
.
... |> unnest_wider(...)
We have 18 columns and not all of them are releveant so let's only select few coloumns next up.
Extend the pipe from unnest_wider()
to select()
and pass in the following columns to select: id, name, gender, culture, born, died, alive
.
... |> ..._wider(json) |> select(..., name, ..., culture, born, ..., alive)
This dataset also contains many list-columns, so let's now select columns where there are records which include lists.
Modify select()
to select id
as well as any list-columns with where(is.list)
.
... |> ..._wider(json) |> select(id, where(...))
Let’s explore the titles
column. It’s an unnamed list-column, so we’ll unnest it into rows.
Modify select()
to only select id
and titles
. Then extend the pipe to unnest_longer()
and pass in titles
to unnest the titles.
... |> ..._wider(json) |> select(id, ...)|> unnest_...(titles)
You might expect to see this data in its own table because it would be easy to join to the characters data as needed. Let’s do that, which requires little cleaning.
Let's make the data to be in its own table so assign the whole code we wrote to titles
using <-
. In addition, let's clean the data so extend the pipe to filter()
and pass in titles != ""
which filters out all the missing values and then extend the pipe the to rename()
and pass in title = titles
. On a new line call the whole table using titles
.
... <- chars |> unnest_wider(...) |> select(..., titles) |> ...(titles) |> filter(titles != ...) |> ...(title = ...) titles
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
Good work unnesting relational data. Next up we will unnest deeply nested data.
We’ll finish off these case studies with a list-column that’s very deeply nested and requires repeated rounds of unnest_wider()
and unnest_longer()
to unravel: gmaps_cities
. This is a two column tibble containing five city names and the results of using Google’s geocoding API to determine their location:
Run gmaps_cities
.
gmaps_cities
json
is a list-column with internal names, so we start exploring json
with unnest_wider()
.
Start a pipe with gmaps_cities
to unnest_wider()
and pass in json
.
gmaps_cities |> unnest_wider(...)
This gives us the status and the results. We’ll drop the status
column since they’re all OK
; in a real analysis, you’d also want to capture all the rows where status != "OK"
and figure out what went wrong. results
is an unnamed list, with either one or two elements (we’ll see why shortly).
Extend the pipe to select()
and pass in -status
to remove the status
column and then extend the pipe from select()
to unnest_longer()
. Within unnest_longer()
pass in results
.
... |> ...(json) |> select(-...) |> unnest_longer(...)
Now results is a named list, so we’ll use unnest_wider()
.
Extend the pipe to unnest_wider()
and pass in results
, assign the whole code to locations
using <-
and call locations
on a new line.
locations <- ... |> ...(json) |> select(-...) |> unnest_longer(...) |> unnest_wider(json) locations
Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.
There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column. Start a new pipe with locations
to select()
and we will select the following: city
, formatted_address
, and geometry
. Then we will extend the pipe to unnest_wider()
and pass in geometry
to extend the list.
... |> ...(city, formatted_address, geometry) |> unnest_wider(...)
After unnesting geometry
it gives us new bounds (a rectangular region) and location (a point). We can unnest location to see the latitude (lat) and longitude (lng).
Continue the pipe and extend it from unnest_wider()
to unnest_wider()
again and pass in location
to unnest location
.
... |> ...(city, formatted_address, geometry) |> unnest_wider(...)|> unnest_...(...)
Now we got the lat
and lng
of the location. Next we will extract the bounds
.
To unnest the bounds
, we first need to select the columns which we need, so after unnest_wider(geometry)
, delete the unnest_wider(location)
and instead continue the pipe with select()
and pass in !location:viewport
which excludes the columns from location to viewport. Then add the unnest_wider()
with bounds
as the arguement.
... |> ...(city, formatted_address, geometry) |> unnest_wider(...)|> select(!...:...)|> unnest_...(...)
Now we get two columns called northeast
and southwest
.
We then rename southwest and northeast (the corners of the rectangle) so we can use names_sep
to create short but evocative names when unnesting them.
Extend the pipe from unnest_wider(bounds)
to rename()
and pass in ne = northeast
and sw = southwest
. Then extend the pipe again to unnest_wider()
and pass in a vector c(ne, sw)
and also pass in names_sep = "_"
.
Note how we unnest two columns simultaneously by supplying a vector of variable names to unnest_wider()
.
Now that you’ve discovered the path to get to the components you’re interested in, you can extract them directly using another tidyr function, hoist()
:
Start a new pipe with locations
to select()
and select the following: city
, formatted_address
and geometry
, then continue the pipe from select()
to hoist()
and pass in the following:
geometry
,
ne_lat = c("bounds", "northeast", "lat")
,
sw_lat = c("bounds", "southwest", "lat")
,
ne_lng = c("bounds", "northeast", "lng")
,
sw_lng = c("bounds", "southwest", "lng")
.
... |> select(city, ..., geometry) |> ...( geometry, ne_lat = c("bounds", "northeast", "lat"), sw_lat = c("bounds", "southwest", "..."), ne_lng = c("bounds", "northeast", "lng"), sw_lng = c("bounds", "southwest", "lng"), )
If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in vignette("rectangling", package = "tidyr")
.
All of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for javascript object notation and is the way that most web APIs return data. It’s important to understand it because while JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON if things go wrong.
To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: read_json()
and parse_json()
. The repurrsive package also provides the source for gh_user
as a JSON file and you can read it with read_json()
.
Run gh_users_json()
which gives you the path to the json file.
gh_users_json()
Next up we will read the json file.
Create a new variable gh_user2
and assign it to the read_json()
function and pass in the gh_users_json()
function as argument.
gh_users2 <- read_json(...)
JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:
Inf
, -Inf
, or NaN
TRUE
and FALSE
, but uses lowercase true
and false
.Run identical()
and pass in gh_users
and gh_users2
.
identical(gh_users, gh_users2)
In this course, we’ll also use parse_json()
, since it takes a string containing JSON, which makes it good for generating simple examples.
To get started, let's parse a string which contains the number 1. So run str()
since we want a string output, then pass in parse_json()
and within that pass in "1"
.
str(..._...("1"))
What will the output look like if we have multiple numbers in an array?
Copy the previous code, and change the argument within parse_json()
to a string which contains the array [1,2,3]
.
str(..._join("[...,...,...]"))
The way we get the output is like a list. Now what happens if we mapped a variable to the array?
Copy the previous code, and within parse_json()
change the arguement to a string which looks like {"x": [1,2,3]}
.
...(parse_json('{"...": [1, 2, 3]}'))
JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.
In most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g., multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with tibble(json)
so that each element becomes a row.
Run json
to have a quick look at the data before we dive in to parsing it and unnesting it.
json
jsonlite has another important function called fromJSON()
. We don’t use it here because it performs automatic simplification (simplifyVector = TRUE)
. This often works well, particularly in simple cases, but we think you’re better off doing the rectangling yourself so you know exactly what’s happening and can more easily handle the most complicated nested structures.
Now, let's transform the JSON to a list and store it within a tibble. Create a variable called df3
by using tibble()
. Within the call to tibble()
, we will have a json
column where json
is equal to parse_json(json)
.
df3 <- tibble(... = parse_json(...))
Yes, we have the data in tibble but we want to see the actual values instead of just seeing the amount of data in list being shown.
To make the names and age visible, we will start a pipe with df3
to unnest_wider()
and pass in json
as the argument.
df3 |> unnest_...(json)
In rarer cases, the JSON file consists of a single top-level JSON object, but what do we do to explore the data if we have many top level objects?
Run json1
to explore the dataset.
json1
Compared to the json file we explored previously, we have two levels: status
and results
.
Let's now parse the json1
and put it in a tibble. Create a variable df4
and assign it to tibble()
. Within tibble()
, set the column json1
equal to list(parse_json(json1))
. On a new line call df4
.
df4 <- ...(json ... list(parse_json(...)))
Both arrays and objects are similar to lists in R; the difference is whether or not they’re named. An array is like an unnamed list, and is written with []
. For example [1, 2, 3]
is an array containing 3 numbers, and [null, 1, "string", false]
is an array that contains a null, a number, a string, and a boolean.
Start a pipe with df4
to unnest_wider()
and pass in json1
as the argument which will unnest the top level objects status
and results
.
... |> unnest_wider(...)
An object is like a named list, and is written with {}
. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, {"x": 1, "y": 2}
is an object that maps x to 1
and y to 2
.
Copy the previous code and continue the pipe from unnest_wider()
to unnest_longer()
. Within unnest_longer()
pass in results
which will unnest the columns name
and age
in results
.
... |> unnest_wider(...) |> unnest_...(...)
If you want to explore more about jsonlite check out their website.
Copy the previous code and continue the pipe from unnest_longer()
to unnest_wider()
and within it pass in results
as the argument which will reveal the value within results.
... |> unnest_wider(...) |> unnest_...(...)|> ..._wider(...)
Alternatively, you can reach inside the parsed JSON and start with the bit that you actually care about.
To do it faster, all we had to is the following:
df4 <- tibble(results = parse_json(json1)$results) df4 |> unnest_wider(results)
Good work on finishing the tutorial!
This tutorial covered Chapter 23: Hierarchical data from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to work with non-rectanglar data using packages like jsonlite.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.