With tibblify()
you can rectangle deeply nested lists into a tidy tibble. These
lists might come from an API in the form of JSON or from scraping XML. The reasons
to use tibblify()
over other tools like jsonlite::fromJSON()
or tidyr::hoist()
are:
jsonlite::fromJSON().
jsonlite::fromJSON().

Let's start with gh_users
, which is a list containing information about four
GitHub users.
library(tibblify)

gh_users_small <- purrr::map(
  gh_users,
  ~ .x[c("followers", "login", "url", "name", "location", "email", "public_gists")]
)
names(gh_users_small[[1]])
Quickly rectangling gh_users_small
is as easy as applying tibblify()
to it:
tibblify(gh_users_small)
We can now look at the specification tibblify()
used for rectangling:
guess_tspec(gh_users_small)
If we are only interested in some of the fields, we can easily adapt the specification:
spec <- tspec_df(
  login_name = tib_chr("login"),
  tib_chr("name"),
  tib_int("public_gists")
)
tibblify(gh_users_small, spec)
We refer to lists like gh_users_small
as collections and to their elements as objects. Objects and collections are the
typical input for
tibblify()
.
Basically, an object is something that can be converted to a one-row tibble. This boils down to conditions on the names of the object:
the object must have names (the names attribute must not be NULL),
every element must be named (no name may be NA or ""),
and the names must be unique.
In other words, the names must fulfill vec_as_names(repair = "check_unique").
The name-value pairs of an object are the fields.
For example list(x = 1, y = "a")
is an object with the fields (x, 1)
and
(y, "a")
but list(1, z = 3)
is not an object because it is not fully named.
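The naming conditions can also be checked by hand. The following is a minimal base R sketch; is_object_like() is a hypothetical helper introduced here for illustration, not part of tibblify, and it may not cover every check that vec_as_names() performs.

```r
# Hypothetical helper: TRUE if `x` is a list whose names are all present,
# non-empty, and unique.
is_object_like <- function(x) {
  nms <- names(x)
  is.list(x) &&
    !is.null(nms) &&
    !anyNA(nms) &&
    all(nms != "") &&
    anyDuplicated(nms) == 0L
}

is_object_like(list(x = 1, y = "a")) # fully named
is_object_like(list(1, z = 3))       # first element has no name
is_object_like(list(x = 1, x = 2))   # duplicated name
```

Only the first call satisfies all three conditions.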
A collection is basically just a list of similar objects so that the fields can become the columns in a tibble.
Providing an explicit specification has a couple of advantages:
As seen before, the specification for a collection is created with tspec_df(). The
columns of the output tibble are described with the tib_*() functions. They
describe the path to the field to extract and the output type of the field. There
are five types of functions:
tib_scalar(ptype): a length-one vector with type ptype
tib_vector(ptype): a vector of arbitrary length with type ptype
tib_variant(): a vector of arbitrary length and type; you should rarely need this
tib_row(...): an object with the fields ...
tib_df(...): a collection where the objects have the fields ...
For convenience there are shortcuts for tib_scalar() and tib_vector() for
the most common prototypes:
logical(): tib_lgl() and tib_lgl_vec()
integer(): tib_int() and tib_int_vec()
double(): tib_dbl() and tib_dbl_vec()
character(): tib_chr() and tib_chr_vec()
Date: tib_date() and tib_date_vec()
Date encoded as character: tib_chr_date() and tib_chr_date_vec()
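As a sketch of how the shortcuts relate to the general functions, the two specifications below should describe the same columns. It assumes that tib_vector() takes the key and ptype in the same way as the tib_scalar() call used later in this document; the input list is made up for illustration.

```r
library(tibblify)

# Made-up input for illustration
x <- list(
  list(id = 1L, tags = c("a", "b")),
  list(id = 2L, tags = "c")
)

# Using the shortcuts
spec_short <- tspec_df(
  tib_int("id"),
  tib_chr_vec("tags")
)

# Using the general functions with explicit prototypes
# (assumption: tib_vector() accepts `ptype` like tib_scalar() does)
spec_long <- tspec_df(
  tib_scalar("id", ptype = integer()),
  tib_vector("tags", ptype = character())
)

tibblify(x, spec_short)
tibblify(x, spec_long)
```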
Scalar elements are the most common case and result in a normal vector column:
tibblify(
  list(
    list(id = 1, name = "Peter"),
    list(id = 2, name = "Lilly")
  ),
  tspec_df(
    tib_int("id"),
    tib_chr("name")
  )
)
With tib_scalar() you can also provide your own prototype. Let's say you have a list with durations:
x <- list(
  list(id = 1, duration = vctrs::new_duration(100)),
  list(id = 2, duration = vctrs::new_duration(200))
)
x
You can then use the duration prototype in tib_scalar():
tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_scalar("duration", ptype = vctrs::new_duration())
  )
)
If an element does not always have size one, it is a vector element. If it
still always has the same type ptype,
it produces a list-of-ptype
column:
x <- list(
  list(id = 1, children = c("Peter", "Lilly")),
  list(id = 2, children = "James"),
  list(id = 3, children = c("Emma", "Noah", "Charlotte"))
)
tibblify(
  x,
  tspec_df(
    tib_int("id"),
    tib_chr_vec("children")
  )
)
You can use tidyr::unnest()
or tidyr::unnest_longer()
to flatten these columns to regular columns.
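A minimal sketch of the flattening step, using a hand-built tibble as a stand-in for a tibblify() result. The plain list column here is an assumption made so the example does not depend on the tibblify output above.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in for a result with a list column
df <- tibble(
  id = 1:3,
  children = list(c("Peter", "Lilly"), "James", c("Emma", "Noah", "Charlotte"))
)

# One row per child; `id` is repeated as needed
long <- unnest_longer(df, children)
long
```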
For example in gh_repos_small
gh_repos_small <- purrr::map(gh_repos, ~ .x[c("id", "name", "owner")])
gh_repos_small <- purrr::map(
  gh_repos_small,
  function(repo) {
    repo$owner <- repo$owner[c("login", "id", "url")]
    repo
  }
)
gh_repos_small[[1]]
the field owner
is an object itself. The specification to extract it uses tib_row():
spec <- guess_tspec(gh_repos_small)
spec
and results in a tibble column:
tibblify(gh_repos_small, spec)
If you don't like the tibble column you can unpack it with tidyr::unpack()
.
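A minimal sketch of unpacking, using a hand-built tibble with made-up values as a stand-in for the result above. Because the nested tibble also contains an id column, the sketch passes names_sep so that the unpacked names do not clash with the top-level id.

```r
library(tibble)
library(tidyr)

# Hand-built stand-in with made-up values: `owner` is a tibble column
df <- tibble(
  id = c(1L, 2L),
  name = c("repo_a", "repo_b"),
  owner = tibble(login = c("jane", "joe"), id = c(10L, 20L))
)

# Prefix the unpacked columns with "owner_" to keep names unique
flat <- unpack(df, owner, names_sep = "_")
names(flat)
```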
Alternatively, if you only want to extract some of the fields in owner
you
can use a nested path:
spec2 <- tspec_df(
  id = tib_int("id"),
  name = tib_chr("name"),
  owner_id = tib_int(c("owner", "id")),
  owner_login = tib_chr(c("owner", "login"))
)
spec2

tibblify(gh_repos_small, spec2)
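Conceptually, a nested path is followed one key at a time. The base R sketch below only illustrates that idea on a made-up object; it is not how tibblify is implemented.

```r
# Made-up repository object for illustration
repo <- list(
  id = 1L,
  name = "demo",
  owner = list(login = "jane", id = 42L)
)

path <- c("owner", "id")

# Follow the path one key at a time, starting from the whole object
value <- Reduce(function(obj, key) obj[[key]], path, repo)
value
```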
Objects usually have some fields that always exist and some that are optional.
By default tib_*()
requires that a field exists, so the following call fails because the second object has no field y:
x <- list(
  list(x = 1, y = "a"),
  list(x = 2)
)
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y")
)
tibblify(x, spec)
You can mark a field as optional with the argument required = FALSE
:
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE)
)
tibblify(x, spec)
You can specify the value to use for a missing field with the fill
argument:
spec <- tspec_df(
  x = tib_int("x"),
  y = tib_chr("y", required = FALSE, fill = "missing")
)
tibblify(x, spec)
To rectangle a single object you have two options: tspec_object()
which produces
a list or tspec_row()
which produces a tibble with one row.
While tibbles are great for collections, for a single object it often makes more sense to convert it to a list.
For example, a typical API response might look like this:
api_output <- list(
  status = "success",
  requested_at = "2021-10-26 09:17:12",
  data = list(
    list(x = 1),
    list(x = 2)
  )
)
To convert to a one row tibble
row_spec <- tspec_row(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)
api_output_df <- tibblify(api_output, row_spec)
api_output_df
it is necessary to wrap data
in a list. To access data
one has to use
api_output_df$data[[1]]
which is inconvenient. With tspec_object() the result is a list instead:
object_spec <- tspec_object(
  status = tib_chr("status"),
  data = tib_df(
    "data",
    x = tib_int("x")
  )
)
api_output_list <- tibblify(api_output, object_spec)
api_output_list
Now accessing data
does not require an extra subsetting step:
api_output_list$data