knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
With tibblify()
you can rectangle deeply nested lists into a tidy tibble. These
lists might come from an API in the form of JSON or from scraping XML. The reasons
to use tibblify()
over other tools like jsonlite::fromJSON()
or tidyr::hoist()
are:
jsonlite::fromJSON()
.jsonlite::fromJSON()
.Let's start with gh_users
, which is a list containing information about four
GitHub users.
library(tibblify) gh_users_small <- purrr::map(gh_users, ~ .x[c("followers", "login", "url", "name", "location", "email", "public_gists")]) names(gh_users_small[[1]])
Quickly rectangling gh_users_small
is as easy as applying tibblify()
to it:
tibblify(gh_users_small)
We can now look at the specification tibblify()
used for rectangling
guess_tspec(gh_users_small)
If we are only interested in some of the fields we can easily adapt the specification
spec <- tspec_df( login_name = tib_chr("login"), tib_chr("name"), tib_int("public_gists") ) tibblify(gh_users_small, spec)
We refer to lists like gh_users_small
as collection and objects are the
elements of such lists. Objects and collections are the typical input for
tibblify()
.
Basically, an object is simply something that can be converted to a one row tibble. This boils down to a condition on the names of the object:
object
must have names (the names
attribute must not be NULL
),NA
or ""
),In other words, the names must fulfill vec_as_names(repair = "check_unique")
.
The name-value pairs of an object are the fields.
For example list(x = 1, y = "a")
is an object with the fields (x, 1)
and
(y, "a")
but list(1, z = 3)
is not an object because it is not fully named.
A collection is basically just a list of similar objects so that the fields can become the columns in a tibble.
Providing an explicit specification has a couple of advantages:
As seen before the specification for a collection is done with tspec_df()
. The
columns of the output tibble are describe with the tib_*()
functions. They
describe the path to the field to extract and the output type of the field. There
are the following five types of functions:
tib_scalar(ptype)
: a length one vector with type ptype
tib_vector(ptype)
: a vector of arbitrary length with type ptype
tib_variant()
: a vector of arbitrary length and type; you should barely ever need thistib_row(...)
: an object with the fields ...
tib_df(...)
: a collection where the objects have the fields ...
For convenience there are shortcuts for tib_scalar()
and tib_vector()
for
the most common prototypes:
logical()
: tib_lgl()
and tib_lgl_vec()
integer()
: tib_int()
and tib_int_vec()
double()
: tib_dbl()
and tib_dbl_vec()
character()
: tib_chr()
and tib_chr_vec()
Date
: tib_date()
and tib_date_vec()
Date
encoded as character: tib_chr_date()
and tib_chr_date_vec()
Scalar elements are the most common case and result in a normal vector column
tibblify( list( list(id = 1, name = "Peter"), list(id = 2, name = "Lilly") ), tspec_df( tib_int("id"), tib_chr("name") ) )
With tib_scalar()
you can also provide your own prototype
Let's say you have a list with durations
x <- list( list(id = 1, duration = vctrs::new_duration(100)), list(id = 2, duration = vctrs::new_duration(200)) ) x
and then use it in tib_scalar()
tibblify( x, tspec_df( tib_int("id"), tib_scalar("duration", ptype = vctrs::new_duration()) ) )
If an element does not always have size one then it is a vector element. If it
still always has the same type ptype
then it produces a list of ptype
column:
x <- list( list(id = 1, children = c("Peter", "Lilly")), list(id = 2, children = "James"), list(id = 3, children = c("Emma", "Noah", "Charlotte")) ) tibblify( x, tspec_df( tib_int("id"), tib_chr_vec("children") ) )
You can use tidyr::unnest()
or tidyr::unnest_longer()
to flatten these columns to regular columns.
For example in gh_repos_small
gh_repos_small <- purrr::map(gh_repos, ~ .x[c("id", "name", "owner")]) gh_repos_small <- purrr::map( gh_repos_small, function(repo) { repo$owner <- repo$owner[c("login", "id", "url")] repo } ) gh_repos_small[[1]]
the field owner
is an object itself. The specification to extract it uses tib_row()
spec <- guess_tspec(gh_repos_small) spec
and results in a tibble column
tibblify(gh_repos_small, spec)
If you don't like the tibble column you can unpack it with tidyr::unpack()
.
Alternatively, if you only want to extract some of the fields in owner
you
can use a nested path
spec2 <- tspec_df( id = tib_int("id"), name = tib_chr("name"), owner_id = tib_int(c("owner", "id")), owner_login = tib_chr(c("owner", "login")) ) spec2 tibblify(gh_repos_small, spec2)
Objects usually have some fields that always exist and some that are optional.
By default tib_*()
demands that a field exists
x <- list( list(x = 1, y = "a"), list(x = 2) ) spec <- tspec_df( x = tib_int("x"), y = tib_chr("y") ) tibblify(x, spec)
You can mark a field as optional with the argument required = FALSE
:
spec <- tspec_df( x = tib_int("x"), y = tib_chr("y", required = FALSE) ) tibblify(x, spec)
You can specify the value to use with the fill
argument
spec <- tspec_df( x = tib_int("x"), y = tib_chr("y", required = FALSE, fill = "missing") ) tibblify(x, spec)
To rectangle a single object you have two options: tspec_object()
which produces
a list or tspec_row()
which produces a tibble with one row.
While tibbles are great for a single object it often makes more sense to convert them to a list.
For example a typical API response might be something like
api_output <- list( status = "success", requested_at = "2021-10-26 09:17:12", data = list( list(x = 1), list(x = 2) ) )
To convert to a one row tibble
row_spec <- tspec_row( status = tib_chr("status"), data = tib_df( "data", x = tib_int("x") ) ) api_output_df <- tibblify(api_output, row_spec) api_output_df
it is necessary to wrap data
in a list. To access data
one has to use
api_output_df$data[[1]]
which is not very nice.
object_spec <- tspec_object( status = tib_chr("status"), data = tib_df( "data", x = tib_int("x") ) ) api_output_list <- tibblify(api_output, object_spec) api_output_list
Now accessing data
does not required an extra subsetting step
api_output_list$data
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.