knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "README-" )
tidyjson provides tools for turning complex json into tidy data.
Get the released version from CRAN:
install.packages("tidyjson")
or the development version from github:
devtools::install_github("colearendt/tidyjson")
The following example takes a character vector of
r library(tidyjson);length(worldbank)
documents in the worldbank
dataset and spreads out all objects.
Every JSON object key gets its own column with types inferred, so long
as the key does not represent an array. When recursive=TRUE
(the default behavior),
spread_all
does this recursively for nested objects and creates column names
using the sep
parameter (i.e. {"a":{"b":1}}
with sep='.'
would
generate a single column: a.b
).
library(dplyr) library(tidyjson) worldbank %>% spread_all
Some objects in worldbank
are arrays, which are not handled by spread_all
. This example shows how
to quickly summarize the top level structure of a JSON collection
worldbank %>% gather_object %>% json_types %>% count(name, type)
In order to capture the data in the majorsector_percent
array, we can use enter_object
to enter into that object, gather_array
to stack the array and spread_all
to capture the object items under the array.
worldbank %>% enter_object(majorsector_percent) %>% gather_array %>% spread_all %>% select(-document.id, -array.index)
spread_all()
for spreading all object values into new columns, with nested
objects having concatenated names
spread_values()
for specifying a subset of object values to spread into new
columns using the jstring()
, jinteger()
, jdouble()
and jlogical()
functions. It is possible to specify multiple parameters to extract data from
nested objects (i.e. jstring('a','b')
).
enter_object()
for entering into an object by name, discarding all other
JSON (and rows without the corresponding object name) and allowing further
operations on the object value
gather_object()
for stacking all object name-value pairs by name, expanding
the rows of the tbl_json
object accordingly
gather_array()
for stacking all array values by index, expanding the
rows of the tbl_json
object accordinglyjson_types()
for identifying JSON data types
json_length()
for computing the length of JSON data (can be larger than
1
for objects and arrays)
json_complexity()
for computing the length of the unnested JSON, i.e.,
how many terminal leaves there are in a complex JSON structure
is_json
family of functions for testing the type of JSON data
json_structure()
for creating a single fixed column data.frame that
recursively structures arbitrary JSON data
json_schema()
for representing the schema of complex JSON, unioned across
disparate JSON documents, and collapsing arrays to their most complex type
representation
as.tbl_json()
for converting a string or character vector into a tbl_json
object, or for converting a data.frame
with a JSON column using the
json.column
argument
tbl_json()
for combining a data.frame
and associated list
derived
from JSON data into a tbl_json
object
read_json()
for reading JSON data from a file
as.character.tbl_json
for converting the JSON attribute of a tbl_json
object back into a JSON character stringcommits
: commit data for the dplyr repo from github API
issues
: issue data for the dplyr repo from github API
worldbank
: world bank funded projects from jsonstudio
companies
: startup company data from jsonstudio
The goal is to turn complex JSON data, which is often represented as nested lists, into tidy data frames that can be more easily manipulated.
Work on a single JSON document, or on a collection of related documents
Create pipelines with %>%
, producing code that can be read from left to
right
Guarantee the structure of the data produced, even if the input JSON
structure changes (with the exception of spread_all
)
Work with arbitrarily nested arrays or objects
Handle 'ragged' arrays and / or objects (varying lengths by document)
Allow for extraction of data in values or object names
Ensure edge cases are handled correctly (especially empty data)
Integrate seamlessly with dplyr
, allowing tbl_json
objects to pipe in and
out of dplyr
verbs where reasonable
Tidyjson depends upon
%>%
pipe operatorFurther, there are other R packages that can be used to better understand JSON data
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.