Data == knowledge! Much of the data we use, whether it be from government repositories, social media, GitHub, or e-commerce sites comes from public-facing APIs. The quantity of data available is truly staggering, but munging JSON output into a format that is easily analyzable in R is an equally staggering undertaking. When JSON is turned into an R object, it usually becomes a deeply nested list riddled with missing values that is difficult to untangle into a tidy format. Moreover, every API presents its own challenges; code you've written to clean up data from GitHub isn't necessarily going to work on Twitter data, as each API spews data out in its own unique, headache-inducing nested list structure. To ease and generalize this process, Amanda Dobbyn proposed an #unconf18 project for a general API response tidier! Welcome roomba, our first stab at easing the process of tidying nested lists!
roomba will eventually be able to walk nested lists in a variety of different structures from JSON output, replace NULL or .empty values with NAs or a user-specified value, and return a tibble with names matching a user-specified list. Of course, in two days we haven't fully achieved this vision, but we're off to a promising start.
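To make that goal concrete, here is a minimal base R sketch of the idea: swap NULL or empty entries for NA, then pull a user-specified set of names into a rectangular result. This is not roomba's implementation. The names fill_nulls() and tidy_records() and the toy response list are made up for illustration, the sketch returns a plain data frame rather than a tibble to stay dependency-free, and it assumes every record contains every requested name at the top level.

```r
# Toy stand-in for a parsed JSON response (hypothetical data, not from a real API).
response <- list(
  list(name = "a", stars = 10L, language = "R"),
  list(name = "b", stars = NULL, language = "Python"),
  list(name = "c", stars = 3L, language = NULL)
)

# Recursively swap NULL or zero-length entries for a replacement value.
fill_nulls <- function(x, replacement = NA) {
  if (is.null(x) || length(x) == 0) {
    return(replacement)
  }
  if (is.list(x)) {
    return(lapply(x, fill_nulls, replacement = replacement))
  }
  x
}

# Keep only the requested fields of each record and stack them row-wise.
# Assumes each record has every name in `cols` at its top level.
tidy_records <- function(records, cols) {
  cleaned <- fill_nulls(records)
  rows <- lapply(cleaned, function(rec) {
    as.data.frame(rec[cols], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

result <- tidy_records(response, c("name", "stars"))
print(result)
```

The missing stars value for record "b" shows up as NA instead of silently dropping the row, which is the behavior the paragraph above describes wanting.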
It was clear Amanda was on to something good by the lively discussion in the #runconf18 issues repository leading up to the unconf. Thanks to input from Jenny Bryan, Jim Hester, Carl Boettiger, Scott Chamberlain, Bob Rudis, and Noam Ross, we had a lot of ideas to work with when the unconf began. Fortunately, Jim already had a function called dfs_idx() written to perform depth-first searches of nested lists from the GitHub GraphQL API. With the core list-traversal code out of the way, we split our efforts between developing a usable interface, stockpiling .JSON files to test on, and developing a Shiny app.
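For readers who haven't written one, a depth-first search over a nested list can be sketched in a few lines of base R. The function below is a hypothetical illustration, not Jim's dfs_idx(); the name find_paths() and the nested example list are assumptions. It collects the index path to every element with a given name, descending into each child before moving on to its siblings.

```r
# Return the index path to every element named `target`,
# visiting each child's subtree before moving to the next sibling.
find_paths <- function(x, target, path = integer(0)) {
  found <- list()
  if (!is.list(x)) {
    return(found)
  }
  nms <- names(x)
  for (i in seq_along(x)) {
    here <- c(path, i)
    if (!is.null(nms) && identical(nms[[i]], target)) {
      found <- c(found, list(here))
    }
    # Recurse into the child before looking at the next sibling.
    found <- c(found, find_paths(x[[i]], target, here))
  }
  found
}

# Hypothetical nested response with "name" buried inside "user".
nested <- list(
  list(id = 1L, user = list(name = "ada", followers = 5L)),
  list(id = 2L, user = list(name = "grace", followers = NULL))
)

paths <- find_paths(nested, "name")

# An index path can be handed straight to `[[` to recover the value.
values <- vapply(paths, function(p) nested[[p]], character(1))
print(values)
```

Returning index paths rather than values keeps the search separate from the extraction step, so the same traversal can serve several column names.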
We've got the basic structure of roomba sorted out, and you should install it from GitHub to try it out! Here are a few of the examples we've put together.
library(roomba)

# load twitter data example
data(twitter_data)

# roomba-fy!
roomba(twitter_data, c("created_at", "name"))
A look at just the first element of the twitter_data list will show you that roomba has simplified this process quite a bit.
twitter_data[[1]]
We created a Shiny app too, which in its current state allows you to select a .Rda or .JSON file, pick two variables, and create a scatterplot of them.
Run the app like this:
shiny_roomba()
Of course, in two days we weren't able to build a magical one-size-fits-all solution to every API response data headache. Right now, the main barrier to usability is that both the roomba() function and shiny_roomba() app only work on sub-list items of the same length and same data type stored at the same depth. To illustrate with twitter_data:
# This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))

# This doesn't work because "name" and "retweet_count" are at different depths
roomba(twitter_data, c("name", "retweet_count"))
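If "different depths" sounds abstract, a quick check on a toy record makes it visible. The tweet list below is a simplified, hypothetical layout loosely modeled on the examples above, not the real twitter_data, and depth_of() is a helper invented for this sketch.

```r
# Simplified, made-up tweet: "name" lives inside "user", one level down.
tweet <- list(
  created_at = "Mon May 21",
  retweet_count = 2L,
  user = list(name = "ada", followers_count = 5L)
)

# Depth (1 = top level) of the first match found in a depth-first walk,
# or NA if the name never appears.
depth_of <- function(x, target, depth = 1L) {
  if (!is.list(x)) {
    return(NA_integer_)
  }
  if (target %in% names(x)) {
    return(depth)
  }
  for (child in x) {
    d <- depth_of(child, target, depth + 1L)
    if (!is.na(d)) {
      return(d)
    }
  }
  NA_integer_
}

depth_of(tweet, "retweet_count")  # 1
depth_of(tweet, "name")           # 2
```

Because the two names sit at different levels in this toy record, a tidier that expects all requested columns at one depth has nothing consistent to line them up against.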
In addition, we've got some features we want to add, such as handling a larger variety of column names (e.g. passing a string for a single column name) and keeping all values even if they are all NULL. We would love your feedback on other things we can add!
Amanda Dobbyn
Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for dfs_idx()
Jim Hester
Job: Software Engineer at RStudio
Project contributions: dfs_idx() and remove_nulls() functions, package building, README, and debugging
Christine Stawitz
Job: Fishery Biologist at NOAA Fisheries
Project contributions: Shiny app, README and blog post writing
Laura DeCicco
Job: Data Scientist at U.S. Geological Survey
Project contributions: Fixing merge conflicts :)
Isabella Velasquez
Job: Data Analyst at the Bill & Melinda Gates Foundation
Project contributions: hex sticker!