Data == knowledge! Much of the data we use, whether it be from government repositories, social media, GitHub, or e-commerce sites comes from public-facing APIs. The quantity of data available is truly staggering, but munging JSON output into a format that is easily analyzable in R is an equally staggering undertaking. When JSON is turned into an R object, it usually becomes a deeply nested list riddled with missing values that is difficult to untangle into a tidy format. Moreover, every API presents its own challenges; code you've written to clean up data from GitHub isn't necessarily going to work on Twitter data, as each API spews data out in its own unique, headache-inducing nested list structure. To ease and generalize this process, Amanda Dobbyn proposed an #unconf18 project for a general API response tidier! Welcome roomba, our first stab at easing the process of tidying nested lists!

Drawing

roomba will eventually be able to walk nested lists in a variety of different structures from JSON output, replace NULL or .empty values with NAs or a user-specified value, and return a tibble with names matching a user-specified list. Of course, in two days we haven't fully achieved this vision, but we're off to a promising start.

The birth of roomba

It was clear Amanda was on to something good by the lively discussion in the #runconf18 issues repository leading up to the unconf. Thanks to input from Jenny Bryan, Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam Ross, we had a lot of ideas to work with when the unconf began. Fortunately, Jim already had a function called dfs_idx() (here) written to perform depth-first searches of nested lists from the GitNub GraphQL API. With the core list-traversal code out of the way, we split our efforts between developing a usable interface, stockpiling .JSON files to test on, and developing a Shiny app.

What's working

We've got the basic structure of roomba sorted out, and you should install it from GitHub to try out! Here are a few of the examples we've put together.

library(roomba)
#load twitter data example
data(twitter_data)

#roomba-fy!
roomba(twitter_data, c("created_at", "name"))

And just the first element of the twitter_data list will show you that roomba has simplified this process quite a bit.

twitter_data[[1]]

We created a Shiny app too, which in its current state allows you to select a .Rda or .JSON file, pick two variables, and create a scatterplot of them.

Run the app like this:

shiny_roomba()

What's not

Of course, in two days we weren't able to build a magical one-size-fits-all solution to every API response data headache. Right now, the main barrier to usability is that both the roomba() function and shiny_roomba() app only work on sub-list items of the same length and same data type stored at the same depth. To illustrate on the twitter_data:

#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))


#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))

In addition, we've got some features we want to add, such as handling a larger variety of column names (i.e. passing a string for a single column name, keeping all values even if they are all NULL). We would love your feedback on other things we can add!

The team

Amanda Dobbyn
Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for dfs_idx()

Jim Hester
Job: Software Engineer at RStudio
Project contributions: dfs_idx() and remove_nulls() functions, package building, README, and debugging

Christine Stawitz
Job: Fishery Biologist at NOAA Fisheries
Project contributions: Shiny app, README and blog post writing

Laura DeCicco
Job: Data Scientist at U.S. Geological Survey
Project contributions: Fixing merge conflicts :)

Isabella Velasquez
Job: Data Analyst at the Bill & Melinda Gates Foundation
Project contributions: hex sticker!



ropenscilabs/roomba documentation built on July 26, 2021, 7:37 p.m.