Data == knowledge! Much of the data we use, whether it be from
government repositories, social media, GitHub, or e-commerce sites comes
from public-facing APIs. The quantity of data available is truly
staggering, but munging JSON output into a format that is easily
analyzable in R is an equally staggering undertaking. When JSON is
turned into an R object, it usually becomes a deeply nested list riddled
with missing values that is difficult to untangle into a tidy format.
Moreover, every API presents its own challenges; code you've written to
clean up data from GitHub isn't necessarily going to work on Twitter
data, as each API spews data out in its own unique, headache-inducing
nested list structure. To ease and generalize this process, Amanda
Dobbyn proposed an
#unconf18 project for a general API response tidier! Welcome roomba
,
our first stab at easing the process of tidying nested lists!
roomba
will eventually be able to walk nested lists in a variety of
different structures from JSON output, replace NULL
or .empty
values
with NA
s or a user-specified value, and return a tibble
with names
matching a user-specified list. Of course, in two days we haven't
fully achieved this vision, but we're off to a promising start.
It was clear Amanda was on to something good by the lively discussion in
the #runconf18 issues
repository leading up to the unconf. Thanks to input from Jenny Bryan,
Jim Hester, Carl Boettinger, Scott Chamberlain, Bob Rudis, and Noam
Ross, we had a lot of ideas to work with when the unconf began.
Fortunately, Jim already had a function called dfs_idx()
(here)
written to perform depth-first searches of nested lists from the GitNub
GraphQL API. With the core
list-traversal code out of the way, we split our efforts between
developing a usable interface, stockpiling .JSON
files to test on, and
developing a Shiny app.
We've got the basic structure of roomba
sorted out, and you should
install it from GitHub to try out! Here are a few of the examples we've
put together.
library(roomba)
#load twitter data example
data(twitter_data)
#roomba-fy!
roomba(twitter_data, c("created_at", "name"))
## # A tibble: 24 x 2
## name created_at
## <chr> <chr>
## 1 Code for America Mon Aug 10 18:59:29 +0000 2009
## 2 Ben Lorica <U+7F57><U+745E><U+5361> Mon Dec 22 22:06:18 +0000 2008
## 3 Dan Sholler Thu Apr 03 20:09:24 +0000 2014
## 4 Code for America Mon Aug 10 18:59:29 +0000 2009
## 5 FiveThirtyEight Tue Jan 21 21:39:32 +0000 2014
## 6 Digital Impact Wed Oct 07 21:10:53 +0000 2009
## 7 Drew Williams Thu Aug 07 18:41:29 +0000 2014
## 8 joe Fri May 29 13:25:25 +0000 2009
## 9 Data Analysts 4 Good Wed May 07 16:55:33 +0000 2014
## 10 Ryan Frederick Sun Mar 01 19:06:53 +0000 2009
## # ... with 14 more rows
And just the first element of the twitter_data
list will show you
that roomba
has simplified this process quite a bit.
twitter_data[[1]]
## $created_at
## [1] "Mon May 21 17:58:09 +0000 2018"
##
## $id
## [1] 9.98624e+17
##
## $id_str
## [1] "998623997397876743"
##
## $text
## [1] "Could a program like food stamps have a Cambridge Analytica moment? How do we allow for the innovation that data pl
https://t.co/7tVf1qmNmq"
##
## $truncated
## [1] TRUE
##
## $entities
## $entities$hashtags
## list()
##
## $entities$symbols
## list()
##
## $entities$user_mentions
## list()
##
## $entities$urls
## $entities$urls[[1]]
## $entities$urls[[1]]$url
## [1] "https://t.co/7tVf1qmNmq"
##
## $entities$urls[[1]]$expanded_url
## [1] "https://twitter.com/i/web/status/998623997397876743"
##
## $entities$urls[[1]]$display_url
## [1] "twitter.com/i/web/status/9
"
##
## $entities$urls[[1]]$indices
## $entities$urls[[1]]$indices[[1]]
## [1] 117
##
## $entities$urls[[1]]$indices[[2]]
## [1] 140
##
##
##
##
##
## $source
## [1] "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>"
##
## $in_reply_to_status_id
## NULL
##
## $in_reply_to_status_id_str
## NULL
##
## $in_reply_to_user_id
## NULL
##
## $in_reply_to_user_id_str
## NULL
##
## $in_reply_to_screen_name
## NULL
##
## $user
## $user$id
## [1] 64482503
##
## $user$id_str
## [1] "64482503"
##
## $user$name
## [1] "Code for America"
##
## $user$screen_name
## [1] "codeforamerica"
##
## $user$location
## [1] "San Francisco, California"
##
## $user$description
## [1] "Government can work for the people, by the people, in the 21st century. Help us make it so."
##
## $user$url
## [1] "https://t.co/l9lokka0rJ"
##
## $user$entities
## $user$entities$url
## $user$entities$url$urls
## $user$entities$url$urls[[1]]
## $user$entities$url$urls[[1]]$url
## [1] "https://t.co/l9lokka0rJ"
##
## $user$entities$url$urls[[1]]$expanded_url
## [1] "http://codeforamerica.org"
##
## $user$entities$url$urls[[1]]$display_url
## [1] "codeforamerica.org"
##
## $user$entities$url$urls[[1]]$indices
## $user$entities$url$urls[[1]]$indices[[1]]
## [1] 0
##
## $user$entities$url$urls[[1]]$indices[[2]]
## [1] 23
##
##
##
##
##
## $user$entities$description
## $user$entities$description$urls
## list()
##
##
##
## $user$protected
## [1] FALSE
##
## $user$followers_count
## [1] 49202
##
## $user$friends_count
## [1] 1716
##
## $user$listed_count
## [1] 2659
##
## $user$created_at
## [1] "Mon Aug 10 18:59:29 +0000 2009"
##
## $user$favourites_count
## [1] 4490
##
## $user$utc_offset
## [1] -25200
##
## $user$time_zone
## [1] "Pacific Time (US & Canada)"
##
## $user$geo_enabled
## [1] TRUE
##
## $user$verified
## [1] TRUE
##
## $user$statuses_count
## [1] 15912
##
## $user$lang
## [1] "en"
##
## $user$contributors_enabled
## [1] FALSE
##
## $user$is_translator
## [1] FALSE
##
## $user$is_translation_enabled
## [1] FALSE
##
## $user$profile_background_color
## [1] "EBEBEB"
##
## $user$profile_background_image_url
## [1] "http://abs.twimg.com/images/themes/theme7/bg.gif"
##
## $user$profile_background_image_url_https
## [1] "https://abs.twimg.com/images/themes/theme7/bg.gif"
##
## $user$profile_background_tile
## [1] FALSE
##
## $user$profile_image_url
## [1] "http://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
##
## $user$profile_image_url_https
## [1] "https://pbs.twimg.com/profile_images/615534833645678592/iAO_Lytr_normal.jpg"
##
## $user$profile_banner_url
## [1] "https://pbs.twimg.com/profile_banners/64482503/1497895952"
##
## $user$profile_link_color
## [1] "CF1B41"
##
## $user$profile_sidebar_border_color
## [1] "FFFFFF"
##
## $user$profile_sidebar_fill_color
## [1] "F3F3F3"
##
## $user$profile_text_color
## [1] "333333"
##
## $user$profile_use_background_image
## [1] FALSE
##
## $user$has_extended_profile
## [1] FALSE
##
## $user$default_profile
## [1] FALSE
##
## $user$default_profile_image
## [1] FALSE
##
## $user$following
## [1] TRUE
##
## $user$follow_request_sent
## [1] FALSE
##
## $user$notifications
## [1] FALSE
##
## $user$translator_type
## [1] "none"
##
##
## $geo
## NULL
##
## $coordinates
## NULL
##
## $place
## NULL
##
## $contributors
## NULL
##
## $is_quote_status
## [1] FALSE
##
## $retweet_count
## [1] 0
##
## $favorite_count
## [1] 0
##
## $favorited
## [1] FALSE
##
## $retweeted
## [1] FALSE
##
## $possibly_sensitive
## [1] FALSE
##
## $possibly_sensitive_appealable
## [1] FALSE
##
## $lang
## [1] "en"
We created a Shiny app too, which in its current state allows you to
select a .Rda
or .JSON
file, pick two variables, and create a
scatterplot of them.
Run the app like this:
shiny_roomba()
Of course, in two days we weren't able to build a magical
one-size-fits-all solution to every API response data headache. Right
now, the main barrier to usability is that both the roomba()
function
and shiny_roomba()
app only work on sub-list items of the same length
and same data type stored at the same depth. To illustrate on the
twitter_data
:
#This doesn't work because "user" has data of different types and lengths
roomba(twitter_data, c("user"))
## # A tibble: 1,007 x 1
## user
## <list>
## 1 <int [1]>
## 2 <chr [1]>
## 3 <chr [1]>
## 4 <chr [1]>
## 5 <chr [1]>
## 6 <chr [1]>
## 7 <chr [1]>
## 8 <list [2]>
## 9 <lgl [1]>
## 10 <int [1]>
## # ... with 997 more rows
#This doesn't work because "name" and "retweet_count" are at different depths.
roomba(twitter_data, c("name","retweet_count"))
## # A tibble: 0 x 0
In addition, we've got some features we want to add, such as handling a
larger variety of column names (i.e. passing a string for a single
column name, keeping all values even if they are all NULL
). We would
love your feedback on other things we can add!
Amanda Dobbyn Job: Data Scientist at Earlybird Software
Project contributions: initial GH issue, package name, wrapper for
dfs_idx()
Jim Hester Job: Software Engineer at RStudio Project
contributions: dfs_idx()
and remove_nulls()
functions, package
building, README, and debugging
Christine Stawitz Job: Postdoctoral researcher at UW's School of Aquatic and Fishery Sciences Project contributions: Shiny app, README and blog post writing
Laura DeCicco Job: Data Scientist at U.S. Geological Survey Project contributions: Fixing merge conflicts :)
Isabella Velasquez Job: Data Analyst at the Bill & Melinda Gates Foundation Project contributions: hex sticker!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.