knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
knitr::opts_chunk$set(fig.width = 7, fig.height = 5)
options(tibble.print_min = 4L, tibble.print_max = 4L)
JSON is a simple data standard that, through nested data structures, can represent remarkably complex datasets. In some cases a JSON document corresponds closely to a table in a SQL database; more commonly, a single JSON document maps to an entire SQL database.
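As a sketch of that correspondence (the record and its field names are invented for illustration, though they mirror the kind of fields in the companies data used below), a single document can embed what would be two related SQL tables:

```r
library(tidyjson)

# A hypothetical company record: the top-level scalars map to one
# SQL table, while the nested "funding_rounds" array maps to a
# second, related table.
company <- '{
  "name": "Acme",
  "founded": 2005,
  "funding_rounds": [
    {"round": "a", "raised": 1000000},
    {"round": "b", "raised": 5000000}
  ]
}'

# The "companies" table: one row per document
company %>% spread_values(name = jstring(name), founded = jnumber(founded))

# The "funding_rounds" table: one row per array element
company %>%
  enter_object(funding_rounds) %>%
  gather_array %>%
  spread_values(round = jstring(round), raised = jnumber(raised))
```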
Understanding the structure of your JSON data is critical before you begin
analyzing the data. In this vignette, we use tidyjson
to inspect the
structure of JSON data and then create various visualizations to help
understand a complex JSON dataset.
For a refresher on JSON, see the JSON specification, a very concise summary of how JSON is formatted. In essence, there are three kinds of JSON data structures:

Per the specification, an object is a collection of name/value pairs, like '{"string": "value"}'.

An array is an ordered list of values, like '[1, 2, 3]'.

A value is a string, number, logical (true/false) or null scalar.
What is particularly interesting about JSON is that object values and array elements can themselves be objects or arrays, so these structures nest to arbitrary depth.
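A minimal sketch of this nesting with tidyjson (a toy document, not the companies data): an array holds two objects, and one of the object values is itself an array.

```r
library(tidyjson)

# An array of two objects; the second object's value is itself an array
nested <- '[{"a": 1}, {"a": [2, 3]}]'

# Unnest one level at a time and inspect the type of each value:
# the first "a" is a number, the second is an array
nested %>% gather_array %>% gather_object %>% json_types
```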
All of this makes visualizing JSON data structures very useful when embarking on an analysis.
Before we start, let's load tidyjson
along with other data manipulation and
visualization libraries, and set a seed so we get consistent results.
library(jsonlite)
library(dplyr)
library(purrr)
library(magrittr)
library(forcats)
library(ggplot2)
library(igraph)
library(RColorBrewer)
library(wordcloud)
library(viridis)
library(listviewer)
library(tidyjson)
set.seed(1)
Let's work with the companies dataset included in the tidyjson package, originating at 'jsonstudio'. It is a `r class(companies)` vector of `r length(companies) %>% format(big.mark = ',')` JSON strings, each describing a startup company.
We can start by finding out how complex each record is by using json_complexity:
co_length <- companies %>% json_complexity
Then we can visualize the distribution of company documents by complexity on a log-scale:
co_length %>%
  ggplot(aes(complexity)) +
  geom_density() +
  scale_x_log10() +
  annotation_logticks(side = 'b')
It appears that some companies have complexity less than 10, while others are in the hundreds or even thousands. The median is `r median(co_length$complexity)`.
Let's pick an example that is particularly small to start with:
co_examp_index <- which(co_length$complexity == 20L)[1]
co_examp <- companies[co_examp_index]
co_examp
Even for such a small example it's hard to understand the structure from the
raw JSON. We can instead use listviewer::jsonedit
to view it:
co_examp %>% jsonedit(mode = "code")
This is great for understanding a single JSON document. But many of the objects are empty arrays, and so give us very little insight into the structure of the collection as a whole.
To start working with the entire collection, let's use the json_structure function in tidyjson, which gives us a data.frame where each row corresponds to an object, array or scalar in the JSON document.
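As a quick illustration of the one-row-per-node idea before running it on the collection (a sketch on a toy document), json_structure yields one row for the root object, one for the nested array, and one for each of its elements:

```r
library(tidyjson)

# A toy document: an object containing an array of two numbers.
# json_structure() returns one row per node: the root object, the
# "a" array, and each of the numbers 1 and 2 (four rows in total).
'{"a": [1, 2]}' %>% json_structure
```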
co_struct <- companies %>% sample(5) %>% json_structure
print(co_struct)
We can then aggregate all of the object names across the entire collection,
excluding null
values to count the number of documents with meaningful data
under each object name.
co_names <- co_struct %>%
  filter(type != "null" & !is.na(name)) %>%
  group_by(level, name, type) %>%
  summarize(ndoc = n_distinct(document.id))
co_names
We can get a quick overview of the most common names using a wordcloud.
co_names %$% wordcloud(name, ndoc, scale = c(1.5, .1), min.freq = 100)
Alternatively, we can visualize all the names in ggplot2.
co_names %>%
  ungroup %>%
  group_by(type) %>%
  arrange(desc(ndoc), level) %>%
  mutate(rank = 1:n()) %>%
  ggplot(aes(1, rank)) +
  geom_text(aes(label = name, color = ndoc)) +
  scale_y_reverse() +
  facet_grid(. ~ type) +
  theme_void() +
  theme(legend.position = "bottom") +
  scale_color_viridis(direction = -1)
This shows that there are many common strings and arrays, and many uncommon objects, strings and numbers in the documents.
To really understand the structure of a document, we need to visualize it as a graph. Let's define a plot_json_graph function that uses igraph to visualize our JSON documents:
# Plots an igraph visualization of a JSON document
#
# @param .x a JSON string or tbl_json object
# @param legend add a type color legend automatically
# @param vertex.size the size of the vertices
# @param edge.color the color for the edges
# @param edge.width the width of the edge lines
# @param show.labels should object names be shown
# @param plot should the plot be rendered?
# @param ... further arguments to igraph::plot.igraph
plot_json_graph <- function(.x, legend = TRUE, vertex.size = 6,
                            edge.color = 'grey70', edge.width = .5,
                            show.labels = TRUE, plot = TRUE, ...) {

  if (!is.tbl_json(.x)) .x <- as.tbl_json(.x)
  if (nrow(.x) != 1) stop("nrow(.x) not equal to 1")

  structure <- .x %>% json_structure

  type_colors <- RColorBrewer::brewer.pal(6, "Accent")

  graph_edges <- structure %>%
    filter(!is.na(parent.id)) %>%
    select(parent.id, child.id)

  graph_vertices <- structure %>%
    transmute(child.id,
              vertex.color = type_colors[as.integer(type)],
              vertex.label = name)

  if (!show.labels)
    graph_vertices$vertex.label <- rep(NA_character_, nrow(graph_vertices))

  g <- igraph::graph_from_data_frame(graph_edges, vertices = graph_vertices,
                                     directed = FALSE)

  if (plot) {
    op <- par(mar = c(0, 0, 0, 0))
    igraph::plot.igraph(g,
                        vertex.color = igraph::V(g)$vertex.color,
                        vertex.size = vertex.size,
                        vertex.label = igraph::V(g)$vertex.label,
                        vertex.frame.color = NA,
                        layout = layout_with_kk,
                        edge.color = edge.color,
                        edge.width = edge.width,
                        ...)
    if (legend)
      legend(x = -1.3, y = -.6, levels(structure$type), pch = 21,
             col = "white", pt.bg = type_colors, pt.cex = 2, cex = .8,
             bty = "n", ncol = 1)
    par(op)
  }

  invisible(g)
}
Let's use the plot_json_graph
function to look at our simple example:
'{"object" : {"name": 1}, "array" : ["a", "b"], "string" : "value", "number" : 1, "logical": true, "null" : null}' %>% plot_json_graph
plot_json_graph
produces a graph where every node in the JSON document appears
as a vertex, and parent nodes are connected to children through edges. The color
is coded to indicate what type each node is, and vertices are labeled based on
the name (for objects) used to identify the node.
Now let's look at a real company example:
co_examp %>% plot_json_graph
For this company, almost all of the data is in the top level object.
Now, let's create a function that plots a panel of these graphs:
plot_json_graph_panel <- function(json, nrow, ncol, ...) {

  # Set up grid
  op <- par(mfrow = c(nrow, ncol))

  # Plot at most nrow * ncol of the documents
  indices <- seq_along(json) %>% keep(`<=`, nrow * ncol)

  for (i in indices) {
    plot_json_graph(json[[i]], ...)
    if ("names" %in% names(attributes(json)))
      title(names(json)[i], col.main = 'red')
  }

  par(op)

  invisible(NULL)
}
And let's look at several simultaneously:
plot_json_graph_panel(companies %>% sample(5), 2, 3, legend = FALSE, show.labels = FALSE, vertex.size = 4)
Clearly there is a huge amount of variety in the JSON documents!
Let's look at the most complex example:
most_complex <- companies[which(co_length$complexity == max(co_length$complexity))]
most_complex_name <- most_complex %>%
  spread_values(name = jstring(name)) %>%
  extract2("name")
The most complex company is `r most_complex_name`! Let's try to plot it:
plot_json_graph(most_complex, show.labels = FALSE, vertex.size = 2)
That is just too big. There are many arrays of complex objects that are repetitive in structure. Instead, we can simplify the structure by using json_schema:
most_complex %>% json_schema %>% jsonedit(mode = "code")
We can visualize this as a graph, and get more meaningful coloring of the terminal nodes by instructing json_schema to use type = "value":
most_complex %>% json_schema(type = "value") %>% plot_json_graph
This is overwhelmed by top-level scalar objects. We can instead visualize only the more complex objects:
most_complex %>%
  gather_object %>%
  json_types %>%
  json_complexity %>%
  filter(type %in% c('array', 'object') & complexity >= 15) %>%
  split(.$name) %>%
  map(json_schema, type = "value") %>%
  plot_json_graph_panel(3, 3, legend = FALSE)
Now let's use this insight to structure funding and geo data for a visualization.
First, let's get funding round data:
rounds <- companies %>%
  enter_object(funding_rounds) %>%
  gather_array %>%
  spread_values(
    round = jstring(round_code),
    currency = jstring(raised_currency_code),
    raised = jnumber(raised_amount)
  )
rounds %>% head
Now, let's get geo data:
geos <- companies %>%
  enter_object(offices) %>%
  gather_array %>%
  spread_values(
    country = jstring(country_code),
    state = jstring(state_code),
    description = jstring(description)
  )
geos %>% head
Now, let's join together the data for the US companies, assuming that the
first office in the array is the headquarters (seems reasonable from a quick
visual inspection of geos
).
hqs <- geos %>%
  filter(array.index == 1) %>%
  filter(country == "USA") %>%
  select(document.id, state)

rounds_usd <- rounds %>%
  filter(currency == "USD") %>%
  filter(!is.na(raised)) %>%
  select(document.id, round, raised)

rounds_by_geo <- inner_join(rounds_usd, hqs, by = "document.id") %>%
  as_tibble()
Now we can visualize the results:
round_prep <- rounds_by_geo %>%
  dplyr::filter(!is.na(state)) %>%
  mutate(
    round = round %>%
      forcats::fct_collapse(
        "angel" = c("seed", "angel"),
        "d-f" = c("d", "e", "f"),
        "other" = c("grant", "partial", "post_ipo_equity",
                    "private_equity", "debt_round", "unattributed")
      ) %>%
      forcats::fct_relevel("angel", "a", "b", "c", "d-f", "other")
  ) %>%
  mutate(state = state %>% forcats::fct_lump(2))

g <- ggplot(round_prep, aes(state, raised, fill = state)) +
  geom_violin() +
  scale_y_log10() +
  annotation_logticks(side = 'l') +
  facet_grid(. ~ round) +
  theme(legend.position = "bottom") +
  labs(x = "", y = "Amount Raised (USD)")
g
This shows how the amounts raised vary by funding round and across the most common headquarters states.