ropensci/rdflib: Tools to Manipulate and Query Semantic Data

rdflib is really just a lightweight wrapper around two existing R packages, redland and jsonld, which are themselves (less trivial) wrappers around existing libraries (the redland C library and the JSON-LD JavaScript implementation). Those libraries in turn implement a set of W3C standards for the representation of linked data. rdflib has two key features: a simpler, higher-level interface for the common tasks performed in the redland library (the user does not have to manage world, model and storage objects by default just to perform standard operations and conversions), and integration with the now-popular JSON-LD serialization, which is not part of the redland library.

library(rdflib)
library(magrittr) # pipes

## JSON toolkit 
library(jsonld)
library(jsonlite)
library(jqr)

## For accessing some remote data sources
library(httr)
library(xml2)
library(readr)

## for some typical data processing
library(dplyr)
library(lubridate)
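As a minimal sketch of the higher-level interface described above, the following builds a one-triple graph and serializes it without touching any world, model or storage objects. The example.org URI and the title are made up for illustration.

```r
library(rdflib)

# Create an empty graph; rdflib sets up the underlying redland objects for us
graph <- rdf()

# Add a single triple (hypothetical subject URI and title, for illustration only)
rdf_add(graph,
        subject   = "http://example.org/paper1",
        predicate = "http://schema.org/name",
        object    = "An example paper")

# Serialize to N-Triples in a temporary file and read it back as plain text
out <- tempfile(fileext = ".nt")
rdf_serialize(graph, out, format = "ntriples")
triples <- readLines(out)
print(triples)
```

The same graph object could be passed to rdf_serialize() with a different format argument, such as "jsonld" or "rdfxml", to convert between serializations.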

SPARQL queries on JSON-LD data

One example of this utility is the ability to perform graph queries using SPARQL on JSON-LD datasets. The SPARQL query language is analogous to other query languages such as SQL, but instead of working on an existing set of tables in a relational database, it queries data from a triplestore.

This is sometimes called "schema on read", since our query will define the schema (i.e. the structure or skeleton of the data.frame) our data is returned in. Because SPARQL is a graph query language, it is easy to construct a query that would be cumbersome in SQL or even in recursive tree query languages like jq, both of which require some knowledge of how the stored data is organized.
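A small sketch of "schema on read", assuming a toy graph with a made-up person (the example.org URI and the names are hypothetical): the variables named in the SELECT clause decide which columns come back, independent of how the triples were stored.

```r
library(rdflib)

# Toy graph: two facts about one (hypothetical) person
graph <- rdf()
rdf_add(graph, "http://example.org/alice", "http://schema.org/givenName", "Alice")
rdf_add(graph, "http://example.org/alice", "http://schema.org/familyName", "Smith")

# The SELECT clause defines the columns of the returned table
query <- 'PREFIX schema: <http://schema.org/>
SELECT ?given ?family
WHERE {
  ?person schema:givenName ?given .
  ?person schema:familyName ?family .
}'

people <- rdf_query(graph, query)
print(people)
```

Changing the SELECT line (for example to return only ?family) would change the shape of the result without any change to the underlying data.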

To illustrate this, consider the following example. NSF asks me to list all of my co-authors within the past four years as conflicts of interest (COI). Here is a query that, for all papers where I am an author, returns a table of the given name, family name and year of publication of each co-author:

ex <- system.file("extdata/vita.json", package="rdflib")
vita <- rdf_parse(ex, "jsonld")

sparql <-
 'PREFIX schema: <http://schema.org/>

  SELECT ?coi_given ?coi_family ?year

  WHERE { 
    ?paper a schema:ScholarlyArticle . 
    ?paper schema:author ?authors .
    ?paper schema:dateCreated ?year . 
    ?authors schema:familyName ?coi_family .
    OPTIONAL { ?authors schema:givenName ?coi_given . }

    FILTER ( ?coi_family != "Boettiger" )
}
'

coi <- vita %>% rdf_query(sparql)

DT::datatable(coi)

Now that we have rectangular data, we can tidy things up a bit more. (In principle I believe we could have done this in SPARQL as well.)

coi2 <- 
  coi %>% 
  ## join names, year as Date
  mutate(year = as.Date(year), name = paste0(coi_family, ", ", coi_given)) %>% 
  ## interaction in last 4 years
  filter(year > lubridate::today() - lubridate::years(4)) %>% 
  ## For each person, list only most recent date
  group_by(name) %>% 
  summarise(year = max(year))

DT::datatable(coi2)

Yay, that wasn't so bad. Rectangling JSON without a graph query is not as easy. This can be immensely frustrating to do using basic iteration operations with purrr, even with its powerful syntax. Rectangling is a bit better with a tree-based query language, like a jq query. The only limitation here is that we have to know just a little about how our data is structured, since there are multiple tree structures that correspond to the same graph (or we could just use a JSON-LD frame first, but that adds another step to the puzzle).

Here's the same extraction on the same data, but with a jq query:

coi_jq <- 
  readr::read_file(ex) %>%
  jqr::jq(
     '."@reverse".author[]  | 
       { year: .dateCreated, 
         author: .author[] | [.givenName, .familyName]  | join(" ")
       }') %>%
  jqr::combine() %>%
  jsonlite::fromJSON()

coi3 <- 
  coi_jq %>% 
  mutate(year = as.Date(year)) %>%
  filter(year > lubridate::today() - lubridate::years(4)) %>% 
  filter(!grepl("Boettiger", author)) %>%
  ## For each person, list only most recent date
  group_by(author) %>% 
  summarise(year = max(year))

DT::datatable(coi3)
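To see why a jq query depends on the tree layout, here is a small sketch with two made-up JSON documents that describe the same fact (a paper with an author named Lovelace) nested in opposite directions. The documents and names are hypothetical; each layout needs its own jq filter to reach the same value.

```r
library(jqr)

# Layout 1: the paper is the root, authors are nested inside it
by_paper <- '{"@type": "ScholarlyArticle",
              "author": [{"familyName": "Lovelace"}]}'

# Layout 2: the person is the root, papers hang off "@reverse"
by_person <- '{"familyName": "Lovelace",
               "@reverse": {"author": [{"@type": "ScholarlyArticle"}]}}'

# Same information, but a different path for each tree
name_from_paper  <- jq(by_paper,  '.author[].familyName')
name_from_person <- jq(by_person, '.familyName')

print(name_from_paper)
print(name_from_person)
```

A SPARQL query against the corresponding graph would not need to change between the two layouts, which is the point made above.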

Turning RDF-XML into more friendly JSON

rdflib can also be useful as a quick and lossless way to parse common data formats (e.g. RDF/XML) into something more R-friendly. In this vignette, we illustrate how this might work with a simple example of some citation data returned in RDF/XML format from CrossRef.

Let's begin by reading in some RDF/XML data from CrossRef by querying a DOI and requesting rdf+xml MIME type (via Content Negotiation):

xml <- "ex.xml"

"https://doi.org/10.1002/ece3.2314" %>%
  httr::GET(httr::add_headers(Accept="application/rdf+xml")) %>%
  httr::content(as = "parsed", type = "application/xml") %>%
  xml2::write_xml(xml)

## Alternatively, use the copy of this file shipped with the package
xml <- system.file("extdata/ex.xml", package="rdflib")

Our rdflib functions perform the simple task of parsing this RDF/XML file into R (as an rdf class object backed by redland) and then writing it back out in the JSON-LD serialization:

rdf_parse(xml, "rdfxml") %>% 
  rdf_serialize("ex.json", "jsonld")

and we now have a JSON file. We can clean this file up a bit by replacing the long URIs with short prefixes by "compacting" the file into a specific JSON-LD context. FOAF, OWL, and Dublin Core are all recognized by schema.org, so we need not declare them at all here. The PRISM and BIBO ontologies are not, so we simply declare them as additional prefixes:

context <- 
'{ "@context": [
    "http://schema.org",
  {
    "prism": "http://prismstandard.org/namespaces/basic/2.1/",
    "bibo": "http://purl.org/ontology/bibo/"
  }]
}'
json <- jsonld_compact("ex.json", context)
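The effect of compaction can be sketched on a tiny, self-contained document that needs no network access. This is an illustration under assumed inputs (the one-property document below is made up, and only the bibo prefix is declared), not the CrossRef record itself.

```r
library(jsonld)

# A made-up document using a full BIBO property URI as its key
doc <- '{"http://purl.org/ontology/bibo/volume": "6"}'

# A context that declares only the bibo prefix
ctx <- '{"@context": {"bibo": "http://purl.org/ontology/bibo/"}}'

# Compaction rewrites the long URI using the declared prefix
compacted <- jsonld_compact(doc, ctx)
cat(compacted)
```

The compacted result is a JSON string, so it can be handed to jsonlite::fromJSON() for further processing or written to disk with writeLines().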