This article describes how glitter works, and why. At this stage of glitter's development, feedback and feature requests are most welcome!

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(glitter)

The glitter package helps you write SPARQL queries by implementing an internal domain-specific language (DSL): with glitter, you write code that mostly looks like R code, and end up with a SPARQL query. For instance:

library("glitter")
query <- spq_init() %>%
  spq_add("?item wdt:P31 wd:Q13442814") %>%
  spq_add("?item rdfs:label ?itemTitle") %>%
  spq_filter(str_detect(str_to_lower(itemTitle), 'wikidata')) %>%
  spq_filter(lang(itemTitle) == "en") %>%
  spq_head(n = 5)

query

The R code should therefore be easier to write and to read. The function names and syntax are meant to be reminiscent of the tidyverse and of base R.

Code using glitter will feature:

glitter query object

The query object is a list with elements such as the variables (vars), filters, etc. Later we might make it an actual class, maybe an R6 one?

It is built by the different calls to spq_ functions. The SPARQL query string is assembled by spq_assemble(). Later we might add some linting at that stage.
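To make the idea of a list-based query object concrete, here is a minimal base R sketch. It is not glitter's implementation: the toy_ helpers and the element names triples and filters are hypothetical, chosen only to show how successive calls could extend a list that a final step turns into a string.

```r
# Hypothetical helpers (NOT glitter's code): the query is a plain list that
# each step extends, and a final step assembles it into a SPARQL string.
toy_init <- function() {
  list(vars = character(), triples = character(), filters = character())
}

toy_add <- function(query, triple) {
  # Record the triple pattern and every ?variable it mentions.
  found <- regmatches(triple, gregexpr("\\?[A-Za-z_][A-Za-z0-9_]*", triple))[[1]]
  query$triples <- c(query$triples, triple)
  query$vars <- unique(c(query$vars, found))
  query
}

toy_filter <- function(query, condition) {
  query$filters <- c(query$filters, condition)
  query
}

toy_assemble <- function(query) {
  # sprintf() returns a zero-length vector when there is nothing to format.
  body <- c(
    sprintf("  %s .", query$triples),
    sprintf("  FILTER(%s)", query$filters)
  )
  select <- paste("SELECT", paste(query$vars, collapse = " "))
  paste(c(select, "WHERE {", body, "}"), collapse = "\n")
}

query <- toy_init()
query <- toy_add(query, "?item wdt:P31 wd:Q13442814")
query <- toy_add(query, "?item rdfs:label ?itemTitle")
query <- toy_filter(query, "lang(?itemTitle) = 'en'")
cat(toy_assemble(query), "\n")
```

Keeping the query as data until the very end is what would make a later linting step possible: the assembling function sees the whole query at once.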

glitter tooling

Under the hood, glitter uses

More details in the next sections.

spq_add()

spq_add() works differently from the other spq_ functions because it looks closer to SPARQL.

Clearly something like spq_add(query, "?item wdt:P31 wd:Q13442814") does not look like R code. The motivation for this is:

Triple patterns are parsed by decompose_triples(), which uses string manipulation.
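As an illustration of what parsing a triple pattern with string manipulation can look like, here is a small base R sketch. The helper split_triple() is hypothetical and much simpler than whatever decompose_triples does internally; in particular it assumes the three parts contain no spaces.

```r
# Hypothetical helper, not glitter's decompose_triples().
# Assumes subject, verb and object are separated by whitespace and contain none.
split_triple <- function(triple) {
  parts <- strsplit(trimws(triple), "\\s+")[[1]]
  stopifnot(length(parts) == 3)
  list(subject = parts[1], verb = parts[2], object = parts[3])
}

str(split_triple("?item wdt:P31 wd:Q13442814"))
```

A literal object with spaces, such as a quoted label, would break this naive split, which hints at why real parsing needs more care.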

Now, if one wants to go full DSL, it is possible, via spq_filter() and spq_mutate().

The triple pattern in spq_add(query, "?item wdt:P31 wd:Q13442814") means finding items that are an instance of ("wdt:P31") a scholarly article ("wd:Q13442814"). With glitter, you can also write it as:

spq_init() %>%
  spq_filter(item == wdt::P31(wd::Q13442814))

This looks more like a normal tidyverse pipeline. Note that the namespacing here is done the R way i.e. wdt::P31 as opposed to "wdt:P31".

Similarly,

spq_init() %>%
  spq_add("wd:Q331676 wdt:P1843 ?statement")

adds a variable that is "wdt:P1843" of Sonchus oleraceus ("wd:Q331676"). It can be written:

spq_init() %>%
  spq_mutate(statement = wdt::P1843(wd::Q331676))

Other spq_ functions

The other spq_ functions, spq_arrange(), spq_select(), spq_mutate(), spq_filter() and spq_summarize(), are the core of the DSL.

They have ... as arguments where three different things can be passed:

The names of their other arguments start with a dot to prevent name clashes.
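The reason for the dot prefix can be shown with a small base R sketch. The function toy_select() and its .label argument are hypothetical, not part of glitter; the point is only that a user-supplied name such as label lands in ... instead of colliding with a function argument.

```r
# Hypothetical function, not part of glitter: its own arguments start with a
# dot, so any undotted name supplied by the user ends up in `...`.
toy_select <- function(.query, ..., .label = NULL) {
  # Capture the user's expressions unevaluated, together with their names.
  dots <- as.list(substitute(list(...)))[-1]
  list(user_names = names(dots), label = .label)
}

res <- toy_select(list(), label = str_to_upper(name), .label = "an option")
res$user_names  # "label": treated as a user-defined variable name
res$label       # "an option": matched to the function's own argument
```

Because arguments placed after ... must be matched by their exact name, label = ... can never be mistaken for .label.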

How do we differentiate these three things that users can pass?

head(glitter::all_correspondences)

So all instances of n(blabla) become COUNT(blabla). We also transform argument names. Look at the "SELECT" statement below: the str_c() function becomes GROUP_CONCAT() and its sep argument becomes SEPARATOR. Also note that, in SPARQL, the argument comes after a semicolon, not a comma like in R.

spq_init() %>%
  spq_summarise(authors = str_c(name, sep = ', '))

Later, we need to document these correspondences better, and we need to stress test the DSL with more cases using arguments.
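The kind of translation described in this section can be sketched in base R with two lookup tables, one for function names and one for argument names. This is a toy under stated assumptions, not glitter's translator: toy_functions, toy_arguments and toy_translate() are hypothetical names, and the tables only hold the two correspondences mentioned above.

```r
# Hypothetical lookup tables; glitter's real table is glitter::all_correspondences.
toy_functions <- c(n = "COUNT", str_c = "GROUP_CONCAT")
toy_arguments <- c(sep = "SEPARATOR")

toy_translate <- function(expr) {
  if (is.name(expr)) return(paste0("?", as.character(expr)))  # bare name -> ?variable
  if (!is.call(expr)) return(deparse(expr))                    # literal, e.g. a string
  fn <- as.character(expr[[1]])
  args <- as.list(expr)[-1]
  arg_names <- names(args)
  if (is.null(arg_names)) arg_names <- rep("", length(args))
  translated <- vapply(args, toy_translate, character(1))
  positional <- translated[arg_names == ""]
  named <- translated[arg_names != ""]
  # Named R arguments are renamed and written as NAME = value.
  named_sparql <- sprintf("%s = %s", toy_arguments[names(named)], named)
  pieces <- c(paste(positional, collapse = ", "), named_sparql)
  # In SPARQL, modifiers such as SEPARATOR follow a semicolon.
  sprintf("%s(%s)", toy_functions[[fn]], paste(pieces, collapse = "; "))
}

cat(toy_translate(quote(n(article))), "\n")
# COUNT(?article)
cat(toy_translate(quote(str_c(name, sep = ", "))), "\n")
# GROUP_CONCAT(?name; SEPARATOR = ", ")
```

An unknown function name makes this sketch fail with a subscript error, which is one of the cases a real translator has to handle explicitly.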

Special case of spq_filter() and spq_mutate()

spq_filter() receives R-looking fragments that are translated into SPARQL snippets for FILTER... or triple patterns. spq_mutate() receives R-looking fragments that are translated into SPARQL snippets for SELECT... or triple patterns.

At the moment the detection of which is which is based on the :: operator: if the R-looking fragment contains ::, we assume it will become a triple pattern. Later, we need to make this more robust as the function spq_set() makes it easier to create synonyms for any subject/verb/object via SPARQL VALUES.
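A detection rule of this kind can be sketched in a couple of lines of base R. The helper looks_like_triple() is hypothetical and only mimics the rule described above by searching the deparsed expression for ::.

```r
# Hypothetical helper: guess whether an R-looking fragment is meant to become
# a triple pattern by looking for `::` in its text.
looks_like_triple <- function(expr) {
  grepl("::", paste(deparse(expr), collapse = " "), fixed = TRUE)
}

looks_like_triple(quote(item == wdt::P31(wd::Q13442814)))  # TRUE
looks_like_triple(quote(lang(itemTitle) == "en"))          # FALSE
looks_like_triple(quote(str_detect(title, "a::b")))        # TRUE, a false positive
```

The last call shows why a purely textual rule is fragile: a string literal that happens to contain :: is misclassified.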

When we assume spq_filter()/spq_mutate() has received an R-looking fragment meant to be translated to a triple pattern, the fragment is parsed accordingly, keeping in mind that the order of subject, verb and object is not the same in the two cases:
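Judging from the two pairs of equivalent examples earlier in this article, the mapping could be sketched as follows in base R. The toy_ helpers are hypothetical and are not glitter's parser; they only reproduce the ordering visible in those examples.

```r
# Turn `prefix::name` into "prefix:name"; a bare name becomes a ?variable.
toy_term <- function(x) {
  if (is.call(x) && identical(x[[1]], as.name("::"))) {
    paste0(as.character(x[[2]]), ":", as.character(x[[3]]))
  } else {
    paste0("?", as.character(x))
  }
}

# spq_filter()-like case, variable == verb(object): the variable is the subject.
toy_filter_triple <- function(expr) {
  stopifnot(identical(expr[[1]], as.name("==")))
  rhs <- expr[[3]]
  paste(toy_term(expr[[2]]), toy_term(rhs[[1]]), toy_term(rhs[[2]]))
}

# spq_mutate()-like case, variable = verb(subject): the variable is the object.
toy_mutate_triple <- function(name, expr) {
  paste(toy_term(expr[[2]]), toy_term(expr[[1]]), paste0("?", name))
}

toy_filter_triple(quote(item == wdt::P31(wd::Q13442814)))
# "?item wdt:P31 wd:Q13442814"
toy_mutate_triple("statement", quote(wdt::P1843(wd::Q331676)))
# "wd:Q331676 wdt:P1843 ?statement"
```

In the filter-like form the new variable sits in subject position, while in the mutate-like form it sits in object position, which is why the two cases cannot share a single ordering rule.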

R CMD check hack

The examples using something like

spq_init() %>%
  spq_filter(item == wdt::P31(wd::Q13442814))

were flagged by R CMD check as if wdt were a package dependency that had to be declared. This is understandable. To bypass it, these examples are not written as formal examples: they are R chunks in a section called "Some examples", which means they aren't checked. Thankfully we have similar code in the real tests!

Future work

The issue tracker of glitter is quite representative of future work, as well as all sentences starting with "Later" in this article. As stated at the very beginning of this article, your ideas and comments are welcome.



lvaudor/glitter documentation built on Jan. 30, 2024, 1:34 a.m.