In cboettig/prov: A Semantic Data Provenance Index

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

prov

prov lets you easily generate provenance records for common workflows using DCAT-2 and PROV-O ontologies, using the JSON-LD serialization for semantic (RDF) data.

Motivation

The goal of prov is to provide an index for automated workflows which use content-based storage. Storing data using a content-based system provides a convenient mechanism for data management: every unique version is automatically stored. Running the workflow repeatedly and generating identical results creates the same output file, but different results create different output. A simple implementation of such a system is to name each file based on the SHA-256 hash of its contents. The downside of content-based storage is that it can easily become difficult to keep track of what's what. prov provides a metadata record to step into that gap. prov provides a relatively high-level description of what scripts and what input data produced what results.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("cboettig/prov")

Quickstart

library(prov)

We'll use temporary files for illustration only (CRAN policy). Really you would have paths to all these files (or URI identifiers for them).

input_data <- tempfile(fileext = ".csv")
output_data <- tempfile(fileext = ".csv")
code <- tempfile(fileext = ".R")
p <- tempfile(fileext = ".json")

Here's a minimal workflow that generates output data from input data:

out <- lm(mpg ~ disp, data = mtcars)

write.csv(mtcars, input_data)
write.csv(out$coefficients, output_data)

Let's make a dummy .R script just for this example.

writeLines("out <- lm(mpg ~ disp, data = mtcars)", code)

We are now ready to stick the pieces together into a provenance record!

write_prov(input_data, code, output_data, prov = p)

Let's take a look at the results:

writeLines(readLines(p))

If we include a title, these will be grouped as to group these into a Dataset. We use append=FALSE to overwrite the previous record.

write_prov(input_data, code, output_data, prov = p,
            title = "example dataset with provenance",  append= FALSE)

unlink(p)

Reproducible workflows

A key feature of content-based provenance trace is that we can run this again and again: If we use the same data and get the same result, all our objects have the same ID. By default, these repeated calls append results to the prov file, but the only new data added indicates a new Activity running the code at a different time, but generating the same result.

out <- lm(mpg ~ disp, data = mtcars)
write.csv(mtcars, input_data)
write.csv(out$coefficients, output_data)
write_prov(input_data, code, output_data, prov = p)


out <- lm(mpg ~ disp, data = mtcars)
write.csv(mtcars, input_data)
write.csv(out$coefficients, output_data)
write_prov(title = "An example dataset",
           input_data, code, output_data, prov = p)

Note that if write_prov gets neither a title, description, or creator, it will not generate a Dataset element, but instead serialize a graph of individual distribution objects.

writeLines(readLines(p))

Schema.org compatibility

We can generate provenance data directly in Schema.org Dataset by setting the optional schema argument when writing to JSON-LD:

write_prov(title = "An example dataset",
           creator = list(givenName = "John", familyName = "Public"),
           input_data, code, output_data, prov = p,
           schema = "http://schema.org", append = FALSE)

Note that creator must be given a named list with appropriate schema.org elements, as it is uninterpreted.

writeLines(readLines(p))

Schema Dot Org (sdo) Dataset, used in Google Dataset Search, is based upon the DCAT Dataset concept, and many terms can crosswalk directly between DCAT2 and http://schema.org namespaces. prov allows conversion between DCAT2 and Schema.org representation using an alternative context. We are not aware of a mapping for most PROV-O terms, but have based the mapping around translating prov:Activity to sdo:Action, prov:used to sdo:object, and prov:generated to sdo:result.