knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
prov
lets you easily generate provenance records for common workflows using DCAT-2 and PROV-O ontologies, using the JSON-LD serialization for semantic (RDF) data.
The goal of prov
is to provide an index for automated workflows which use content-based storage. Storing data using a content-based system provides a convenient mechanism for data management: every unique version is automatically stored. Running the workflow repeatedly and generating identical results creates the same output file, but different results create different output. A simple implementation of such a system is to name each file based on the SHA-256 hash of its contents. The downside of content-based storage is that it can easily become difficult to keep track of what's what. prov
provides a metadata record to step into that gap. prov
provides a relatively high-level description of what scripts and what input data produced what results.
You can install the development version from GitHub with:
# install.packages("devtools") devtools::install_github("cboettig/prov")
library(prov)
We'll use temporary files for illustration only (CRAN policy). Really you would have paths to all these files (or URI identifiers for them).
input_data <- tempfile(fileext = ".csv") output_data <- tempfile(fileext = ".csv") code <- tempfile(fileext = ".R") p <- tempfile(fileext = ".json")
Here's a minimal workflow that generates output data from input data:
out <- lm(mpg ~ disp, data = mtcars) write.csv(mtcars, input_data) write.csv(out$coefficients, output_data)
Let's make a dummy .R
script just for this example.
writeLines("out <- lm(mpg ~ disp, data = mtcars)", code)
We are now ready to stick the pieces together into a provenance record!
write_prov(input_data, code, output_data, prov = p)
Let's take a look at the results:
writeLines(readLines(p))
If we include a title
, these will be grouped as to group these into a Dataset.
We use append=FALSE
to overwrite the previous record.
write_prov(input_data, code, output_data, prov = p, title = "example dataset with provenance", append= FALSE)
unlink(p)
A key feature of content-based provenance trace is that we can run this again and again:
If we use the same data and get the same result, all our objects have the same ID. By
default, these repeated calls append results to the prov
file, but the only new data
added indicates a new Activity
running the code at a different time, but generating
the same result.
out <- lm(mpg ~ disp, data = mtcars) write.csv(mtcars, input_data) write.csv(out$coefficients, output_data) write_prov(input_data, code, output_data, prov = p) out <- lm(mpg ~ disp, data = mtcars) write.csv(mtcars, input_data) write.csv(out$coefficients, output_data) write_prov(title = "An example dataset", input_data, code, output_data, prov = p)
Note that if write_prov
gets neither a title, description, or creator, it will not generate a Dataset element, but instead serialize a graph
of individual distribution
objects.
writeLines(readLines(p))
We can generate provenance data directly in Schema.org Dataset by setting the
optional schema
argument when writing to JSON-LD:
write_prov(title = "An example dataset", creator = list(givenName = "John", familyName = "Public"), input_data, code, output_data, prov = p, schema = "http://schema.org", append = FALSE)
Note that creator
must be given a named list with appropriate schema.org elements, as it is uninterpreted.
writeLines(readLines(p))
Schema Dot Org (sdo
) Dataset
, used in Google Dataset Search, is based upon the DCAT Dataset
concept, and many terms can crosswalk directly between DCAT2 and http://schema.org namespaces. prov
allows conversion between DCAT2 and Schema.org representation using an alternative context. We are not aware of a mapping for most PROV-O terms, but have based the mapping
around translating prov:Activity
to sdo:Action
, prov:used
to sdo:object
,
and prov:generated
to sdo:result
.
to_sdo(p)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.