describeWorkflow: Add data derivation information to a DataPackage

describeWorkflow    R Documentation

Add data derivation information to a DataPackage

Description

Add information about the relationships among DataObject members in a DataPackage, retrospectively describing how derived data were created from source data by a processing program such as an R script. These provenance relationships document the derived data well enough for users to reproduce the computations that created them and to trace the lineage of the derived data objects. The describeWorkflow method adds provenance relationships between a script that was executed, the files that it used as sources, and the derived files that it generated.

Usage

describeWorkflow(x, ...)

## S4 method for signature 'DataPackage'
describeWorkflow(
  x,
  sources = list(),
  program = NA_character_,
  derivations = list(),
  insertDerivations = TRUE,
  ...
)

Arguments

x

The DataPackage to add provenance relationships to.

...

Additional parameters

sources

A list of DataObjects for files that were read by the program. Alternatively, a list of DataObject identifiers can be specified as a list of character strings.

program

The DataObject created for the program, such as an R script. Alternatively, the DataObject identifier can be specified as a character string.

derivations

A list of DataObjects for files that were generated by the program. Alternatively, a list of DataObject identifiers can be specified as a list of character strings.

insertDerivations

A logical value. If TRUE then the provenance relationship prov:wasDerivedFrom will be used to connect every source and derivation. The default value is TRUE.
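The arguments above can be combined as in the following minimal sketch, which passes DataObject identifiers instead of DataObjects and turns off the prov:wasDerivedFrom links. The in-memory file contents are made up for illustration, and the sketch assumes that getRelationships returns a data frame with a predicate column.

```r
library(datapack)

dp <- new("DataPackage")

# Illustrative in-memory objects; the contents are placeholders, not real data.
progObj <- new("DataObject", format = "application/R",
               dataobj = charToRaw("x <- read.csv('in.csv')\n"))
inObj   <- new("DataObject", format = "text/csv",
               dataobj = charToRaw("a,b\n1,2\n"))
outObj  <- new("DataObject", format = "text/csv",
               dataobj = charToRaw("a,b\n2,4\n"))

# All three objects are added as members before the workflow is described.
for (obj in list(progObj, inObj, outObj)) {
  dp <- addMember(dp, obj)
}

# Pass identifiers (character strings) rather than the DataObjects themselves,
# and skip the direct source-to-derivation links.
dp <- describeWorkflow(dp,
                       sources = list(getIdentifier(inObj)),
                       program = getIdentifier(progObj),
                       derivations = list(getIdentifier(outObj)),
                       insertDerivations = FALSE)

rels <- getRelationships(dp)
# With insertDerivations = FALSE, no wasDerivedFrom predicate is expected.
hasDerivedFrom <- any(grepl("wasDerivedFrom", rels$predicate, fixed = TRUE))
print(hasDerivedFrom)
```

Leaving insertDerivations at its default of TRUE would additionally link each derivation to each source.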

Details

This method operates on a DataPackage to which DataObjects for the script, data sources (inputs), and data derivations (outputs) have already been added. It can also reference identifiers for objects that exist in other DataPackage instances. This allows a user to create either a standalone package that contains all of its source, script, and derived data, or a set of data packages that are chained together by derivation relationships between their members.

Provenance relationships are described following the ProvONE data model, which can be viewed at https://purl.dataone.org/provone-v1-dev. In particular, the following relationships are inserted (among others):

  • prov:used indicates which source data were used by a program execution

  • prov:wasGeneratedBy indicates which derived data were created by a program execution

  • prov:wasDerivedFrom indicates the source data from which derived data were created using the program
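To see these relationships in a package, the data frame returned by getRelationships can be filtered by predicate. The sketch below is illustrative only: the object contents are placeholders, and it assumes that predicates are returned as full W3C PROV URIs (for example, ending in prov#wasGeneratedBy) and that the data frame has subject, predicate, and object columns.

```r
library(datapack)

dp <- new("DataPackage")

# Placeholder in-memory objects standing in for a script, its input, and its output.
progObj <- new("DataObject", format = "application/R",
               dataobj = charToRaw("x <- read.csv('in.csv')\n"))
inObj   <- new("DataObject", format = "text/csv",
               dataobj = charToRaw("a,b\n1,2\n"))
outObj  <- new("DataObject", format = "text/csv",
               dataobj = charToRaw("a,b\n2,4\n"))
for (obj in list(progObj, inObj, outObj)) {
  dp <- addMember(dp, obj)
}

dp <- describeWorkflow(dp, sources = list(inObj), program = progObj,
                       derivations = list(outObj))

rels <- getRelationships(dp)

# Keep only the three relationships listed above (assumed URI suffixes).
wanted <- c("used", "wasGeneratedBy", "wasDerivedFrom")
pattern <- sprintf("prov#(%s)$", paste(wanted, collapse = "|"))
provRels <- rels[grepl(pattern, rels$predicate),
                 c("subject", "predicate", "object")]
print(provRels)

# Which of the three predicates were found.
found <- vapply(wanted,
                function(p) any(grepl(sprintf("prov#%s$", p), rels$predicate)),
                logical(1))
print(found)
```

Other rows in the unfiltered data frame describe the program execution itself and are omitted here.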

See Also

The R 'recordr' package for run-time recording of provenance relationships.

Examples

library(datapack)
dp <- new("DataPackage")
# Add the script to the DataPackage
progFile <- system.file("./extdata/pkg-example/logit-regression-example.R", package="datapack")
progObj <- new("DataObject", format="application/R", filename=progFile)
dp <- addMember(dp, progObj)

# Add a script input to the DataPackage
inFile <- system.file("./extdata/pkg-example/binary.csv", package="datapack") 
inObj <- new("DataObject", format="text/csv", filename=inFile)
dp <- addMember(dp, inObj)

# Add a script output to the DataPackage
outFile <- system.file("./extdata/pkg-example/gre-predicted.png", package="datapack")
outObj <- new("DataObject", format="image/png", filename=outFile)
dp <- addMember(dp, outObj)

# Add the provenance relationships, linking the input and output to the script execution
# Note: 'sources' and 'derivations' can also be lists of DataObjects or DataObject identifiers
dp <- describeWorkflow(dp, sources = inObj, program = progObj, derivations = outObj)
# View the results
utils::head(getRelationships(dp))

datapack documentation built on June 11, 2022, 1:05 a.m.