DataPackageR-package: DataPackageR

DataPackageR-package    R Documentation

DataPackageR

Description

A framework to automate the processing, tidying and packaging of raw data into analysis-ready data sets as R packages.

Details

DataPackageR automates running data processing code, storing the tidied data sets in an R package, producing data documentation stubs, tracking data object fingerprints (md5 hashes), and tracking and incrementing a "DataVersion" string in the package DESCRIPTION file when raw data or data objects change. The user passes the data processing code to DataPackageR and specifies the names of the tidy data objects to be stored, documented and tracked in the final package. Raw data should be read from "inst/extdata", but large raw data files can be read from sources external to the package source tree.
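To make the idea of an md5 fingerprint concrete, here is a minimal base-R sketch. The helper object_md5() is hypothetical and is not part of DataPackageR; the package's own hashing may work differently. It only shows that a change to a data object produces a different hash, which is the kind of signal that can trigger a version bump.

```r
# Hypothetical helper (not DataPackageR's implementation): fingerprint an
# R object by hashing its serialized form on disk.
object_md5 <- function(obj) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  saveRDS(obj, tmp, compress = FALSE)
  unname(tools::md5sum(tmp))
}

tbl <- data.frame(x = 1:10)
hash_before <- object_md5(tbl)

# An identical object built the same way hashes to the same value
hash_same <- object_md5(data.frame(x = 1:10))

# Changing a single value changes the fingerprint
tbl$x[1] <- 99L
hash_after <- object_md5(tbl)

print(hash_before == hash_same)
print(hash_before == hash_after)
```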

Configuration is controlled via the config.yml file created at the package root. Its properties include a list of the R and Rmd files that are rendered or sourced, and which read the data and do the actual processing, and a list of the R object names created by those files. These objects are stored in the final package and are accessible via the data() API. The documentation for these objects is accessible via "?object-name", and md5 fingerprints of these objects are created and tracked.
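As a rough illustration of the two lists described above, the sketch below parses a small config.yml-style string with the yaml package and extracts the file names and object names. The key names used here (configuration, files, enabled, objects) are an assumption about the generated file's layout; check the config.yml that datapackage_skeleton() writes for your own package.

```r
# Assumed layout of a generated config.yml. The key names are illustrative
# and should be verified against the file in your package root.
config_text <- "
configuration:
  files:
    foo.Rmd:
      enabled: yes
  objects:
  - tbl
"

cfg <- yaml::yaml.load(config_text)

# Processing scripts that are rendered / sourced during the build
code_files <- names(cfg$configuration$files)

# Names of the R objects those scripts are expected to create
object_names <- cfg$configuration$objects

print(code_files)
print(object_names)
```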

The Rmd and R files used to process the objects are transformed into vignettes accessible in the final package so that the processing is fully documented.

A DATADIGEST file in the package source keeps track of the data object fingerprints. A DataVersion string is added to the package DESCRIPTION file and updated when these objects are updated or changed on subsequent builds.
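The version bookkeeping can be pictured with a small base-R sketch. It assumes a three-part DataVersion string and bumps only the last component; the function bump_data_version() and the starting value are hypothetical, and DataPackageR's actual increment rules may differ.

```r
# Write a throwaway DESCRIPTION-like file; field values are made up
desc_file <- tempfile()
writeLines(c("Package: MyDataPackage",
             "Version: 1.0",
             "DataVersion: 0.1.0"), desc_file)

# Hypothetical helper: increment the last component of DataVersion.
# This is an illustration, not DataPackageR's own logic.
bump_data_version <- function(path) {
  desc <- read.dcf(path)
  parts <- as.integer(strsplit(desc[1, "DataVersion"], ".", fixed = TRUE)[[1]])
  parts[3] <- parts[3] + 1L
  desc[1, "DataVersion"] <- paste(parts, collapse = ".")
  write.dcf(desc, path)
  invisible(unname(desc[1, "DataVersion"]))
}

new_version <- bump_data_version(desc_file)
stored_version <- unname(read.dcf(desc_file, fields = "DataVersion")[1, 1])
print(stored_version)
```

In a real data package the DataVersion is read back with data_version(), as in the Examples section.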

Once the package is built and installed, the data objects created in the package are accessible via the data() API.

Calling datapackage_skeleton() and passing in R / Rmd file names and R object names constructs a skeleton data package source tree and an associated config.yml file.

Calling package_build() sets the build process in motion.

Author(s)

Maintainer: Dave Slager dslager@scharp.org (ORCID) [contributor]

Authors:

Other contributors:

  • Paul Obrecht [contributor]

  • Ellis Hughes ellishughes@live.com (ORCID) [contributor]

  • Jimmy Fulp williamjfulp@gmail.com [contributor]

  • Marie Vendettuoli (ORCID) [contributor]

  • Jason Taylor jmtaylor@fredhutch.org [contributor]

  • Kara Woo (Kara reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/230>) [reviewer]

  • William Landau (William reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/230>) [reviewer]

See Also

Useful links:

Examples

# A simple Rmd file that creates one data object
# named "tbl".
if(rmarkdown::pandoc_available()){
f <- tempdir()
f <- file.path(f,"foo.Rmd")
con <- file(f)
writeLines("```{r}\n tbl = data.frame(1:10) \n```\n",con=con)
close(con)

# construct a data package skeleton with a temporary, randomly generated
# name (stored in pname) and pass in the Rmd file name with full path,
# and the name of the object(s) it creates.

pname <- basename(tempfile())
datapackage_skeleton(name=pname,
   path=tempdir(),
   force = TRUE,
   r_object_names = "tbl",
   code_files = f)

# call package_build to run the "foo.Rmd" processing and
# build a data package.
package_build(file.path(tempdir(), pname), install = FALSE)

# "install" the data package
devtools::load_all(file.path(tempdir(), pname))

# read the data version
data_version(pname)

# list the data sets in the package.
data(package = pname)

# The data objects are in the package source under "/data"
list.files(pattern="rda", path = file.path(tempdir(),pname,"data"), full.names = TRUE)

# The documentation that needs to be edited is in "/R"
list.files(pattern="R", path = file.path(tempdir(), pname,"R"), full.names = TRUE)
readLines(list.files(pattern="R", path = file.path(tempdir(),pname,"R"), full.names = TRUE))
# view the documentation with
?tbl
}

RGLab/DataPackageR documentation built on April 19, 2024, 7:33 p.m.