dpkg: Data Packages for R

library(dpkg)
library(magrittr)

An R package to read, write, and edit Data Package data and metadata. Unlike the existing R packages dpmr and datapkg, dpkg can be used to build and document Data Packages entirely within R. Note that this is a work in progress: function names and functionality may change based on feedback from the community.

This package is not on CRAN. To install in R, use devtools:

devtools::install_github("ezwelty/dpkg")

Quick introduction

To build a data package, assemble the data and add metadata to the various elements:

data <- data.frame(
  id = 1L %>% set_field(title = "Identifier"),
  value = 1.1,
  added = Sys.Date()
)
# Data Resource (list of Fields)
dr <- data %>%
  set_resource(
    name = "data",
    path = "data/data.csv"
  )
# Data Package (list of Resources)
dp <- list(dr) %>%
  set_package(
    name = "data-package"
  )

You can preview the package metadata:

get_package(dp) %>% str()

Write the package to file:

dir <- tempdir()
write_package(dp, path = dir)

And read the package back in:

read_package(dir)

Build a package

In dpkg, the contents of a data package are stored as a list of one or more data resources (each a list) of one or more fields (each typically an atomic vector). For example:

dp <- list(
  dr = data.frame(
    id = 1L,
    value = 1.1,
    added = Sys.Date()
  )
)

Package, resource, and field ("data objects") metadata can be set or updated using the set_* functions (set_package, set_resource, set_field), which come in a <- flavor:

set_field(dp$dr$id) <- field(title = "Unique identifier", constraints = constraints(unique = TRUE))

and a pipe-friendly flavor:

dp$dr$id %<>% set_field(title = "Identifier", constraints = NULL)
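The two flavors correspond to two ordinary R idioms: a replacement function, called as f(x) <- value, and a regular function that returns the modified object, which is what makes it usable in a pipe. A minimal base R sketch of the pattern follows. The name set_title is hypothetical and is not part of dpkg; how dpkg implements its own set_* functions is not shown here.

```r
# Hypothetical helpers for illustration only; not dpkg's implementation.

# Regular flavor: returns a modified copy, so it can be used in a pipe.
set_title <- function(x, title) {
  attr(x, "title") <- title
  x
}

# Replacement flavor: R evaluates `set_title(x) <- v` as
# `x <- "set_title<-"(x, value = v)`; the last argument must be named `value`.
`set_title<-` <- function(x, value) {
  attr(x, "title") <- value
  x
}

id <- 1:3

set_title(id) <- "Unique identifier"   # replacement flavor
first_title <- attr(id, "title")

id <- set_title(id, "Identifier")      # regular flavor
second_title <- attr(id, "title")
```

Both functions can coexist under the same base name because R looks up the replacement form under the separate name `set_title<-`.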

As seen above with the use of field and constraints, a suite of helper functions is available to assist in building metadata:

Data objects: package, resource, field
Meta objects: schema, foreignKey, constraints, license, source, contributor

Data object metadata is stored as attributes. Base R drops attributes in many common operations, so this package protects metadata by making it resilient to [, [[, subset, and append.
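To see why this protection matters, the base R sketch below shows an attribute being dropped by [, followed by one common way to keep it: an S3 method that reattaches the metadata after subsetting. The class name sticky_sketch and the title attribute are assumptions made for illustration; dpkg's actual mechanism may differ.

```r
# Base R: `[` keeps only names, dim and dimnames, so custom attributes vanish.
x <- structure(1:3, title = "Identifier")
plain_subset <- x[1:2]
is.null(attr(plain_subset, "title"))  # TRUE

# Hypothetical class "sticky_sketch" (not dpkg's): reattach the metadata
# after the default subsetting has run, in the style of base R's `[.factor`.
`[.sticky_sketch` <- function(x, ...) {
  out <- NextMethod()
  attr(out, "title") <- attr(x, "title")
  class(out) <- oldClass(x)
  out
}

y <- structure(1:3, title = "Identifier", class = "sticky_sketch")
kept_subset <- y[1:2]
attr(kept_subset, "title")
```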

Preview a package

To preview a package, metadata can be retrieved from data objects using the get_* functions (get_package, get_resource, get_field). Missing properties are filled with their default values:

Fields
- name: The name of the object in a list (resource).
- type: The type corresponding to the object class.
  - character -> "string"
  - numeric -> "number"
  - integer -> "integer"
  - logical -> "boolean"
  - Date -> "date"
  - POSIXt -> "datetime"
  - otherwise -> "string"
- format: The default format for that type.
  - date -> "%Y-%m-%d"
  - datetime -> "%Y-%m-%dT%H:%M:%SZ"
- unit: Units set by the units package, deparsed to product power form.
Resources
- name: The name of the object in a list (package).
- schema$fields: Field metadata from the elements of the object.
Packages
- resources: Resource metadata from the elements of the object.

get_field(dp$dr$id) %>% str()
get_resource(dp$dr) %>% str()
get_package(dp) %>% str()
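The class-to-type defaults listed above can be approximated in a few lines of base R. This is a sketch of the mapping only, not dpkg's code: the function name sketch_field_type is hypothetical, and treating every other classed object (such as a factor) as "string" is an assumption based on the "otherwise" rule.

```r
# Sketch of the class-to-type mapping listed above; dpkg's own code may differ.
sketch_field_type <- function(x) {
  if (inherits(x, "Date")) return("date")
  if (inherits(x, "POSIXt")) return("datetime")
  # Assumption: any other classed object (e.g. a factor) falls under "otherwise".
  if (is.object(x)) return("string")
  if (is.character(x)) return("string")
  # Test integers before is.numeric(), which is also TRUE for integers.
  if (is.integer(x)) return("integer")
  if (is.numeric(x)) return("number")
  if (is.logical(x)) return("boolean")
  "string"
}

data <- data.frame(
  id = 1L,
  value = 1.1,
  added = as.Date("2020-01-01")
)

# One type per column, named by field.
types <- vapply(data, sketch_field_type, character(1))
types
```

The Date and POSIXt checks come first because class-based types should take priority over the storage type of the underlying vector.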

Write a package

write_package writes package data and metadata to disk using the following rules for each resource:

format: If missing, checks path file extension and mediatype. Only "csv" ("text/csv") and "json" ("application/json") are supported.
path: If not set, the data is saved in the metadata (datapackage.json) as either an inline JSON object (format: "json" or missing) or a CSV string (format: "csv"). For writing, path must be a single, local, relative path.
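The two rules above can be read as a small decision procedure. The sketch below is one possible interpretation in base R, not write_package's actual logic: the function name sketch_write_plan is hypothetical, and checking the file extension before the mediatype is an assumption, since the text does not state the order.

```r
# Hypothetical helper; one reading of the rules above, not dpkg's code.
sketch_write_plan <- function(format = NULL, path = NULL, mediatype = NULL) {
  # Rule 1: a missing format is inferred from the path's file extension...
  if (is.null(format) && !is.null(path)) {
    ext <- tolower(tools::file_ext(path))
    if (nzchar(ext)) format <- ext
  }
  # ...and then from the mediatype (assumed order).
  if (is.null(format) && !is.null(mediatype)) {
    format <- switch(mediatype,
      "text/csv" = "csv",
      "application/json" = "json",
      NULL
    )
  }
  # Rule 2: inline data with no format is stored as a JSON object.
  if (is.null(format) && is.null(path)) format <- "json"
  if (is.null(format) || !format %in% c("csv", "json")) {
    stop("Only 'csv' and 'json' are supported")
  }
  list(format = format, where = if (is.null(path)) "inline" else "file")
}

plan_file <- sketch_write_plan(path = "data/data.csv")
plan_inline <- sketch_write_plan()
plan_media <- sketch_write_plan(mediatype = "text/csv")
```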

tmpdir <- file.path(tempdir(), "example")

Resource as an inline JSON object:

set_resource(dp$dr) <- resource(format = "json", path = NULL)
get_resource(dp$dr)$data
write_package(dp, path = tmpdir)
list.files(tmpdir)

unlink(tmpdir, recursive = TRUE)

Resource as an inline CSV string:

set_resource(dp$dr) <- resource(format = "csv", path = NULL)
get_resource(dp$dr)$data
write_package(dp, path = tmpdir)
list.files(tmpdir)

unlink(tmpdir, recursive = TRUE)
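For readers unfamiliar with the idea of an inline CSV string, the base R sketch below serializes a data frame to a single string and parses it back. It only illustrates the concept: the exact string dpkg produces (quoting, line endings, date formatting) may differ.

```r
# Base R sketch (not dpkg's code): serialize a data frame to one CSV string.
df <- data.frame(id = 1L, value = 1.1, added = as.Date("2020-01-01"))

# write.csv() prints to the console when no file is given; capture those lines.
csv_lines <- utils::capture.output(
  utils::write.csv(df, row.names = FALSE, quote = FALSE)
)
csv_string <- paste(csv_lines, collapse = "\n")

# Parse the string back into a data frame.
df_back <- utils::read.csv(text = csv_string)
```

Note that the round trip does not restore column classes such as Date on its own, which is one reason the field metadata (type and format) travels with the data.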

Resource as a JSON file:

set_resource(dp$dr) <- resource(format = "json", path = "data/data.json")
get_resource(dp$dr)$data
write_package(dp, path = tmpdir)
list.files(tmpdir, recursive = TRUE)

unlink(tmpdir, recursive = TRUE)

Resource as a CSV file:

set_resource(dp$dr) <- resource(format = "csv", path = "data/data.csv")
get_resource(dp$dr)$data
write_package(dp, path = tmpdir)
list.files(tmpdir, recursive = TRUE)

unlink(tmpdir, recursive = TRUE)

Read a package

read_package reads package data and metadata into the same structure described above, but unlike write_package, it supports both local and remote paths. The resources argument can be used to read a subset of the package's resources (or all if NULL, the default).

dp <- read_package(
  "https://raw.githubusercontent.com/columbia-glacier/optical-surveys-1985/master",
  resources = c("station", "velocity")
)
get_package(dp) %>% str()
dp$station
head(dp$velocity)

read_package_github accepts a shorthand GitHub repository address.

dp <- read_package_github("columbia-glacier/optical-surveys-1985", "station")
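Comparing this call with the read_package example above suggests that the shorthand expands to a raw.githubusercontent.com URL. The sketch below shows that expansion under stated assumptions: the helper name sketch_github_raw_url is hypothetical, and the "master" branch default is taken from the URL used earlier rather than from read_package_github's documented behavior.

```r
# Hypothetical helper; assumes the "master" branch, as in the URL used above.
sketch_github_raw_url <- function(repo, branch = "master") {
  paste("https://raw.githubusercontent.com", repo, branch, sep = "/")
}

url <- sketch_github_raw_url("columbia-glacier/optical-surveys-1985")

# The package descriptor (datapackage.json) sits at the package root.
descriptor_url <- paste(url, "datapackage.json", sep = "/")
```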

TODO

Fields

Only types string, number, integer, boolean, date, and datetime are implemented (see table-schema/field-descriptors). Add support for the remaining types:

[ ] type = object
[ ] type = array
[ ] type = time (via package hms)
[ ] type = year (already supported via type = date and format = "%Y")
[ ] type = yearmonth (already supported via type = date and format = "%Y-%m")
[ ] type = duration (already supported via type = number and unit)
[ ] type = geopoint
[ ] type = geojson

Additionally:

[ ] Validate field values against constraints property
[ ] For type = string, validate values against format property

Resources & Packages

[ ] Validate packages, resources, and schemas against standard (https://specs.frictionlessdata.io/schemas/registry.json) or custom profiles.
[ ] Read/write resources with a path like "data/data.csv.gz" to/from compressed files
[ ] Write resource schemas with a path to a JSON file
[ ] Read/write GeoJSON and TopoJSON to/from spatial objects
[ ] Validate license name against http://licenses.opendefinition.org/licenses/groups/all.json
[ ] Support reading packages based on their Data Package Identifier