In ijlyttle/steward: Helps you manage metadata for published datasets

knitr::opts_chunk$set(
  collapse = FALSE,
  comment = ""
)

# https://community.rstudio.com/t/showing-only-the-first-few-lines-of-the-results-of-a-code-chunk/6963/2
#
hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(
  output = function(x, options) {
    lines <- options$output.lines
    if (is.null(lines)) {
      return(hook_output(x, options))  # pass to default hook
    }
    x <- unlist(strsplit(x, "\n"))
    more <- "..."
    if (length(lines)==1) {        # first n lines
      if (length(x) > lines) {
        # truncate the output, but add ....
        x <- c(head(x, lines), more)
      }
    } else {
      x <- c(more, x[lines], more)
    }
    # paste these lines together
    x <- paste(c(x, ""), collapse = "\n")
    hook_output(x, options)
  }
)

library("steward")

The goal of this package is to make it a little easier to build and publish data dictionaries, in particular:

to include them in package documentation.
to produce nicely formatted data dictionaries in R Markdown documents.

In this article, we look at a few tasks that we think are common:

create a dataset
write a dataset to package documentation
write a dataset to a gt table

Create

For this set of examples, we use the ggplot2 diamonds dataset. We will create an stw_dataset, which is a data frame with an extra class attached to help manage metadata. An stw_dataset is a data frame, in the same way that a tibble is a data frame.
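To illustrate what "a data frame with an extra class" means, here is a minimal base-R sketch. It does not use steward, and steward's internals may differ; the class name `my_dataset` is hypothetical and chosen only for this example.

```r
# Base-R sketch: put an extra class in front of "data.frame".
# "my_dataset" is a hypothetical class name, not one that steward defines.
df <- data.frame(carat = c(0.23, 0.21), price = c(326L, 326L))

extended <- structure(df, class = c("my_dataset", class(df)))

class(extended)                    # the extra class comes first
inherits(extended, "data.frame")   # still a data frame
nrow(extended)                     # data-frame functions keep working
```

Because `"data.frame"` remains in the class vector, functions written for data frames continue to dispatch on the object; this is the same mechanism that lets a tibble behave as a data frame.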

In the code that follows we have three steps:

Create an stw_dataset from a data frame using stw_dataset().
Add the top-level metadata using stw_mutate_meta().
Add the column-level metadata using stw_mutate_dict().

The functions stw_mutate_meta() and stw_mutate_dict() work a little bit like dplyr::mutate(). However, instead of mutating the columns of a data frame, stw_mutate_meta() mutates the top-level metadata, and stw_mutate_dict() mutates the descriptions of columns in a data frame:

diamonds_new <-
  stw_dataset(ggplot2::diamonds) %>%
  stw_mutate_meta(
    name = "diamonds",
    title = "Prices of over 50,000 round cut diamonds",
    description = "A dataset containing the prices and other attributes of almost 54,000 diamonds.",
    sources = list(
      list(
        title = "DiamondSearchEngine",
        path = "http://www.diamondse.info/"
      )
    )
  ) %>%
  stw_mutate_dict(
    price = "price in US dollars ($326--$18,823)",
    carat = "weight of diamond (0.2--5.01)",
    cut = "quality of the cut (Fair, Good, Very Good, Premium, Ideal)",
    color = "diamond color, from D (best) to J (worst)",
    clarity = "a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))",
    x = "length in mm (0--10.74)",
    y = "width in mm (0--58.9)",
    z = "depth in mm (0--31.8)",
    depth = "total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)",
    table = "width of top of diamond relative to widest point (43--95)"
  )
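The mutate-like behavior described above can be sketched in base R. This is only an illustration of the idea, not steward's implementation: the helper `mutate_meta_sketch()` and the attribute name `meta_sketch` are both hypothetical.

```r
# Hypothetical helper, not part of steward: update a metadata list stored
# in an attribute, leaving the columns of the data frame untouched.
mutate_meta_sketch <- function(data, ...) {
  current <- attr(data, "meta_sketch")
  if (is.null(current)) current <- list()
  # Named arguments overwrite existing fields or add new ones.
  attr(data, "meta_sketch") <- utils::modifyList(current, list(...))
  data
}

df <- data.frame(carat = c(0.23, 0.21), price = c(326L, 326L))

df <- mutate_meta_sketch(df, name = "diamonds", title = "First title")
df <- mutate_meta_sketch(df, title = "Revised title")  # changes one field only

attr(df, "meta_sketch")
names(df)  # the columns are unchanged
```

As with dplyr::mutate(), naming an existing field replaces it and naming a new field adds it; everything else is carried along as it was.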

We also offer a function, stw_check(), to make some basic checks on the completeness of the metadata:

stw_check(diamonds_new, verbosity = "all")
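To give a sense of what a completeness check might look for, here is a small base-R sketch that compares column names against a set of descriptions. It is an assumption about the kind of check involved, not a description of what stw_check() does; the variable names are hypothetical.

```r
# Hypothetical sketch, independent of steward: find columns that have no
# description, and descriptions that match no column.
df <- data.frame(carat = 0.23, cut = "Ideal", price = 326L)
descriptions <- c(
  carat = "weight of diamond",
  price = "price in US dollars"
)

missing_desc <- setdiff(names(df), names(descriptions))
extra_desc <- setdiff(names(descriptions), names(df))

if (length(missing_desc) > 0) {
  message("Columns without a description: ", paste(missing_desc, collapse = ", "))
}
if (length(extra_desc) > 0) {
  message("Descriptions without a column: ", paste(extra_desc, collapse = ", "))
}
```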

Write to package

If you are including a dataset in a package, it is good practice to document it; in fact, CRAN insists!

You can use an stw_dataset to keep your documentation associated with your dataset as you build it for your package. Keeping the data and its documentation together helps ensure that the documentation stays complete and current.
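For context, a dataset in a package is conventionally documented with a roxygen2 block placed above a string that names the dataset. The base-R sketch below builds such a block from a named vector of column descriptions. The helper `roxygen_sketch()` is hypothetical, and its output is not claimed to match what steward writes.

```r
# Hypothetical helper, not part of steward: build roxygen2 comment lines for
# a dataset from its name, title, and a named vector of column descriptions.
roxygen_sketch <- function(name, title, descriptions) {
  items <- sprintf("#'   \\item{%s}{%s}", names(descriptions), descriptions)
  c(
    paste("#'", title),
    "#'",
    "#' @format A data frame with the following columns:",
    "#' \\describe{",
    items,
    "#' }",
    sprintf("\"%s\"", name)  # roxygen2 documents the string naming the dataset
  )
}

lines <- roxygen_sketch(
  name = "diamonds",
  title = "Prices of round cut diamonds",
  descriptions = c(
    carat = "weight of diamond",
    cut = "quality of the cut"
  )
)
writeLines(lines)
```

Generating these lines from the same object that holds the data is what keeps the two in step: a column added to the data without a description becomes visible at the point where the documentation is written.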

The function stw_use_data() wraps usethis::use_data() and steward::stw_write_roxygen(). For example, by running: