knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

set.seed(0x7bae)

blueprintr is a framework for managing your data assets in a reproducible fashion. While it uses targets or drake, it adds automated steps for tabular dataset documentation and testing. This allows researchers to create a replicable framework to prevent programming issues from affecting analysis results.

Installation

# install.packages("blueprintr", repos = "https://nyuglobalties.r-universe.dev")

library(blueprintr)

Designed Use of blueprintr

blueprintr provides your data with guardrails typically found in software engineering workflows. This allows you to test and document before deploying to production.

The top level of the blueprintr workflow is a "blueprints" directory, consisting of .R and .csv files.

About blueprints

Each blueprint has two components to it:

In order to create a blueprint, we use the blueprint function. This function takes three arguments: name (the name of your generated dataset), description (a description of your dataset), command (any functions that need to be applied in order to build the dataset).

A project may need only a few blueprints, but more likely you'll need nested blueprints to transform the data.

blueprintr generates six "steps" (targets) per blueprint:

Target name | Description ------------------------|-------------- {blueprint}_initial | The result of running the blueprint's command {blueprint}_blueprint | A copy of the blueprint to be used throughout the plan {blueprint}_meta | A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step {blueprint}_meta_path | Creates the metadata file or loads it {blueprint}_checks | Runs all tests on the {blueprint}_initial target {blueprint} | The built dataset after running some cleanup tasks

When writing other steps in your workflow (be it targets or drake), it is advised to **not** refer to the `{blueprint}_initial` step since it could have problems which are discovered in the `{blueprint}_checks` step.

Example

Let's take a well known dataset -- mtcars, and create a blueprint for it.

# Keeping the row names under the column `rn`
our_mtcars <- mtcars |> tidytable::as_tidytable(rownames = "rn")

# Inspecting our mtcars dataset
head(our_mtcars)

When we ingest data from various sources, it's usually helpful to outline the expected metadata for the sources. At TIES, we document this metadata in a user-created "mapping file." This mapping file acts as a map for any variable name changes, as well as categorical variable coding changes.

mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE)

# Read this csv file:
item_mapping <- mapping_file |>
  readr::read_csv(
    col_types = readr::cols(
      name_1 = readr::col_character(),
      description_1 = readr::col_character(),
      coding_1 = readr::col_character(),
      panel = readr::col_character(),
      homogenized_name = readr::col_character(),
      homogenized_coding = readr::col_character(),
      homogenized_description = readr::col_character()
    )
  )
item_mapping

Then, we typically use a tool such as panelcleaner to attach our mapping file to the mtcars database. This is a command executed in the dataset construction spec.

blueprint(
  "mt_cars",
  description = "mtcars database with attached metadata",
  annotate = TRUE,
  command = {
    pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |>
      panelcleaner::add_mapping(item_mapping) |>
      panelcleaner::homogenize_panel() |>
      panelcleaner::bind_waves() |>
      as.data.frame()

    pnl_name <- get_attr(pnl, "panel_name")
    pnl_mapping <- get_attr(pnl, "mapping")

    pnl <-
      pnl

    class(pnl) <- c("mapped_df", class(pnl))
    set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
  }
) |>
  bp_include_panelcleaner_meta()

Save this script with a filename of your choice inside of the "blueprints" directory of your project. We'll assume you are using targets for your project:

./
  _targets.R
  blueprints/
    ... all blueprint R and CSV files go here ...
  R/
    ... all associated R function definitions are here ...
  project.Rproj
  ...
It is not required to use panelcleaner or even document the source metadata. This is just a convention we at TIES developed. However, we strongly advise doing something similar to track your data sources over time.

When running this code with either targets or drake, the blueprint metadata is automatically created. For our mtcars example, this looks like:

mtcars_metadata <- system.file("project/blueprints/example/homogenized.csv", package = "blueprintr", mustWork = TRUE)
# Read this csv file:
mtcars_metadata |>
  readr::read_csv()

Manually editing the metadata allows the user to add tests to check the data type and values.

The last step of our work is to load this blueprint into either targets or drake. For this example, we'll use targets as drake is deprecated. A full discussion of targets is beyond the scope of this vignette, but you can find an excellent walkthrough here. The only detail that is needed is to add blueprintr::tar_blueprints() to your _targets.R file:

# _targets.R
library(targets)

# ...

list(
  tar_target(
    item_mapping,
    readr::read_csv("where/your/mapping/file/is/stored.csv")
  ),

  blueprintr::tar_blueprints(),

  # Other targets for your project!
)

This will load all blueprints in the "blueprints" directory. If you have a nested directory structure, use blueprintr::tar_blueprints(recurse = TRUE).

And there you have it! You have created your first blueprint on the mtcars dataset. When running a pipeline with blueprintr, the checks allow researchers to be warned of any issues at an early stage, allowing them to produce replicable results.



nyuglobalties/blueprintr documentation built on July 16, 2024, 10:27 a.m.