knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(0x7bae)
blueprintr
is a framework for managing your data assets in a reproducible fashion. While it uses targets or drake, it adds automated steps for tabular dataset documentation and testing. This allows researchers to create a replicable framework to prevent programming issues from affecting analysis results.
# install.packages("blueprintr", repos = "https://nyuglobalties.r-universe.dev") library(blueprintr)
blueprintr
provides your data with guardrails typically found in software engineering workflows.
This allows you to test and document before deploying to production.
The top level of the blueprintr
workflow is a "blueprints" directory, consisting of .R
and .csv
files.
Each blueprint has two components to it:
.R
file that instructs drake or targets on how to build a specific dataset. .csv
file that incorporates any mapping files and checks that need to be done on the dataset. In order to create a blueprint, we use the blueprint
function. This function takes three arguments: name (the name of your generated dataset), description (a description of your dataset), command (any functions that need to be applied in order to build the dataset).
A project may need only a few blueprints, but more likely you'll need nested blueprints to transform the data.
blueprintr
generates six "steps" (targets) per blueprint:
Target name | Description
------------------------|--------------
{blueprint}_initial
| The result of running the blueprint's command
{blueprint}_blueprint
| A copy of the blueprint to be used throughout the plan
{blueprint}_meta
| A copy of the dataset metadata --- if the metadata file doesn't exist, it will be created in this step
{blueprint}_meta_path
| Creates the metadata file or loads it
{blueprint}_checks
| Runs all tests on the {blueprint}_initial
target
{blueprint}
| The built dataset after running some cleanup tasks
Let's take a well known dataset -- mtcars
, and create a blueprint for it.
# Keeping the row names under the column `rn` our_mtcars <- mtcars |> tidytable::as_tidytable(rownames = "rn") # Inspecting our mtcars dataset head(our_mtcars)
When we ingest data from various sources, it's usually helpful to outline the expected metadata for the sources. At TIES, we document this metadata in a user-created "mapping file." This mapping file acts as a map for any variable name changes, as well as categorical variable coding changes.
mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE) # Read this csv file: item_mapping <- mapping_file |> readr::read_csv( col_types = readr::cols( name_1 = readr::col_character(), description_1 = readr::col_character(), coding_1 = readr::col_character(), panel = readr::col_character(), homogenized_name = readr::col_character(), homogenized_coding = readr::col_character(), homogenized_description = readr::col_character() ) ) item_mapping
Then, we typically use a tool such as panelcleaner
to attach our mapping file to the mtcars
database.
This is a command executed in the dataset construction spec.
blueprint( "mt_cars", description = "mtcars database with attached metadata", annotate = TRUE, command = { pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |> panelcleaner::add_mapping(item_mapping) |> panelcleaner::homogenize_panel() |> panelcleaner::bind_waves() |> as.data.frame() pnl_name <- get_attr(pnl, "panel_name") pnl_mapping <- get_attr(pnl, "mapping") pnl <- pnl class(pnl) <- c("mapped_df", class(pnl)) set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name) } ) |> bp_include_panelcleaner_meta()
Save this script with a filename of your choice inside of the "blueprints" directory of your project. We'll assume you are using targets for your project:
./ _targets.R blueprints/ ... all blueprint R and CSV files go here ... R/ ... all associated R function definitions are here ... project.Rproj ...
When running this code with either targets
or drake
, the blueprint metadata is automatically created.
For our mtcars example, this looks like:
mtcars_metadata <- system.file("project/blueprints/example/homogenized.csv", package = "blueprintr", mustWork = TRUE) # Read this csv file: mtcars_metadata |> readr::read_csv()
Manually editing the metadata allows the user to add tests to check the data type and values.
The last step of our work is to load this blueprint into either targets or drake. For this example, we'll use targets as drake is deprecated. A full discussion of targets is beyond the scope of this vignette, but you can find an excellent walkthrough here. The only detail that is needed is to add blueprintr::tar_blueprints()
to your _targets.R
file:
# _targets.R library(targets) # ... list( tar_target( item_mapping, readr::read_csv("where/your/mapping/file/is/stored.csv") ), blueprintr::tar_blueprints(), # Other targets for your project! )
This will load all blueprints in the "blueprints" directory. If you have a nested directory structure, use blueprintr::tar_blueprints(recurse = TRUE)
.
And there you have it! You have created your first blueprint on the mtcars
dataset.
When running a pipeline with blueprintr
, the checks allow researchers to be warned of any issues at an early stage, allowing them to produce replicable results.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.