knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
CanonicalForms is an R package for ensuring that data sets conform to an expected format.
You can install the development version of CanonicalForms from GitHub with:
# install.packages("devtools")
devtools::install_github("colin-fraser/CanonicalForms")
A CanonicalForm allows you to check whether a dataset conforms to an expected format. The following code creates a simple CanonicalForm object for the dataset cars, as well as a pair of copies of the dataset, one that conforms to the expected format and one that does not.
library(CanonicalForms)

cf <- canonical_form(
  object_class = "data.frame",
  col_names = c("speed", "dist"),
  col_classes = c("numeric", "numeric")
)

passing <- cars
failing <- cars |> setNames(c("speed", "distance"))

passing |> is_canonical(cf) # checking whether `passing` corresponds to the form specified in cf
failing |> is_canonical(cf)
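To make the three conditions in that form concrete, here is a rough base-R sketch of what they amount to for cars. It is only an illustration and not how the package implements its checks; the helper `conforms` and the `expected_*` variables are hypothetical names introduced for this example.

```r
# Base-R sketch of the three conditions; NOT the package's implementation.
expected_class <- "data.frame"
expected_names <- c("speed", "dist")
expected_col_classes <- c("numeric", "numeric")

# Hypothetical helper: TRUE only if class, column names, and column classes all match.
conforms <- function(x) {
  col_classes <- unname(vapply(x, function(col) class(col)[1], character(1)))
  inherits(x, expected_class) &&
    identical(names(x), expected_names) &&
    identical(col_classes, expected_col_classes)
}

passing <- cars
failing <- setNames(cars, c("speed", "distance"))

conforms(passing) # TRUE
conforms(failing) # FALSE: the second column name differs
```

A CanonicalForm bundles conditions like these into a single reusable object, so they do not have to be rewritten for every dataset.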
CanonicalForm
Typing out the full canonical schema as shown above can be tedious, especially for datasets with many columns. For this reason, there is an extract_canonical_form function, which uses a dataset as a template to create a CanonicalForm object, as well as a to_r_code method that writes the boilerplate code for initializing a new CanonicalForm.
# the starwars dataset is a tibble with 14 columns
starwars <- dplyr::starwars
head(starwars)

# this uses the starwars dataset as a template to extract a CanonicalForm
cf <- extract_canonical_form(starwars)
to_r_code(cf) # and this writes the boilerplate R code to construct that form
Suppose I have a pipeline that applies the following transformations to the starwars dataset.
library(dplyr)

starwars_small <- starwars |>
  transmute(name, height, mass = as.integer(mass)) |>
  rename_with(toupper)

swcf <- extract_canonical_form(starwars_small)
Now I have another script where I'm trying to perform the same transformations. I can add a call to check_canonical at the end of the transformations to make sure that the pipeline does what I expect. check_canonical returns its input, but raises a warning if the checks fail.
starwars_small_2 <- starwars |>
  select(name, height, mass) |>
  check_canonical(swcf)
You can also set it to raise an exception rather than a warning.
starwars_small_2 <- starwars |>
  select(NAME = name, HEIGHT = height, MASS = mass) |>
  check_canonical(swcf, behavior = 'stop')
If the pipeline returns the expected format, nothing visible will happen.
starwars_small_2 <- starwars |>
  transmute(NAME = name, HEIGHT = height, MASS = as.integer(mass)) |>
  check_canonical(swcf, behavior = 'stop')

head(starwars_small_2)
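The reason a check like this can sit at the end of a pipeline is that it hands its input back unchanged and only signals when something is wrong. The base-R sketch below illustrates that pattern for column names alone. It is an assumption-laden illustration rather than the package's code: `check_names` is a hypothetical helper, and apart from 'stop', which appears above, the "warn" value for `behavior` is made up for this example.

```r
# Hypothetical pipeline-friendly check; NOT part of CanonicalForms.
# Returns `x` unchanged and warns (or stops) if the column names differ.
check_names <- function(x, expected, behavior = c("warn", "stop")) {
  behavior <- match.arg(behavior)
  if (!identical(names(x), expected)) {
    msg <- sprintf(
      "expected columns [%s] but found [%s]",
      paste(expected, collapse = ", "),
      paste(names(x), collapse = ", ")
    )
    if (behavior == "stop") stop(msg, call. = FALSE) else warning(msg, call. = FALSE)
  }
  x
}

# Names match, so nothing is signalled and the data passes straight through.
result <- cars |> check_names(c("speed", "dist"))

# Names differ; with behavior = "stop" the pipeline halts with an error,
# which is captured here so that the script keeps running.
outcome <- tryCatch(
  cars |> check_names(c("speed", "distance"), behavior = "stop"),
  error = function(e) conditionMessage(e)
)
```

Because the input is returned as-is, the check can be added to or removed from a pipeline without changing what the pipeline produces.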
By default, a newly created CanonicalForm object has three checks: that the class of the dataset matches, that the column names match, and that the column classes match. The package also provides other checks that can be run with is_canonical and check_canonical, and it's easy to write custom checks and add those as well.
# passing example
swcf2 <- swcf |>
  add_checks(
    # check that no NAME values are NA:
    no_nas = check_no_nas(cols = c('NAME')),
    # check that HEIGHT and MASS are greater than 0:
    positive_values = check_greater_than(HEIGHT = 0, MASS = 0)
  )

starwars_small |> check_canonical(swcf2)

# failing example
swcf3 <- swcf |>
  add_checks(
    no_nas = check_no_nas(c("NAME", "HEIGHT", "MASS")),
    min_values = check_greater_than(HEIGHT = 1000, MASS = 0)
  )

starwars_small |> check_canonical(swcf3)