knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(cli.width = 80)
library("steward")
library("conflicted")
library("dplyr")

Let's say you have two data frames, and you would like to know how compatible they are with each other. This problem had been floating around my head when I saw this tweet from Sharla Gelfand:

anyone aware of an #rstats package that will compare two data frames' names/types and output nice, descriptive errors if they don't match? do i have to build this myself? - @sharlagelfand

Sharla followed this up with a comprehensive blog post where she details what she wants to do, then shows how existing solutions perform. Happily, Sharla's view of the problem matches perfectly my view of the problem. What's more, Sharla has specified what she wants in a solution, with reprexes! As "calls to action" go, what's not to like?!

Here's my proposed solution to the problem, an experimental function called col_spec_compare() - it's called this because it's based on the implementation of column-specifications in readr and vroom.

NOTE: As of Feb. 2020, this is an experimental function in a maturing package which is not yet on CRAN. I am writing this up to get feedback on what such a function should do, rather than as an annoucement of a solution you should feel confident to use in your day-to-day work.

Following Sharla's lead, when we compare a data frame with itself, we get nothing but success:

col_spec_compare(mtcars, mtcars)

This function has two arguments x and y; these names are used to identify the data frames when there are differences. For example, what if there is a missing column?

mtcars_missing_cols <-
  mtcars %>%
  select(-mpg)

col_spec_compare(mtcars_missing_cols, mtcars)

You can see that the "missing" column is identified by name, mpg, then described by a column specification, col_double(). We see a second piece of information that tells us that for column names that are common to x and y, the specifications are identical.

The column specifications are defined in readr (and vroom). If fact, we are using a function in the dev version of readr to determine the column specifications for each of x and y (building on what the Tidyverse team have already done):

readr::as.col_spec(mtcars)

Resuming the examples, what if there is an extra column?

mtcars_extra_cols <-
  mtcars %>%
  mutate(mpgo = mpg)

col_spec_compare(mtcars_extra_cols, mtcars)

At this point it may be useful to "peek under the hood" a little bit (this article has a larger section on this). The function col_spec_compare() returns an object with S3 class col_spec_diff, which is just a way of saying "it's a list() with extra functionality." We can use the str() function to have a closer look:

diff_extra_cols <- col_spec_compare(mtcars_extra_cols, mtcars)

str(diff_extra_cols)

This list describes the match between the columns of data frame x and data frame y. It has a print method that provides (hopefully) user-friendly and informative feedback.

print(diff_extra_cols)

The observant reader will have noticed that col_spec_compare() gives us feedback, but it does not act on any of its information. It does not throw errors or issue warnings. This is because I have no idea what an acceptable match means to you in your situation. Maybe it's OK that there's an extra column, maybe not? I'll talk about this more later in this article.

What if we have both a missing column and and extra column?

mtcars_missing_extra_cols <-
  mtcars %>%
  select(-cyl) %>%
  mutate(mpgo = mpg)

col_spec_compare(mtcars_missing_extra_cols, mtcars)

Let's say that the columns are ordered differently:

mtcars_diff_order <-
  mtcars %>%
  select(cyl, everything())

col_spec_compare(mtcars_diff_order, mtcars)

Or if a column has the same name, but has a different class:

mtcars_wrong_class <-
  mtcars %>%
  mutate(mpg = as.character(mpg))

col_spec_compare(mtcars_wrong_class, mtcars)

The feedback here focuses on those columns names that are common to both data frames, but have different specifications.

We have a couple more examples - first, Sharla has a "throw the kitchen sink at it" problem:

mtcars_missing_extra_cols_wrong_class <-
  mtcars %>%
  select(-cyl) %>%
  mutate(mpgo = mpg) %>%
  mutate(mpg = as.character(mpg))

col_spec_compare(mtcars_missing_extra_cols_wrong_class, mtcars)

I offer a final example myself, in the spirit of something I am likely to do:

col_spec_compare(mtcars, cars)

Under the hood

This function takes advantage of the column-specifications already defined in readr and vroom. This introduces a limitation, we can consider a column only if it is one of:

This covers quite a lot, so I don't think this is a huge limitation.

As discussed above, the col_spec_diff class is a list. Its values are either logicals or sets of column-specifications:

This function is separate from the rest of the infrastructure in the steward package. It is much closer to the infrastructure in readr and vroom; these packages are designed to read data, rather than to compare it. That being said, they both use the cols() infrastructure, and as best as I can tell, it is implemented independently in both packages.

Far be it from me to suggest anything to the Tidyverse team, but there seems to be an opportunity for another r-lib package, perhaps named colspec, that both readr and vroom could use. If such a package were to come to pass, perhaps this functionality could be part of such a package.

Extension

As discussed earlier in this article, the function col_spec_compare() does not take any action, it simply returns information on the comparison of two data frames.

However you might have a situation where you build a data frame, validate it against a reference (stopping if need be), then return your data-frame. Something like this:

mtcars_new <-
  mtcars_building_function() %>%
  validate_col_spec(mtcars) %>%
  do_more_stuff()

Here's a first idea on what validate_col_spec() might look like:

validate_col_spec(x, y, predicate = NULL) {

  is_identical <- function(diff) {
    diff$identical
  }

  predicate <- predicate %||% is_identical

  diff <- col_spec_diff(x, y)

  if (!predicate(diff)) {
    print(diff)
    stop("validation failed", call. = FALSE)
  }

  x
}

For this function, you could supply a predicate function, which would take a single argument, a col_spec_diff object. It would return a logical that would indicate if execution should proceed. If the predicate is not satisfied, the diff object would be printed and execution stopped.

We could also imagine specifying that a predicate function would return a list containing the logical result, and an error message to pass to stop().

In this scenario, "we" would provide the validate_col_spec() function; "you" would provide the predicate function.



ijlyttle/steward documentation built on Jan. 5, 2021, 2:25 p.m.