The comparedf function

knitr::opts_chunk$set(eval = TRUE, message = FALSE, results = 'asis', comment='')
options(width = 120)

Introduction

The comparedf() function can be used to determine and report differences between two data.frames. It was written in the spirit of replacing PROC COMPARE from SAS.

library(arsenal)

Why "comparedf"? We originally called this function compare.data.frame(), using testthat::compare() as our S3 generic, but that ended up getting us in trouble because of conflicting object structures. Why this didn't occur to us at the time remains a mystery. To replace it, we brainstormed several ideas (comparedf(), dfcompare(), collate(), comparison()) but settled on the first for three reasons:

  1. There were no other objects with that generic or class (see testthat::compare() and compare::compare()).

  2. It is mnemonically easy to remember (we "compare data.frames", not "data.frames compare").

  3. It tab auto-completes from the original "compare".

Basic examples

We first build two similar data.frames to compare.

df1 <- data.frame(id = paste0("person", 1:3),
                  a = c("a", "b", "c"),
                  b = c(1, 3, 4),
                  c = c("f", "e", "d"),
                  row.names = paste0("rn", 1:3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = paste0("person", 3:1),
                  a = c("c", "b", "a"),
                  b = c(1, 3, 4),
                  d = paste0("rn", 1:3),
                  row.names = paste0("rn", c(1,3,2)),
                  stringsAsFactors = FALSE)

To compare these datasets, simply pass them to the comparedf() function:

comparedf(df1, df2)

Use summary() to get a more detailed report of the differences:

summary(comparedf(df1, df2))

By default, the datasets are compared row-by-row. To change this, use the by= or by.x= and by.y= arguments:

summary(comparedf(df1, df2, by = "id"))
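The difference between row-by-row and keyed comparison can be illustrated without arsenal at all. The following is a minimal base R sketch, not how comparedf() is implemented; merge() is used only as an illustration, and x, y, and m are names made up for this example. It rebuilds the id and a columns from above and pairs them both ways:

```r
# Rebuild just the id and a columns of df1 and df2 from above
x <- data.frame(id = paste0("person", 1:3), a = c("a", "b", "c"),
                stringsAsFactors = FALSE)
y <- data.frame(id = paste0("person", 3:1), a = c("c", "b", "a"),
                stringsAsFactors = FALSE)

# Row-by-row: row i of x is paired with row i of y
by_row <- x$a == y$a

# Keyed: rows are paired on matching id values first
m <- merge(x, y, by = "id", suffixes = c(".x", ".y"))
by_id <- m$a.x == m$a.y

print(by_row)  # FALSE TRUE FALSE
print(by_id)   # TRUE TRUE TRUE
```

Paired by position, only the middle row agrees; paired by id, every value of a agrees.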

A larger example

Let's muck up the mockstudy data.

data(mockstudy)
mockstudy2 <- muck_up_mockstudy()

We've changed row order, so let's compare by the case ID:

summary(comparedf(mockstudy, mockstudy2, by = "case"))

Column name comparison options

It is possible to change which column names are considered "the same variable".

Ignoring case

For example, to ignore case in variable names (so that Arm and arm are considered the same), pass tol.vars = "case".

You can do this using comparedf.control():

summary(comparedf(mockstudy, mockstudy2, by = "case", control = comparedf.control(tol.vars = "case")))

or pass it through the ... arguments.

summary(comparedf(mockstudy, mockstudy2, by = "case", tol.vars = "case"))

Treating dots and underscores the same (equivalence classes)

It is possible to treat certain characters or sets of characters as the same by passing a character vector of equivalence classes to the tol.vars= argument.

In short, each string in the vector is split into single characters, and every character in the resulting set is replaced by the first character in the string. For example, passing c("._") would replace all underscores with dots in the column names of both datasets. Similarly, passing c("aA", "BbCc") would replace all instances of "A" with "a" and all instances of "b", "C", or "c" with "B". This is one way to ignore case for certain letters. Alternatively, you can combine the equivalence classes with ignoring case by passing (e.g.) c("._", "case").

Passing a single character as an element of this vector will replace that character with the empty string. For example, passing c(" ", ".") would remove all spaces and dots from the column names.
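The renaming rule described above can be sketched in base R. The helper apply_equiv_classes() below is hypothetical and is not part of arsenal; it only mimics the documented effect on a vector of names, and it does not handle the special "case" keyword.

```r
# Hypothetical helper (not part of arsenal): mimic the renaming described above
apply_equiv_classes <- function(nms, classes) {
  for (cl in classes) {
    chars <- strsplit(cl, "", fixed = TRUE)[[1]]
    # With one character, the replacement is the empty string;
    # otherwise every other character maps to the first one in the class
    replacement <- if (length(chars) == 1) "" else chars[1]
    targets <- if (length(chars) == 1) chars else chars[-1]
    for (ch in targets) {
      nms <- gsub(ch, replacement, nms, fixed = TRUE)
    }
  }
  nms
}

dots  <- apply_equiv_classes(c("fu_time", "fu time", "fu.time"), "._ ")
mixed <- apply_equiv_classes("AbC", c("aA", "BbCc"))
strip <- apply_equiv_classes("fu. time", c(" ", "."))

print(dots)   # "fu.time" "fu.time" "fu.time"
print(mixed)  # "aBB"
print(strip)  # "futime"
```

After this kind of normalization, names that differ only within an equivalence class become identical and can be matched as the same variable.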

For mockstudy, let's treat dots, underscores, and spaces as the same, and ignore case:

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case") # dots=underscores=spaces, ignore case
))

Manually specifying columns to match together

If you pass a named vector to the tol.vars= argument, comparedf() will line up the names of that vector to the column names of x and the values of that vector to the column names of y. In this way, you can manually specify which non-identically-named columns to compare.

For mockstudy, let's specify our variables manually in this way:

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c(arm = "Arm", fu.stat = "fu stat", fu.time = "fu_time")
))
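To make the direction of the mapping concrete (names refer to x, values refer to y), here is a small base R sketch on toy data. The data frames and the vapply() loop are invented for illustration and are not how comparedf() works internally.

```r
# Toy data (made up for illustration; not the mockstudy data)
x <- data.frame(arm = c("A", "B"), fu.time = c(10, 20))
y <- data.frame(Arm = c("A", "B"), fu_time = c(10, 25))

# names = columns of x, values = columns of y
map <- c(arm = "Arm", fu.time = "fu_time")

# Compare each mapped pair of columns
same <- vapply(names(map), function(nm) {
  isTRUE(all.equal(x[[nm]], y[[map[[nm]]]]))
}, logical(1))

print(same)
```

Here arm and Arm agree, while fu.time and fu_time differ in the second row.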

Column comparison options

Logical tolerance

Use the tol.logical= argument to change how logicals are compared. By default, they're expected to be equal to each other.

Numeric tolerance

To allow numeric differences of a certain tolerance, use the tol.num= and tol.num.val= options. tol.num.val= determines the maximum (unsigned) difference tolerated if tol.num="absolute" (default), and determines the maximum (unsigned) percent difference tolerated if tol.num="percent".
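The two modes can give different answers on the same data. The sketch below uses hypothetical stand-in functions, diff_absolute() and diff_percent(); their names, the choice of x as the denominator, and expressing the percent tolerance as a fraction are all assumptions made for illustration, not arsenal's definitions.

```r
# Hypothetical stand-ins for the two modes (names, the denominator, and the
# fractional scale are assumptions, not arsenal's definitions)
diff_absolute <- function(x, y, tol) abs(x - y) > tol
diff_percent  <- function(x, y, tol) abs((x - y) / x) > tol

x <- c(100, 1000, 50)
y <- c(115, 1020, 50)

abs_flags <- diff_absolute(x, y, tol = 10)    # differences of 15, 20, 0
pct_flags <- diff_percent(x, y, tol = 0.10)   # relative differences of 0.15, 0.02, 0

print(abs_flags)  # TRUE TRUE FALSE
print(pct_flags)  # TRUE FALSE FALSE
```

The second pair exceeds an absolute tolerance of 10 but is well within a 10% relative tolerance, which is why the choice of tol.num= matters for variables on very different scales.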

Also note the option int.as.num=, which determines whether integers and numerics should be compared despite their class difference. If TRUE, the integers are coerced to numeric. Note that mockstudy$ast is integer, while mockstudy2$ast is numeric:

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE            # compare integers and numerics
))
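The class difference that int.as.num= addresses is easy to see in base R. This short sketch uses made-up vectors, not the mockstudy data:

```r
x <- 1:3          # integer
y <- c(1, 2, 3)   # numeric (double)

classes      <- c(class(x), class(y))
same_as_is   <- identical(x, y)              # the types differ
same_coerced <- identical(as.numeric(x), y)  # equal once coerced

print(classes)       # "integer" "numeric"
print(same_as_is)    # FALSE
print(same_coerced)  # TRUE
```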

Suppose a tolerance of up to 10 is allowed for ast:

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10             # allow absolute differences <= 10
))

Factor tolerance

By default, factors are compared to each other based on both the labels and the underlying numeric levels. Set tol.factor="levels" to match only the numeric levels, or set tol.factor="labels" to match only the labels.

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels"        # match only factor labels
))
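The distinction between labels and underlying numeric levels can be seen with two factors that hold the same values but were built with different level orders. This base R sketch uses made-up factors and only illustrates the two notions of equality; it does not call arsenal.

```r
f1 <- factor(c("low", "high"), levels = c("low", "high"))
f2 <- factor(c("low", "high"), levels = c("high", "low"))

# Same labels element by element...
labels_differ <- as.character(f1) != as.character(f2)
# ...but different underlying integer codes
codes_differ <- as.integer(f1) != as.integer(f2)

print(labels_differ)  # FALSE FALSE
print(codes_differ)   # TRUE TRUE
```

A comparison on labels alone would treat f1 and f2 as equal, while a comparison that also looks at the numeric levels would not.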

Also note the option factor.as.char=, which determines whether factors and characters should be compared despite their class difference. If TRUE, the factors are coerced to characters. Note that mockstudy$race is a character, while mockstudy2$race is a factor:

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels",       # match only factor labels
                factor.as.char = TRUE        # compare factors and characters
))

Character tolerance

Use the tol.char= argument to change how character variables are compared. By default, they are compared as-is, but they can be compared after ignoring case or trimming whitespace or both.

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels",       # match only factor labels
                factor.as.char = TRUE,       # compare factors and characters
                tol.char = "case"            # ignore case in character vectors
))
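The effect of ignoring case, trimming whitespace, or both can be sketched with base R string functions. The vectors below are made up, and tolower() and trimws() are used here only to illustrate the idea, not as a statement of how arsenal implements these options.

```r
x <- c("Yes", " no ", "Maybe")
y <- c("yes", "no", "maybe ")

as_is       <- x != y
ignore_case <- tolower(x) != tolower(y)
trimmed     <- trimws(x) != trimws(y)
both        <- tolower(trimws(x)) != tolower(trimws(y))

print(as_is)        # TRUE TRUE TRUE
print(ignore_case)  # FALSE TRUE TRUE
print(trimmed)      # TRUE FALSE TRUE
print(both)         # FALSE FALSE FALSE
```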

Date tolerance

Use the tol.date= argument to change how dates are compared. By default, they're expected to be equal to each other.

Other data type tolerances

Use the tol.other= argument to change how other objects are compared. By default, they're expected to be identical().

Specifying tolerances for each variable

You can also provide a list of tolerances (short-hand strings or functions) to comparedf.control(), using names to identify the variables they apply to:

comparedf.control(tol.char = list(
  "none",      # the default
  x1 = "case", # be case-insensitive for the variable "x1"
  x2 = function(x, y) tol.NA(x, y, x != y | y == "NA") # a custom-defined tolerance
))

User-defined tolerance functions

Details

The comparedf.control() function accepts functions for any of the tolerance arguments in addition to the short-hand character strings. This allows users to create custom tolerance functions to suit their needs.

Any custom tolerance function must accept two vectors as arguments and return a logical vector of the same length. The TRUEs in the results should correspond to elements which are deemed "different". Note that the numeric and date tolerance functions should also include a third argument for tolerance size (even if it's not used).

CAUTION: the results should not include NAs, since the logical vector is used to subset the input data.frames. The tol.NA() function is useful for considering any NAs in the two vectors (but not both) as differences, in addition to other criteria.

The tol.NA() function is used in all default tolerance functions to help handle NAs.
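The contract described above (two vectors in, a logical vector with no NAs out, TRUE meaning "different") can be sketched in base R. The helper na_safe() below is a hypothetical stand-in written for this example; it mimics the documented behavior of treating an NA in exactly one vector as a difference, but it is not arsenal's tol.NA().

```r
# Hypothetical stand-in for the NA handling described above (not arsenal's tol.NA())
na_safe <- function(x, y, idx) {
  one_na  <- xor(is.na(x), is.na(y))   # NA in exactly one vector: a difference
  both_ok <- !is.na(x) & !is.na(y)     # idx is only used where both values exist
  one_na | (both_ok & idx)
}

# Custom numeric tolerance: flag only decreases from x to y.
# The third argument is accepted but unused, as the text above requires.
tol_decrease <- function(x, y, tol) na_safe(x, y, x > y)

x <- c(1, NA, 5, NA)
y <- c(2, 3, 4, NA)
res <- tol_decrease(x, y, tol = 0)

print(res)  # FALSE TRUE TRUE FALSE
```

Note that x > y alone would return NA in the second and fourth positions; wrapping it keeps the result free of NAs so it can safely be used for subsetting.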

Example 1

Suppose we want to ignore any dates which are later in the second dataset than the first. We define a custom tolerance function.

my.tol <- function(x, y, tol)
{
  tol.NA(x, y, x > y)
}

date.df1 <- data.frame(dt = as.Date(c("2017-09-07", "2017-08-08", "2017-07-09", NA)))
date.df2 <- data.frame(dt = as.Date(c("2017-10-01", "2017-08-08", "2017-07-10", "2017-01-01")))
n.diffs(comparedf(date.df1, date.df2)) # default finds any differences
n.diffs(comparedf(date.df1, date.df2, tol.date = my.tol)) # our function identifies only the NA as different...
n.diffs(comparedf(date.df2, date.df1, tol.date = my.tol)) # ... until we change the argument order

Example 2

(Continuing our mockstudy example)

Suppose we're okay with NAs getting replaced by -9.

tol.minus9 <- function(x, y, tol)
{
  idx1 <- is.na(x) & !is.na(y) & y == -9
  idx2 <- tol.num.absolute(x, y, tol) # find other absolute differences
  return(!idx1 & idx2)
}

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels",       # match only factor labels
                factor.as.char = TRUE,       # compare factors and characters
                tol.char = "case",           # ignore case in character vectors
                tol.num = tol.minus9         # ignore NA -> -9 changes
))

Extract Differences

Differences can be easily extracted using the diffs() function. If you only want to determine how many differences were found, use the n.diffs() function.

cmp <- comparedf(mockstudy, mockstudy2, by = "case", tol.vars = c("._ ", "case"), int.as.num = TRUE)
n.diffs(cmp)
head(diffs(cmp))

Differences can also be summarized by variable.

diffs(cmp, by.var = TRUE)

To report differences from only a few variables, one can pass a character vector of variable names to diffs().

diffs(cmp, vars = c("ps", "ast"), by.var = TRUE)
diffs(cmp, vars = c("ps", "ast"))

Appendix

Structure of the Object

(This section is just as much for my use as for yours!)

obj <- comparedf(mockstudy, mockstudy2, by = "case")

There are two main objects in the "comparedf" object, each with its own print method.

The frame.summary contains:

print(obj$frame.summary)

The vars.summary contains:

print(obj$vars.summary)


arsenal documentation built on June 5, 2021, 1:06 a.m.