The purpose of diffdf is to provide functionality similar to SAS's proc compare in R, for use in second-line programming. In particular, we focus on raising warnings if any differences are found whilst providing in-depth diagnostics to highlight where these differences have occurred.

Here we show the basic functionality of diffdf using a dummy data set.

    library(diffdf)

    LENGTH <- 30

    # Fix the RNG behaviour and the seed so the dummy data are reproducible
    suppressWarnings(RNGversion("3.5.0"))
    set.seed(12334)

    # One column per common data type
    test_data <- tibble::tibble(
        ID          = 1:LENGTH,
        GROUP1      = rep(c(1, 2), each = LENGTH / 2),
        GROUP2      = rep(c(1:(LENGTH / 2)), 2),
        INTEGER     = rpois(LENGTH, 40),
        BINARY      = sample(c("M", "F"), LENGTH, replace = TRUE),
        DATE        = lubridate::ymd("2000-01-01") + rnorm(LENGTH, 0, 7000),
        DATETIME    = lubridate::ymd_hms("2000-01-01 00:00:00") + rnorm(LENGTH, 0, 200000000),
        CONTINUOUS  = rnorm(LENGTH, 30, 12),
        CATEGORICAL = factor(sample(c("A", "B", "C"), LENGTH, replace = TRUE)),
        LOGICAL     = sample(c(TRUE, FALSE), LENGTH, replace = TRUE),
        CHARACTER   = stringi::stri_rand_strings(LENGTH, rpois(LENGTH, 13), pattern = "[ A-Za-z0-9]")
    )

    test_data

    # Comparing a dataset against itself
    diffdf(test_data, test_data)

As you would expect, no differences are found. We now introduce various types of differences into the data to show how diffdf highlights them. Note that for the purposes of this vignette we have used the suppress_warnings argument to stop warnings being raised; it is, however, recommended that this option is not used in production code as it may mask problems.

    # Missing column: drop the sixth column (DATE)
    test_data2 <- test_data
    test_data2 <- test_data2[, -6]
    diffdf(test_data, test_data2, suppress_warnings = TRUE)

    # Missing rows: drop the last two rows
    test_data2 <- test_data
    test_data2 <- test_data2[1:(nrow(test_data2) - 2), ]
    diffdf(test_data, test_data2, suppress_warnings = TRUE)

    # Different value: change GROUP1 (column 2) in row 5
    test_data2 <- test_data
    test_data2[5, 2] <- 6
    diffdf(test_data, test_data2, suppress_warnings = TRUE)
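
    # Different type: convert GROUP1 (column 2) from numeric to character.
    # [[2]] extracts the column as a vector; on a tibble, [, 2] returns a
    # one-column tibble, which as.character() would deparse into one string.
    test_data2 <- test_data
    test_data2[[2]] <- as.character(test_data2[[2]])
    diffdf(test_data, test_data2, suppress_warnings = TRUE)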

    # Different attributes: give the ID column a different label in each dataset
    test_data2 <- test_data
    attr(test_data$ID, "label") <- "This is an interesting label"
    attr(test_data2$ID, "label") <- "what do I type here?"
    diffdf(test_data, test_data2, suppress_warnings = TRUE)

    # Different factor levels: relabel the levels of CATEGORICAL
    test_data2 <- test_data
    levels(test_data2$CATEGORICAL) <- c(1, 2, 3)
    diffdf(test_data, test_data2, suppress_warnings = TRUE)

A key feature of diffdf that enables easier diagnostics is the ability to specify which variables uniquely identify a row, i.e. which rows should be compared against each other based upon a key. By default, if no key is specified, diffdf will use the row numbers as the key. In general this isn't recommended, as it means that two identical datasets that are simply sorted differently can lead to incomprehensible messages because every observation is flagged as different. In diffdf, keys can be specified as character vectors using the keys argument.

    # Change three INTEGER values, then compare using GROUP1 + GROUP2 as the key
    test_data2 <- test_data
    test_data2$INTEGER[c(5, 2, 15)] <- 99L
    diffdf(test_data, test_data2, keys = c("GROUP1", "GROUP2"), suppress_warnings = TRUE)

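
The sort-order problem described above can be illustrated with a minimal base R sketch. It is not how diffdf is implemented; the data frames df_a and df_b and the use of merge() are assumptions made purely for illustration.

```r
# Two copies of the same data, one of them sorted in reverse order
df_a <- data.frame(ID = 1:4, VALUE = c(10, 20, 30, 40))
df_b <- df_a[order(-df_a$ID), ]
rownames(df_b) <- NULL

# Comparing by row position pairs up unrelated observations,
# so every row appears to differ
positional_mismatch <- df_a$VALUE != df_b$VALUE
positional_mismatch

# Matching on the key first pairs each ID with itself,
# so no row is reported as different
merged <- merge(df_a, df_b, by = "ID", suffixes = c(".BASE", ".COMP"))
keyed_mismatch <- merged$VALUE.BASE != merged$VALUE.COMP
keyed_mismatch
```

Supplying a key plays the same role as the merge step here: rows are paired by identity rather than by position.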
As an additional utility, diffdf comes with the function diffdf_issuerows(), which can be used to subset your dataset against the issue object to return just the rows that are flagged as containing issues.
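
    # Introduce one difference in each of the first three rows
    iris2 <- iris
    for (i in 1:3) iris2[i, i] <- 99
    diff <- diffdf(iris, iris2, suppress_warnings = TRUE)

    # Rows with issues, as they appear in each dataset
    diffdf_issuerows(iris, diff)
    diffdf_issuerows(iris2, diff)
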
Bear in mind that the vars option can be used to subset down to just the issues associated with particular variables.

    diffdf_issuerows(iris2, diff, vars = "Sepal.Length")
    diffdf_issuerows(iris2, diff, vars = c("Sepal.Length", "Sepal.Width"))

Sometimes it can be useful to use the comparison result to drive further checks or programming logic. To assist with this, diffdf offers two pieces of functionality: the suppress_warnings argument (which has already been shown) and the diffdf_has_issues() helper function, which simply returns TRUE if differences have been found and FALSE otherwise.

    iris2 <- iris
    for (i in 1:3) iris2[i, i] <- 99
    diff <- diffdf(iris, iris2, suppress_warnings = TRUE)
    diffdf_has_issues(diff)

    if (diffdf_has_issues(diff)) {
        # <Further programming steps / logic>
    }

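
One way the placeholder logic could be filled in is to abort a script as soon as two datasets differ. The sketch below assumes the diffdf package is installed; assert_no_differences() is a hypothetical helper written for this example and is not part of diffdf.

```r
library(diffdf)

# Hypothetical helper (not part of diffdf): stop with an error
# when the comparison finds any differences
assert_no_differences <- function(base, compare, ...) {
    result <- diffdf(base, compare, suppress_warnings = TRUE, ...)
    if (diffdf_has_issues(result)) {
        stop("Datasets differ; inspect the diffdf output for details", call. = FALSE)
    }
    invisible(result)
}

iris2 <- iris
iris2[1, 1] <- 99

# Identical datasets pass silently
assert_no_differences(iris, iris)

# Differing datasets raise an error, captured here as a message
outcome <- tryCatch(
    assert_no_differences(iris, iris2),
    error = function(e) conditionMessage(e)
)
outcome
```

Keeping suppress_warnings = TRUE inside the helper is a deliberate choice in this sketch: the error replaces the warning, so differences are still never silently ignored.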
You can use the tolerance argument of diffdf to define how sensitive the comparison should be to decimal place inaccuracies. This is important because floating point numbers very often will not compare equal due to machine rounding, as many of them cannot be perfectly represented in binary. By default, tolerance is set to sqrt(.Machine$double.eps).

    dsin1 <- data.frame(x = 1.1e-06)
    dsin2 <- data.frame(x = 1.1e-07)

    # Default tolerance
    diffdf(dsin1, dsin2, suppress_warnings = TRUE)

    # Looser tolerance
    diffdf(dsin1, dsin2, tolerance = 0.001, suppress_warnings = TRUE)

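
The rounding problem that motivates a tolerance can be seen in base R alone. The sketch below uses a simple absolute-difference check, which is an assumption for illustration and may not match the exact rule diffdf applies internally.

```r
# 0.1 and 0.2 cannot be represented exactly in binary,
# so their sum is not exactly the stored value of 0.3
x <- 0.1 + 0.2
y <- 0.3
exact_equal <- x == y
exact_equal

# Treat the values as equal when they differ by less than a small tolerance
tol <- sqrt(.Machine$double.eps)
within_tol <- abs(x - y) < tol
within_tol
```

An exact comparison reports a difference here even though the two values are, for practical purposes, the same number.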
By default, the function will note a difference between integer and double columns, and between factor and character columns. It can be useful in some contexts to prevent this from occurring. We can do so with the strict_numeric = FALSE and strict_factor = FALSE arguments.

    # Integer vs double: flagged by default, ignored with strict_numeric = FALSE
    dsin1 <- data.frame(x = as.integer(c(1, 2, 3)))
    dsin2 <- data.frame(x = as.numeric(c(1, 2, 3)))
    diffdf(dsin1, dsin2, suppress_warnings = TRUE)
    diffdf(dsin1, dsin2, suppress_warnings = TRUE, strict_numeric = FALSE)

    # Character vs factor: flagged by default, ignored with strict_factor = FALSE
    dsin1 <- data.frame(x = as.character(c(1, 2, 3)), stringsAsFactors = FALSE)
    dsin2 <- data.frame(x = as.factor(c(1, 2, 3)))
    diffdf(dsin1, dsin2, suppress_warnings = TRUE)
    diffdf(dsin1, dsin2, suppress_warnings = TRUE, strict_factor = FALSE)

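
The distinction that the strict checks pick up can be reproduced in base R. This sketch only illustrates why columns holding the same values can still count as different; the variable names are made up for the example and it does not reflect diffdf's internals.

```r
# Integer and double vectors holding the same values
int_col <- c(1L, 2L, 3L)
dbl_col <- c(1, 2, 3)

same_values_numeric <- all(int_col == dbl_col)      # values agree
same_object_numeric <- identical(int_col, dbl_col)  # storage types differ

# Character and factor vectors holding the same labels
chr_col <- c("1", "2", "3")
fct_col <- factor(chr_col)

same_values_factor <- all(as.character(fct_col) == chr_col)  # labels agree
same_object_factor <- identical(fct_col, chr_col)            # classes differ

c(same_values_numeric, same_object_numeric, same_values_factor, same_object_factor)
```

Relaxing the strict arguments corresponds to caring only about the value-level comparison and not the type-level one.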