Introduction to testdat"
In testdat: Data Unit Testing for R

knitr::opts_chunk$set(collapse = TRUE, comment = "#>", error = TRUE)
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(testdat)
library(dplyr)

testdat is a package designed to ease data validation, particularly for complex data processing, inspired by software unit testing. testdat extends the strong and flexible unit testing framework already provided by testthat with a family of functions and reporting tools focused on checking data frames.

Features include:

A fully fledged test framework so you can spend more time specifying tests and less time running them
A set of common methods for simply specifying data validation rules
Repeatability of data tests (avoid unintentionally breaking your data set!)
Data-focused reporting of test results

Getting started

As an extension of testthat, testdat uses the same basic testing framework. Before using testdat (and reading this documentation), make sure you're familiar with the introduction to testthat in R packages.

The main addition provided by testdat is a set of expectations designed for testing data frames and an accompanying mechanism to globally specify a data set for testing.

Data expectations

Overview

In general, our approach to data testing is variable-centric - the majority of expectations perform a check on one or more variables in a given data frame.

The standard form of a data expectation is:

expect_*(var(s), ..., flt = TRUE, data = get_testdata())

testdat uses dplyr at its core, and thus supports tidy evaluation.

Variables

Most operations act on one or more variables. There are two variants of the variable argument:

vars requires a set of columns specified as tidy selections.

test_that("multi-variable identifier is unique", {
  expect_unique(c(name, year, month, day, hour), data = storms)
})

var requires an unquoted variable name. This only applies to a small number of expectations.

test_that("hour values are valid", {
  expect_base(ts_diameter, year >= 2004)
})

Filter

The flt argument takes a logical predicate defined in terms of the variables in data using data masking. Only rows where the condition evaluates to TRUE are included in the test.

test_that("iris range checks", {
  expect_range(Petal.Width, 0, 1, data = iris)
})

test_that("iris range checks filtered", {
  # Test passes for setosa rows
  expect_range(Petal.Width, 0, 1, flt = Species == "setosa", data = iris)
  # Failures will provide the filter
  expect_range(Petal.Width, 0, 0.5, flt = Species == "setosa", data = iris)
})

Data

The data argument takes a data frame to test. To avoid redundant code the data argument defaults to a global test data set retrieved using get_testdata(). This can be used in two ways:

Setting the global test data using set_testdata().

set_testdata(iris)
identical(get_testdata(), iris)

test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
})

Using the with_testdata() wrapper to temporarily set the global test data for a block of code.

set_testdata(mtcars)
identical(get_testdata(), mtcars)

with_testdata(iris, {
  test_that("Versicolor has sepal length greater than 5 - will fail", {
    expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
  })
})

identical(get_testdata(), mtcars)

Both approaches are equivalent to:

test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5, data = iris)
})

By default, set_testdata() stores a quosure with a reference to the provided data frame rather than the data itself, so changes made to the data frame will be reflected in the test results.

tmp_data <- tibble(x = c(1, 0), y = c(1, NA))

set_testdata(tmp_data)
print(get_testdata())
expect_base(y, x == 1)

tmp_data$y <- 1
print(get_testdata())
expect_base(y, x == 1)

`...`

Additional arguments are specific to the expectation. See the help page for the expectation function for details.

Using tests

Testing inside a script

The easiest way to use data testing is directly inside an R script. Expecations and test blocks throw an error if they fail, so it is very clear to the user that something needs to be checked, and the script will fail when sourced.

The example below shows how to adapt a script from exploratory "print and check" to a testing approach, using a simple data processing exercise.

Print and check

library(dplyr)

x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1,   2000,   "NSW",  1,
  2,   3123,   "VIC",  NA,
  3,   2123,   "NSW",  3,
  4,   12345,  "VIC",  3
)

# check id is unique
x %>% filter(duplicated(id))

# check values
x %>% filter(!pcode %in% 2000:3999)
x %>% count(state)
x %>% count(nsw_only)

# check base for nsw_only variable
x %>% filter(state != "NSW") %>% count(nsw_only)

x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))

x %>% count(market)

testdat

library(testdat)
library(dplyr)

x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1,   2000,   "NSW",  1,
  2,   3123,   "VIC",  NA,
  3,   2123,   "NSW",  3,
  4,   12345,  "VIC",  3
)

with_testdata(x, {
  test_that("id is unique", {
    expect_unique(id)
  })

  test_that("variable values are correct", {
    expect_values(pcode, 2000:2999, 3000:3999)
    expect_values(state, c("NSW", "VIC"))
    expect_values(nsw_only, 1:3) # by default expect_values allows NAs
  })

  test_that("filters applied correctly", {
    expect_base(nsw_only, state == "NSW")
  })
})

x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))

with_testdata(x, {
  test_that("market derived correctly", {
    expect_values(market, 1:2, miss = NULL) # miss = NULL excludes NAs from valid values
  })
})

Note that:

The global test data can be set with set_testdata() or with_testdata().
The test_that() wrapper is not strictly necessary, but it provides more informative error messaging and logically groups a set of expectations.
Each with_testdata() block will fail on the first test_that() block that fails. The "filters applied correctly" test block is not run in the example.

Using a test suite

Setting up a proper project test suite can be useful for automatically validating final processed data sets.

Setting up proper file based testing infrastructure is out of the scope of this vignette. See R packages for a brief introduction to testing infrastructure.

In testthat, related tests are grouped into files. When using testdat, each file should have a test data set specified with a call to set_testdata().

set_testdata(iris)

test_that("Variable format checks", {
  expect_regex(Species, "^[a-z]+$")
})

test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
})

Any scripts or data that you put into this service are public.

testdat documentation built on Sept. 4, 2023, 1:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

testdat
Data Unit Testing for R

Introduction to testdat"
In testdat: Data Unit Testing for R

Getting started

Data expectations

Overview

Variables

Filter

Data

`...`

Categories

Using tests

Testing inside a script

Print and check

testdat

Using a test suite

Try the testdat package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

testdat Data Unit Testing for R

Introduction to testdat" In testdat: Data Unit Testing for R

Getting started

Data expectations

Overview

Variables

Filter

Data

...

Categories

Using tests

Testing inside a script

Print and check

testdat

Using a test suite

Try the testdat package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

testdat
Data Unit Testing for R

Introduction to testdat"
In testdat: Data Unit Testing for R

`...`