sanityTracker

An R package that keeps track of all performed sanity checks.

When preparing data sets, one usually performs some sanity checks. This package centralizes those checks, irrespective of where in the code they are performed, so that all of them can be listed at once, together with example rows for every check that failed.

Example

Assume you process a data set with separate functions for different aspects. Within those functions you can perform checks, document what was checked and what the outcome was, and store some examples where a check failed. At the end you can summarize the performed checks (which might be scattered all over your source code) and their outcomes.
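The centralizing idea can be sketched in a few lines of base R. This is not how sanityTracker is implemented; `record_check()`, `list_checks()` and `.check_store` are hypothetical names used only for illustration. Checks recorded from anywhere in the code end up in one shared store that can be listed at the end.

```r
# Hypothetical stand-in for a central store of checks; not the package's code.
.check_store <- new.env()
.check_store$rows <- list()

# Record one check: count the failures and keep a few failing rows as examples.
# (This sketch keeps the first failing rows rather than a random sample.)
record_check <- function(fail_vec, description, data = NULL, example_size = 3) {
  fail_idx <- which(fail_vec)  # which() ignores NA entries
  example <- NULL
  if (!is.null(data)) {
    example <- data[head(fail_idx, example_size), , drop = FALSE]
  }
  entry <- list(
    description = description,
    n = length(fail_vec),
    n_fail = length(fail_idx),
    example = example
  )
  .check_store$rows[[length(.check_store$rows) + 1]] <- entry
  invisible(entry)
}

# Summarize everything recorded so far, regardless of where it was recorded.
list_checks <- function() {
  do.call(rbind, lapply(.check_store$rows, function(e) {
    data.frame(description = e$description, n = e$n, n_fail = e$n_fail,
               stringsAsFactors = FALSE)
  }))
}

# Two checks on a toy data frame, recorded independently of each other
df <- data.frame(id = c(1, 2, 2), value = c(10, -5, 3))
record_check(duplicated(df$id), "No duplicated ids", data = df)
record_check(df$value < 0, "Values are non-negative", data = df)
list_checks()
```

The package offers the same workflow through `add_sanity_check()` and `get_sanity_checks()`, as the example that follows shows.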

raw_data <- data.frame(
  id = 1:4, 
  start = c("2020-04-12", "2010-01-20", "2020-02-20", "2020-01-23"),
  end = c("2020-03-13", "2020-01-26", "2020-03-01", "2020-01-26"),
  height_m = c(1.77, 144, 1.89, 1.74),
  stringsAsFactors = FALSE)

For illustration we consider a very simple data set:

raw_data

We have two simple data-preparation functions for the raw data set:

correct_height <- function(raw_data) {
  ret <- raw_data
  # functions starting with sc_ are convenience functions that the
  # package offers for ease of use
  sc <- sanityTracker::sc_cols_bounded_above(
    object = ret,
    cols = "height_m",
    upper_bound = 100,
    description = "Persons are smaller than 100m",
    counter_meas = "Divide by 100. Assume height is given in cm"
  )

  if (sc[["height_m"]][["fail"]]) {
    fail_vec <- sc[["height_m"]][["fail_vec"]]
    ret$height_m[fail_vec] <- ret$height_m[fail_vec] / 100
  }

  sanityTracker::sc_cols_bounded(
    object = ret,
    cols = "height_m",
    rule = "[0.8, 2.5]",
    description = "Persons are between 0.8m and 2.5m"
  )
  return(ret)
}

prep <- function(raw_data) {

  sanityTracker::sc_cols_unique(
    object = raw_data,
    cols = "id",
    description = "No duplicated ids"
  )

  raw_data$start <- as.Date(raw_data$start)
  raw_data$end <- as.Date(raw_data$end)
  # sanity checks can be recorded with add_sanity_check()
  # as long as a logical vector exists
  sanityTracker::add_sanity_check(
    fail_vec = raw_data$end < raw_data$start,
    description = "start-date <= end-date",
    data = raw_data
  )

  ret <- correct_height(raw_data = raw_data)  
  return(ret)
}
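To see what the logical vectors behind these checks look like, the date and height conditions can be evaluated on their own in base R. This is only a sketch that mirrors the conditions used in `prep()` and `correct_height()` with the values of the example data set; it does not call the package, and `date_fail` and `height_fail` are names chosen for illustration.

```r
# Values copied from the example data set
start    <- as.Date(c("2020-04-12", "2010-01-20", "2020-02-20", "2020-01-23"))
end      <- as.Date(c("2020-03-13", "2020-01-26", "2020-03-01", "2020-01-26"))
height_m <- c(1.77, 144, 1.89, 1.74)

# Date check: TRUE where the end date lies before the start date
date_fail <- end < start
which(date_fail)

# Height check: TRUE where the value exceeds the upper bound of 100
height_fail <- height_m > 100
which(height_fail)

# Counter measure from correct_height(): treat flagged values as centimetres
height_m[height_fail] <- height_m[height_fail] / 100
height_m
```

Only the first row fails the date check and only the second row fails the height check. After dividing by 100, all heights lie within [0.8, 2.5].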

After applying the prep function, we can summarize the sanity checks:

wrangled_data <- prep(raw_data = raw_data)
sanity_checks <- sanityTracker::get_sanity_checks()
sanity_checks

This gives a direct overview of which checks were performed, how often each check failed, which countermeasure was applied and, if a check failed, random rows of the data set (by default at most 3) where it failed.

sanity_checks[2, ]
sanity_checks[2, ]$example
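To look only at the checks that failed, the summary can be subset on its failure count. The sketch below uses a hand-built stand-in data frame instead of the real return value of `get_sanity_checks()`, and it assumes that the failure count is stored in a column named `n_fail`; check the column names of your own summary before relying on this.

```r
# Stand-in for the summary table; column names are an assumption.
checks <- data.frame(
  description = c(
    "No duplicated ids",
    "start-date <= end-date",
    "Persons are smaller than 100m",
    "Persons are between 0.8m and 2.5m"
  ),
  n_fail = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only the checks with at least one failing row
failed <- checks[checks$n_fail > 0, ]
failed$description
```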

Installation

You can install the package from CRAN

install.packages("sanityTracker")

or from GitHub

remotes::install_github("MarselScheer/sanityTracker")

sessionInfo

sessionInfo()


MarselScheer/sanityTracker documentation built on Sept. 10, 2021, 10:23 p.m.