deeque

Data Quality control framework for dataframes in R.

With some creativity, checks can be performed on anything that fits into a dataframe. For example, a directory can be checked for the presence of all required files.

purpose

The purpose of this package is to provide easy-to-use functionality for applying data quality control to dataframes (as the main data representation format). The implementation is influenced and inspired by awslabs/deequ and the related paper Automating large-scale data quality verification.

quick start

Install the package from GitHub using the devtools package:

devtools::install_github(repo = "EvgenyPetrovsky/deeque")

Define a test dataset and check it:

library(magrittr)
library(deeque)

# define dataset
test_df <- as.data.frame(datasets::Titanic, stringsAsFactors = TRUE)

# define checks
checks <-
  new_group() %>%
  add_check(new_check(
    description = "Dataset must have column 'Class'",
    severity = "ERROR", function_name = "tab_hasColumn", column = "Class"
  )) %>%
  add_check(new_check(
    "Minimum value of 'Freq' must be a non-negative number",
    "INFO", col_hasMin, column = "Freq", udf = function(x) {x >= 0}
  )) %>%
  add_check(new_check(
    "Column 'Sex' must have values from list [Male, Female]",
    "WARNING", col_isInLOV, column = "Sex", lov = factor(c("Male", "Female"))
  )) %>%
  add_check(new_check(
    "Combination of values in columns 'Class', 'Sex', 'Age', 'Survived' must be unique",
    "WARNING", tab_hasUniqueKey, columns = c("Class", "Sex", "Age", "Survived")
  ))

# show checks
checks %>% convert_checks_to_df()

# verify dataset using checks defined
chk_res <- test_df %>% run_checks(checks)

# view check results in RStudio
View(chk_res %>% convert_run_results_to_df())

# another option: verify and stop execution if the condition is not satisfied
test_df %T>%
  run_checks(
    checks,
    condition = severity_under_threshold(severity$WARNING)
  ) %>%
  head(5)

Or check folder content for all required files:

library(magrittr)
library(deeque)

# dir() function should be called here with proper parameters
# dir_content <- dir(recursive = TRUE)
dir_content <- c(
  "README.txt",
  "config.yaml",
  "input_data/internal/employees.csv",
  "input_data/internal/goals.csv",
  "input_data/internal/incidents.csv",
  "input_data/external/market_rates.csv"
)

# put dir content into data frame
test_df <- data.frame(
  file_name = dir_content,
  stringsAsFactors = FALSE
)

# define checks
checks <-
  new_group() %>%
  add_check(new_check(
    "Data about employees and goals must be present in internal input folder",
    "ERROR", col_hasAllValues, column = "file_name",
    lov = c(
      "input_data/internal/employees.csv",
      "input_data/internal/goals.csv")
  ))

# verify dataset using checks defined
chk_res <- test_df %>% run_checks(checks)

# view check results in RStudio
View(chk_res %>% convert_run_results_to_df())
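
The dir_content vector above is hard-coded. A minimal sketch of building it from a real directory with base R follows. It assumes a throwaway folder under tempdir(); the folder name deeque_demo and the files in it are illustrative only. The final line is a plain base R sanity check, independent of deeque.

```r
# Illustrative only: create a temporary project folder with a few empty files
project_dir <- file.path(tempdir(), "deeque_demo")
dir.create(
  file.path(project_dir, "input_data", "internal"),
  recursive = TRUE, showWarnings = FALSE
)
files <- c(
  "config.yaml",
  "input_data/internal/employees.csv",
  "input_data/internal/goals.csv"
)
invisible(file.create(file.path(project_dir, files)))

# list.files() (dir() is an alias) returns paths relative to `path`
# when full.names = FALSE, which matches the format used in `lov` above
dir_content <- list.files(path = project_dir, recursive = TRUE, full.names = FALSE)

# put dir content into data frame, as in the example above
test_df <- data.frame(file_name = dir_content, stringsAsFactors = FALSE)

# base R sanity check: are all required files present?
required <- c(
  "input_data/internal/employees.csv",
  "input_data/internal/goals.csv"
)
all(required %in% test_df$file_name)
```

The resulting test_df can then be passed to run_checks(checks) exactly as in the example above.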

structure

This data quality framework defines the following building blocks:

standard user flow

A user has a dataset and needs to ensure that its shape and content meet requirements. For this reason:

  1. The user specifies data quality (DQ) checks by describing them declaratively, one by one. Checks are instructions that are applied to data later; they are combined into groups.
  2. The user may export check groups and load them later for reuse or to share with others.
  3. The user runs verification by specifying which group of checks to apply to the dataset.
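
This README does not show how check groups are exported, and the package may provide its own functions for that. As a generic fallback, assuming a check group is an ordinary R object, base R serialization could be used. In the sketch below, checks_stub is a hypothetical stand-in (a plain list), not the real deeque structure.

```r
# Stand-in for a check group: a plain list, NOT the real deeque structure
checks_stub <- list(
  list(
    description = "Dataset must have column 'Class'",
    severity = "ERROR",
    function_name = "tab_hasColumn",
    column = "Class"
  )
)

# Step 2 of the flow: save the group to a file ...
checks_file <- tempfile(fileext = ".rds")
saveRDS(checks_stub, checks_file)

# ... and load it later, possibly in another session or on another machine
restored <- readRDS(checks_file)

# The restored object is identical to the one that was saved
identical(checks_stub, restored)
```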

Results of execution may be:

important information about functions

Some functions operate on statistics (such as min, max, or uniqueness ratio) and return a single logical value, TRUE or FALSE. Others operate at a lower level and return a value for every element, that is, a logical vector. Both cases can be handled with basic data.frame functionality and data manipulation packages such as dplyr. It is up to the user to decide which result to use.

However, when implementing functions, one should consider what the proper result is and return either a vector with one value per checked row or a single value. There is no reason to replicate a single value to the number of rows.
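
The two kinds of result can be illustrated with a small base R sketch. The helper names min_is_non_negative and is_in_lov are hypothetical and are not part of the deeque API; they only show a statistic-level function returning one value and an element-level function returning one value per row.

```r
# Hypothetical helpers; not part of the deeque API

# Statistic-level check: one logical value for the whole column
min_is_non_negative <- function(x) min(x, na.rm = TRUE) >= 0

# Element-level check: one logical value per element
is_in_lov <- function(x, lov) x %in% lov

df <- data.frame(
  Sex = c("Male", "Female", "Unknown"),
  Freq = c(10, 0, 5),
  stringsAsFactors = FALSE
)

stat_result <- min_is_non_negative(df$Freq)            # length 1
elem_result <- is_in_lov(df$Sex, c("Male", "Female"))  # length nrow(df)

# Either result reduces to a single pass/fail flag with all()
all(stat_result)
all(elem_result)

# The element-level result additionally identifies the offending rows
df[!elem_result, , drop = FALSE]
```

Because all() works on a vector of any length, the statistic-level function does not need to repeat its single value for every row.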



EvgenyPetrovsky/deeque documentation built on Jan. 23, 2020, 3:38 a.m.