README.md

Build Status codecov.io

Data Quality package simplifies data validation and logging of errors to the database.

Metadata Schema

Field | Suggested Type | Description ------------ | ------------- | ------------- Date | Timestamp | Time when error was logged Source | String | Data source of the issue Type | String | Type of the issue (enum) Value | String | Value that contains the problem URL | String | Link to the source record

Error types

Type | Description ------ | ---------- Orphan | Foreign key does not match the primary key Wrong | Value does not match the business rule Missing | Value in the field is empty Duplicate | Duplicated record

Example

Code


library(rdqa)
# Defining the rules - here we don't need the data yet
r.smaller <- newConditionRule("small", condition = "< big")
r.unique <- newUniqueRule("big")
r.name.required <- newRequiredRule("name")

# Combine the rules in Rules Container
all.rules <- newRulesContainer("test.data.td", r.smaller, r.unique, r.name.required)

# Connect to the db where we want to log the results
conn <- dbConnect(dbDriver("SQLite"), "demo.db")

# We only need data just before the validation
dt <- data.table(id = c(1, 2, 3, 4, 5),
                 small = c(1, 10, 2, 3, 4),
                 big = c(10, 1, 20, 20, 40),
                 name = c("a", "", "c", "d", NA_character_),
                 key = "id")

# This call will validate the data againt all three rules
# and log results to the database
validateRules(conn, all.rules, dt)

# Check what we have in the errors table
res <- dbGetQuery(conn, "SELECT * FROM errors")


Expected result

date | source | type | rule | ref | value | url -------- | -------- |----- | -------------------------- | --- | ----- | -------------------------- 1486681846 | test.data.td | Condition | Field [small] should matrch condition: < big |2|2|NA 1486685164 | test.data.td | Unique | Field(s) big should be uqniue | 4 | 4 | NA 1486685164 | test.data.td | Required | Field [name] should ont be empty | 2 | 2 | NA 1486685164 | test.data.td | Required | Field [name] should ont be empty | 5 | 5 | NA

Defining rules through Schema object

Schema object allows to define rules in a more readable layout with rules seating inside the data structure:


# Lets say we want to define rules for customer data, here is a sample schema:

schema.customers <- Schema(
"customer.data",
  schema = list(
    list(
      name = "id",
      description = "This is an integer primary key for our customer table",
      class = "integer", 
      required = TRUE,
      unique = TRUE
    ),
    list(
      name = "name",
      class = "character",
      regex = "\\w"
    ),
    list(
      name = "gender",
      class = "character",
      enum = c("male", "female")
    )
  ),
  rules = list(
    newConditionRule("id", "> 0"),
    newConditionRule("name", condition = expression(nchar(name) < 12))
  )
) 

# These are our customers:
customers <- data.table(
  id = c(1L, 2L, NA_integer_, 3L, 4L, -1L),
  name = c("John", "Isabellarose", "Anna", "Bob", NA_character_, ""),
  gender = c("male", "other", "female", "female", "male", "male"),
  key = "id"
)

# Validate rules and log problems
errors <- validate(schema.customers, customers)
print(errors)

Errors table for this example will have:

n | ref | value | type -- | --- | ----- |----- 1: | -1 | -1 |Condition 2: | 2 | Isabellarose | Condition 3: | NA | NA | Required 4: | -1 | | Regex 5: | 2 | other | Enum

This specification can be used within ETL or data import procedure to identify records with erros.

It will also raise error if column names or types don't match the schema. You can also stop execution based on the records in errors ouput. e.g.:

assert_that(nrow(errors) == 0)

Once you have schemas set up for your data processes it is quite easy to add monitoring suite using errors table.



byapparov/rdqa documentation built on May 13, 2019, 9:54 a.m.