README.md
In byapparov/rdqa: Data Quality Logging Pagckage

Data Quality package simplifies data validation and logging of errors to the database.

Field | Suggested Type | Description ------------ | ------------- | ------------- Date | Timestamp | Time when error was logged Source | String | Data source of the issue Type | String | Type of the issue (enum) Value | String | Value that contains the problem URL | String | Link to the source record

Type | Description ------ | ---------- Orphan | Foreign key does not match the primary key Wrong | Value does not match the business rule Missing | Value in the field is empty Duplicate | Duplicated record


library(rdqa)
# Defining the rules - here we don't need the data yet
r.smaller <- newConditionRule("small", condition = "< big")
r.unique <- newUniqueRule("big")
r.name.required <- newRequiredRule("name")

# Combine the rules in Rules Container
all.rules <- newRulesContainer("test.data.td", r.smaller, r.unique, r.name.required)

# Connect to the db where we want to log the results
conn <- dbConnect(dbDriver("SQLite"), "demo.db")

# We only need data just before the validation
dt <- data.table(id = c(1, 2, 3, 4, 5),
                 small = c(1, 10, 2, 3, 4),
                 big = c(10, 1, 20, 20, 40),
                 name = c("a", "", "c", "d", NA_character_),
                 key = "id")

# This call will validate the data againt all three rules
# and log results to the database
validateRules(conn, all.rules, dt)

# Check what we have in the errors table
res <- dbGetQuery(conn, "SELECT * FROM errors")

date | source | type | rule | ref | value | url -------- | -------- |----- | -------------------------- | --- | ----- | -------------------------- 1486681846 | test.data.td | Condition | Field [small] should matrch condition: < big |2|2|NA 1486685164 | test.data.td | Unique | Field(s) big should be uqniue | 4 | 4 | NA 1486685164 | test.data.td | Required | Field [name] should ont be empty | 2 | 2 | NA 1486685164 | test.data.td | Required | Field [name] should ont be empty | 5 | 5 | NA

Schema object allows to define rules in a more readable layout with rules seating inside the data structure:


# Lets say we want to define rules for customer data, here is a sample schema:

schema.customers <- Schema(
"customer.data",
  schema = list(
    list(
      name = "id",
      description = "This is an integer primary key for our customer table",
      class = "integer", 
      required = TRUE,
      unique = TRUE
    ),
    list(
      name = "name",
      class = "character",
      regex = "\\w"
    ),
    list(
      name = "gender",
      class = "character",
      enum = c("male", "female")
    )
  ),
  rules = list(
    newConditionRule("id", "> 0"),
    newConditionRule("name", condition = expression(nchar(name) < 12))
  )
) 

# These are our customers:
customers <- data.table(
  id = c(1L, 2L, NA_integer_, 3L, 4L, -1L),
  name = c("John", "Isabellarose", "Anna", "Bob", NA_character_, ""),
  gender = c("male", "other", "female", "female", "male", "male"),
  key = "id"
)

# Validate rules and log problems
errors <- validate(schema.customers, customers)
print(errors)

Errors table for this example will have:

n | ref | value | type -- | --- | ----- |----- 1: | -1 | -1 |Condition 2: | 2 | Isabellarose | Condition 3: | NA | NA | Required 4: | -1 | | Regex 5: | 2 | other | Enum

This specification can be used within ETL or data import procedure to identify records with erros.

It will also raise error if column names or types don't match the schema. You can also stop execution based on the records in errors ouput. e.g.: