Data quality indicators with the validate package

library(validate)

We assume that the reader went through the first couple of sections of the introductory vignette.

In the validate package, an 'indicator' is a rule or function that takes as input a data set and outputs a number. Indicators are usually designed to be easily interpretable by domain experts and therefore depend strongly on the application. In 'validate' users are free to specify indicator. By specifing them separate from the programming workflow, they can be treated as first-class objects: indicator specs can be maintained, version-controlled, and documented in separate files (just like validation rules.)

Workflow

Here is a simple example of the workflow.

i <- indicator(
    mh  = mean(height)
  , mw  = mean(weight)
  , BMI = (weight/2.2046)/(height*0.0254)^2 )
ind <- confront(women, i)

In the first statement we define an indicator object storing indicator expressions. Next, we confront a dataset with these indicators. The result is an object of class indication. It prints as follows.

ind

To study the results, the object can be summarized.

summary(ind)

Observe that the first two indicators result in a single value (mh, mw) and the third one results in 15 values (BMI). The columns error and warning indicate wether calculation of the indicators was problematic.

A specific problem that may occur is when the result of an indicator is non-numeric.

jj <- indicator(mh = mean(height), a = {"A"})

here, the second 'indicator' is an expression that always yields a constant (the character string "A").

cf <- confront(women, jj)
cf
warnings(cf)

Getting the values

Values can be obtained with the values function, or by converting to a data.frame.

We add a unique identifier (this is optional) to make it easier to connect results with the data.

women$id <- letters[1:15]

Compute indicators and convert to data.frame.

ind <- confront(women, i,key="id")
(out <- as.data.frame(ind))

Observe that there is no key for indicators mh and mw since these are constructed from multiple records.

Indicators and data.frames

Indicators can be constructed from and coerced to data.frames. To define an indicator you need to create a data.frame that at least has a character column called rule. All other columns are optional.

idf <- data.frame(
  rule = c("mean(height)","sd(height)")
  , label = c("average height", "std.dev height")
  , description = c("basic statistic","fancy statistic")
)
i <- indicator(.data=idf)
i

Now, confront with data and merge the results back with rule metadata.

quality <- as.data.frame(confront(women, i))
measures <- as.data.frame(i)
merge(quality, measures)


Try the validate package in your browser

Any scripts or data that you put into this service are public.

validate documentation built on Aug. 8, 2017, 1:06 a.m.