library(datacheck)
library(xtable)
options(xtable.type = 'html')

A vignette for datacheck (version r pkg_version("datacheck"))

Reinhard Simon, International Potato Center, Lima, Peru

The library datacheck provides some simple functions to check the consistency of a dataset. It assumes data are available in tabular format - typically a csv file with objects or records in rows and attributes or variables in the columns.

In a database setting the variables would be controlled by the database - at least conformance to types (character, numeric, etc) and allowed min/maximum values. However, often data are gathered in simple spreadsheets or are for other reasons without such constraints. Here, data constraints like allowed types or values, expected values and relationships can be defined using R commands and syntax. This allows much more flexibility and fine grained control. Typically it demands also a lot of domain knowledge from the user. It is therefore often useful to re-use such domain aware rule files across tables with similar content. Therefore this tool is foregiving if rules cannot be executed if a variable is not present in the table to be analyzed allowing the reuse of such rule files.

Using the HTML interface

Use the following commands to copy some example files to your current working directory (uncomment the file.copy command):

atable = system.file("examples/soilsamples.csv", package="datacheck")
srules = system.file("examples/soil_rules.R", package="datacheck")

# Uncomment the next two lines

# file.copy(atable, "soilsamples.csv")
# file.copy(srules, "soil_rules.R")

Then type in the command run_datacheck() in the R editor.

Use the upload buttons to load the respective files in your working directory. Review the results.

Using the command line interface

Assuming you have copied the above mentioned files in your working directory proceed to read in the data.

atable = read.csv(atable, header = TRUE, stringsAsFactors = FALSE)
srules = read_rules(srules)
profil = datadict_profile(atable, srules)

You can inspect a graphical summary of rules per variable:

rule_coverage(profil)

The cumulative number of records with increasing scores.

score_sum(profil)

Or see the tables (only the first 20 records and first 6 columns shown):

xtable(atable[1:20, 1:6])

Similarly for the score table; however, this table contains also the total counts of scores by records and variables. In addition, the maximum score by variable.

ps = profil$scores
recs = c(1:10, nrow(ps)-1, nrow(ps))
cols = c(1:4,  ncol(ps))
xtable(ps[recs, cols])

A last visualization is a heatmap of the score table to organize similar records and similar rule profiles to help detect any patterns,

#filter out only records with less than maximum points
mp = max(ps$Record.score[nrow(ps)-2])

heatmap_quality(profil, scoreMax = mp)

Checking tables with data inconsistencies

For comparative purposes we purposely introduce a few errors in our table as below. We also exclude a rule on soil types for better display.

atable$P[1]  = -100
atable$pH[11]= -200
srule1 = srules[-c(33),]
profil = datadict_profile(atable, srule1)

To get a better handle on the data it is always informative to review simple descriptive summaries of the data. A custom summary function is included in the package to display this summary in tabular form:

xtable(short_summary(atable))

A summary of the results by rule can be seen from the profil object:

xtable(profil$checks)

The checks part lists all erroneous records in the last column for each rule. This may be too long for printing. To this end a custom print report function only displays the first n records where n=5 is the default.

atable$Sand[20:30] = -1
profil = datadict_profile(atable, srule1)
xtable(prep4rep(profil$checks))

Using rules that can't be executed

This may happen if the syntax is wrong. Another reason - particularly if re-using rule files across tables - maybe that a particular variable name is not present amongst the column names of the present table. The tool will just ignore it and report a 'failed' execution. Let us simply modify an existing rule as below:

srule1$Variable[25] = "caCO3"
srule1$Rule[25] = "caCO3 >= 0"
profil = datadict_profile(atable, srule1)

Now let us just look at an excerpt of the results table:

xtable(prep4rep(profil$checks[20:30,]))

End of tutorial



c5sire/datacheck documentation built on May 13, 2019, 10:32 a.m.