checkDF: Perform sanity checks on a dataframe.

View source: R/merge_utils.R

Description

Perform sanity checks on a dataframe.

This function can be used after performing some data munging to check for mistakes.

Usage

checkDF(data, subset, min_rows, max_rows, min_cols, max_cols, min_cc, max_cc,
  min_uniq, max_uniq, min_na_row, max_na_row, min_na_all, max_na_all, checksum,
  showbadrows = 100, silent = FALSE, stoponfail = FALSE, vars = NULL,
  checks = NULL)

Arguments

data

dataframe to be checked

subset

(optional) logical expression indicating subset of 'data' to check (see subset)

min_rows

(optional) minimum number of rows (compare with dim(data[subset,])[1]).

max_rows

(optional) maximum number of rows (compare with dim(data[subset,])[1]).

min_cols

(optional) minimum number of columns (compare with dim(data[subset,])[2]).

max_cols

(optional) maximum number of columns (compare with dim(data[subset,])[2]).

min_cc

(optional) minimum number of complete cases (compare with sum(complete.cases(data[subset,])))

max_cc

(optional) maximum number of complete cases (compare with sum(complete.cases(data[subset,])))

min_uniq

(optional) minimum number of unique cases (compare with dim(unique(data[subset,]))[1]). Default value is 1.

max_uniq

(optional) maximum number of unique cases (compare with dim(unique(data[subset,]))[1])

min_na_row

(optional) minimum number of missing values in each row

max_na_row

(optional) maximum number of missing values in each row

min_na_all

(optional) minimum number of missing values overall

max_na_all

(optional) maximum number of missing values overall

checksum

(optional) a checksum of the variable as returned by digest(VAR,algo="crc32").

showbadrows

(optional) if a positive integer N then print the first N non-matching rows (only for tests on rows). Default: N = 100.

silent

(optional) if TRUE then don't emit warning messages reporting the type of failure (FALSE by default)

stoponfail

(optional) if TRUE then throw an error on the first check that fails (FALSE by default)

vars

(optional) either a numeric vector of column numbers, or a character vector of regexps matching the names of variables to check

checks

(optional) a list of arguments to be passed to checkVar

Details

You can restrict the checks to a subset of the dataframe by supplying a logical expression in the 'subset' argument. This expression is evaluated in the context of the supplied dataframe (the 'data' argument), so you don't need to qualify the variable names. If all arguments apart from 'data', 'subset', 'silent' and 'stoponfail' are unset/NULL then the function checks whether all rows satisfy the subset logical expression (unless this is unset). The other arguments can be used for checking the number of complete cases (i.e. rows with no missing values), unique cases, missing values, the checksum, and variable-specific checks (see below).
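
As a rough illustration of the quantities these checks compare against, the following base R sketch computes them by hand. It is not the package's implementation; the built-in airquality dataset and the condition Temp > 60 are chosen here only for illustration.

```r
# Base R sketch of the quantities the checks are compared with.
# Assumption: this mirrors the "compare with ..." expressions in the
# Arguments section; it is not code taken from mergeutils.
data <- airquality                      # built-in dataset containing some NAs

# A 'subset' expression is evaluated inside the dataframe, so column
# names can be used unqualified.
keep <- with(data, Temp > 60)
sub  <- data[keep, ]

n_rows <- nrow(sub)                     # min_rows / max_rows
n_cols <- ncol(sub)                     # min_cols / max_cols
n_cc   <- sum(complete.cases(sub))      # min_cc / max_cc
n_uniq <- nrow(unique(sub))             # min_uniq / max_uniq
na_row <- rowSums(is.na(sub))           # min_na_row / max_na_row (one value per row)
na_all <- sum(is.na(sub))               # min_na_all / max_na_all

# With no other checks supplied, the question is simply whether every
# row satisfies the subset expression; failing rows can be listed by index.
all_match <- all(keep)
bad_rows  <- which(!keep)
```

Here bad_rows plays the role of the non-matching row indices that a row-based test would report.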

For arguments with names beginning with 'min_' or 'max_' you can supply either a whole number indicating a number of rows/columns, or a number between 0 and 1 indicating a proportion of rows/columns. For 'min_rows' and 'max_rows', proportions are interpreted as proportions of the whole data (before subsetting), whereas for the other arguments proportions are interpreted as proportions of the subsetted data.
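
The difference between the two reference sizes can be sketched in base R as follows. The helper as_count is hypothetical (it is not part of mergeutils), and it assumes that only values strictly between 0 and 1 are treated as proportions; how the package itself handles the boundary values 0 and 1, or rounding, is not stated here.

```r
# Hypothetical helper, not part of mergeutils: convert a threshold to a count.
# Assumption: values strictly between 0 and 1 are proportions of 'n';
# any other value is taken as an absolute count.
as_count <- function(x, n) {
  if (x > 0 && x < 1) x * n else x
}

data <- as.data.frame(ChickWeight)      # plain data.frame copy of a built-in dataset
keep <- data$Time == 0                  # illustrative subset
n_all <- nrow(data)                     # size of the whole data
n_sub <- sum(keep)                      # size of the subsetted data

# min_rows / max_rows: a proportion refers to the whole data.
min_rows_count <- as_count(0.05, n_all)

# Other min_/max_ arguments: a proportion refers to the subsetted data.
min_cc_count <- as_count(0.9, n_sub)

# Whole numbers are used as given, whatever the reference size.
abs_count <- as_count(10, n_sub)
```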

To perform variable-specific checks, use the 'vars' argument to specify which variables to check. 'vars' can be either a numeric vector of column numbers, or a character vector of regexps matching column names. The matching columns will be checked individually by the checkVar function. To specify which checks to perform, supply a list of arguments for checkVar in the 'checks' argument. You do not need to include the data or var arguments in this list. For example, to check that all variables with names matching "country" or "name" have type "character" and between 10 and 300 unique values, you could do:

checkDF(data,vars=c("country","name"),checks=list(type="character",min_uniq=10,max_uniq=300))

To ensure that a regexp matches only a single variable, put a ^ at the front and a $ at the end (e.g. "^country$"). Note: the values of the 'silent' and 'stoponfail' args will be passed on to checkVar by default, but you can override these values by passing new values in the 'checks' arg.
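
The effect of anchoring can be seen with base R's grep. This sketch assumes that column-name matching behaves like an ordinary unanchored regexp search; the column names are made up for illustration.

```r
# Hypothetical column names, used only to show regexp matching.
cols <- c("country", "country_code", "name", "surname")

# Unanchored patterns match any name that contains them.
loose_name    <- grep("name", cols, value = TRUE)      # "name" "surname"
loose_country <- grep("country", cols, value = TRUE)   # "country" "country_code"

# Anchoring with ^ and $ restricts the match to the exact name.
exact_country <- grep("^country$", cols, value = TRUE) # "country"
```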

By default a warning message will be issued when a check fails. This can be prevented by setting 'silent' to TRUE. If the 'stoponfail' argument is set to TRUE then an error will be thrown on the first check that fails, otherwise the return value of the function indicates whether all checks passed (TRUE) or not (FALSE).

Value

A list whose first element is TRUE if all checks passed, FALSE otherwise, and whose subsequent elements are vectors of indices of non-matching rows for tests on rows.

Author(s)

Ben Veal

See Also

checkVar

Examples

checkDF(ChickWeight,weight>Time)
checkDF(ChickWeight,min_uniq=10)

vapniks/mergeutils documentation built on May 3, 2019, 4:33 p.m.