Perform sanity checks on a dataframe.
This function can be used after performing some data munging to check for mistakes.
data |
dataframe to be checked |
subset |
(optional) logical expression indicating subset of 'data' to check (see |
min_rows |
(optional) minimum number of rows (compare with dim(data[subset,])[1]). |
max_rows |
(optional) maximum number of rows (compare with dim(data[subset,])[1]). |
min_cols |
(optional) minimum number of columns (compare with dim(data[subset,])[2]). |
max_cols |
(optional) maximum number of columns (compare with dim(data[subset,])[2]). |
min_cc |
(optional) minimum number of complete cases (compare with sum(complete.cases(data[subset,]))) |
max_cc |
(optional) maximum number of complete cases (compare with sum(complete.cases(data[subset,]))) |
min_uniq |
(optional) minimum number of unique cases (compare with dim(unique(data[subset,]))[1]). Default value is 1. |
max_uniq |
(optional) maximum number of unique cases (compare with dim(unique(data[subset,]))[1]) |
min_na_row |
(optional) minimum number of missing values in each row |
max_na_row |
(optional) maximum number of missing values in each row |
min_na_all |
(optional) minimum number of missing values overall |
max_na_all |
(optional) maximum number of missing values overall |
checksum |
(optional) a checksum of the variable as returned by digest(VAR,algo="crc32"). |
showbadrows |
(optional) if a positive integer N then print the first N non-matching rows (only for tests on rows). Default: N = 100. |
silent |
(optional) if TRUE then don't emit warning messages informing of error type (FALSE by default) |
stoponfail |
(optional) if TRUE then throw an error on the first check that fails (FALSE by default) |
vars |
(optional) either a numeric or character vector, or a regexp matching names of variables to check |
checks |
(optional) a list of arguments to be passed to |
You can restrict the checks to a subset of the dataframe by supplying a logical expression in the 'subset' argument. This expression will be evaluated in the context of the supplied dataframe (the 'data' argument), so you don't need to qualify the variable names. If all arguments apart from 'data', 'subset', 'silent' and 'stoponfail' are unset/NULL then the function will check if all rows satisfy the subset logical expression (unless this is unset). The other arguments can be used for checking the number of complete cases (i.e. rows with no missing values), unique cases, missing values, checksum, and variable-specific checks (see below).
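The evaluation of a 'subset' expression inside the dataframe, and the quantities that the count arguments are compared against, can be sketched in base R. This is only an illustration on the built-in ChickWeight data: the helper name rows_matching is hypothetical, the treatment of NA results is an assumption, and none of this is taken from checkDF's actual implementation.

```r
# Hypothetical helper: evaluate an unquoted logical expression using the
# columns of 'data', so variable names need no 'data$' prefix.
rows_matching <- function(data, subset) {
  expr <- substitute(subset)
  keep <- eval(expr, data, parent.frame())
  stopifnot(is.logical(keep), length(keep) == nrow(data))
  keep & !is.na(keep)  # assumption: NA results count as non-matching
}

cw   <- as.data.frame(ChickWeight)
keep <- rows_matching(cw, weight > Time)
bad_rows <- which(!keep)              # indices of rows failing the expression

sub <- cw[keep, ]
n_rows <- dim(sub)[1]                 # compared with min_rows / max_rows
n_cols <- dim(sub)[2]                 # compared with min_cols / max_cols
n_cc   <- sum(complete.cases(sub))    # compared with min_cc / max_cc
n_uniq <- dim(unique(sub))[1]         # compared with min_uniq / max_uniq
na_row <- rowSums(is.na(sub))         # per-row NA counts (min_na_row / max_na_row)
na_all <- sum(is.na(sub))             # overall NA count (min_na_all / max_na_all)
```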
For arguments with names beginning with 'min_' or 'max_' you can supply either a whole number indicating an amount of rows/columns, or a number between 0 & 1 indicating a proportion of rows/columns. For 'min_rows' & 'max_rows' proportions are interpreted as proportions of the whole data (before subsetting), whereas for other arguments proportions are interpreted as proportions of the subsetted data.
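A minimal sketch of how such count-or-proportion values might be resolved is shown here. The helper resolve_limit is hypothetical, and it assumes that only values strictly between 0 and 1 are treated as proportions; how checkDF itself handles boundary values such as exactly 1 is not stated above.

```r
# Hypothetical helper: turn a 'min_'/'max_' value into an absolute count.
# Assumption: values strictly between 0 and 1 are proportions of 'total';
# anything else is taken as a whole-number count.
resolve_limit <- function(value, total) {
  if (value > 0 && value < 1) value * total else value
}

cw   <- as.data.frame(ChickWeight)
keep <- cw$Time > 10
sub  <- cw[keep, ]

# Row limits: proportions refer to the whole data (before subsetting)
min_rows_abs <- resolve_limit(0.25, nrow(cw))
# Other limits: proportions refer to the subsetted data
min_cc_abs <- resolve_limit(0.9, nrow(sub))

rows_ok <- nrow(sub) >= min_rows_abs
cc_ok   <- sum(complete.cases(sub)) >= min_cc_abs
```

The only difference between the two calls is the denominator, which mirrors the distinction drawn between 'min_rows'/'max_rows' and the other arguments.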
To perform variable-specific checks use the 'vars' argument to specify which variables to check. 'vars' can be either a numeric vector of column numbers, or a character vector of regexps matching column names. The matching columns will be individually checked by the checkVar function. To specify which checks to perform supply a list of arguments for checkVar in the 'checks' argument. You do not need to include the data or var arguments in this list. For example, to check that all variables with names matching "country" or "name" have type "character" and between 10 & 300 unique values you could do:
checkDF(data,vars=c("country","name"),checks=list(type="character",min_uniq=10,max_uniq=300))
To ensure that a regexp matches only a single variable put a ^ at the front and $ at the end (e.g. "^country$").
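The effect of anchoring can be seen with base R's grep on a set of column names. The data frame and its column names below are made up for illustration, and the matching is done with grep directly, not through checkDF.

```r
# Illustrative data frame; the column names are invented for this example
df <- data.frame(country      = c("FR", "DE"),
                 country_code = c(33L, 49L),
                 name         = c("Ann", "Bob"),
                 surname      = c("Lee", "Ray"),
                 stringsAsFactors = FALSE)

# Helper for this example: names matched by any of several patterns
match_names <- function(patterns, nms) {
  unique(unlist(lapply(patterns, grep, x = nms, value = TRUE)))
}

# Unanchored patterns match every name that contains the pattern
loose <- match_names(c("country", "name"), names(df))
# Anchored patterns match only the exact names
strict <- match_names(c("^country$", "^name$"), names(df))

# A type check on the strictly matched columns only
all_character <- all(vapply(df[strict], is.character, logical(1)))
```

Here the unanchored patterns also pick up country_code and surname, while the anchored ones select only country and name.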
Note: the values of the 'silent' and 'stoponfail' args will be passed on to checkVar by default, but you can override these values by passing new values in the 'checks' arg.
By default a warning message will be issued when a check fails. This can be prevented by setting 'silent' to TRUE. If the 'stoponfail' argument is set to TRUE then an error will be thrown on the first check that fails, otherwise the return value of the function indicates whether all checks passed (TRUE) or not (FALSE).
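The three reporting modes described above (warn, stay silent, or stop) can be sketched with a small base R helper. The function report_check is hypothetical, and letting 'stoponfail' take precedence over 'silent' is an assumption made for this sketch, not documented behaviour of checkDF.

```r
# Hypothetical helper showing the reporting modes for a single check
report_check <- function(passed, msg, silent = FALSE, stoponfail = FALSE) {
  if (!passed) {
    if (stoponfail) stop(msg, call. = FALSE)     # abort on first failure
    if (!silent) warning(msg, call. = FALSE)     # default: warn and continue
  }
  passed                                         # TRUE if the check passed
}

cw <- as.data.frame(ChickWeight)
few_rows <- nrow(cw) <= 10   # a deliberately failing check

# Silent mode: the failure is only visible in the return value
r1 <- report_check(few_rows, "too many rows", silent = TRUE)

# stoponfail mode: the failure becomes an error, caught here for inspection
r2 <- tryCatch(report_check(few_rows, "too many rows", stoponfail = TRUE),
               error = function(e) conditionMessage(e))

# Default mode: the failure raises a warning, caught here for inspection
r3 <- tryCatch(report_check(few_rows, "too many rows"),
               warning = function(w) conditionMessage(w))
```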
A list whose first element is TRUE if all checks passed, FALSE otherwise, and whose subsequent elements are vectors of indices of non-matching rows for tests on rows.
Ben Veal
checkDF(ChickWeight,weight>Time)
checkDF(ChickWeight,min_uniq=10)