compare | R Documentation |
Compare versions of a data set by comparing their performance against a
set of rules or other quality indicators. This function takes two or
more data sets and compares the performance of data sets 2, 3, ...
against that of the first data set (default) or against that of the
previous one (by setting how='sequential').
compare(x, ...)
## S4 method for signature 'validator'
compare(x, ..., .list = list(), how = c("to_first", "sequential"))
## S4 method for signature 'indicator'
compare(x, ..., .list = NULL)
x |
An R object |
... |
data frames, comma separated. Names become column names in the output. |
.list |
Optional list of data sets, will be concatenated with |
how |
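how to compare: "to_first" (default) compares each data set to the first one; "sequential" compares each data set to the previous one |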
For validator: An array where each column represents one data set.
The rows count the following attributes:
Number of validations performed
Number of validations that evaluate to NA (unverifiable)
Number of validations that evaluate to a logical (verifiable)
Number of validations that evaluate to TRUE
Number of validations that evaluate to FALSE
Number of extra validations that evaluate to NA (new unverifiable)
Number of validations that still evaluate to NA (still unverifiable)
Number of validations that still evaluate to TRUE
Number of extra validations that evaluate to TRUE
Number of validations that still evaluate to FALSE
Number of extra validations that evaluate to FALSE
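The row attributes listed above can be illustrated with a small base R sketch. It is not the package's implementation: partition_counts is a hypothetical helper, the row names are illustrative, and it assumes that "still" means the same status in both versions while "extra" means the status is new in the current version.

```r
# Hypothetical helper, not part of the package: tallies the row attributes
# for one pair of validation result vectors
# (TRUE = satisfied, FALSE = violated, NA = unverifiable).
partition_counts <- function(prev, curr) {
  stopifnot(is.logical(prev), is.logical(curr), length(prev) == length(curr))
  is_true  <- function(v) !is.na(v) & v
  is_false <- function(v) !is.na(v) & !v
  c(
    validations        = length(curr),
    unverifiable       = sum(is.na(curr)),
    verifiable         = sum(!is.na(curr)),
    satisfied          = sum(is_true(curr)),
    violated           = sum(is_false(curr)),
    new_unverifiable   = sum(is.na(curr) & !is.na(prev)),
    still_unverifiable = sum(is.na(curr) & is.na(prev)),
    still_satisfied    = sum(is_true(curr) & is_true(prev)),
    new_satisfied      = sum(is_true(curr) & !is_true(prev)),
    still_violated     = sum(is_false(curr) & is_false(prev)),
    new_violated       = sum(is_false(curr) & !is_false(prev))
  )
}

# Made-up results for five validations in a previous and a current version
prev <- c(TRUE, FALSE, NA, TRUE, NA)
curr <- c(TRUE, TRUE, TRUE, FALSE, NA)
counts <- partition_counts(prev, curr)
counts
```

Under these assumptions the "still" and "extra" counts of each status add up to the total for that status, which is a quick sanity check on any comparison table.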
For indicator: A list with the following components:
numeric: An array collecting results of scalar indicators (e.g. mean(x)).
nonnumeric: An array collecting results of nonnumeric scalar indicators (e.g. names(which.max(table(x)))).
array: A list of arrays, collecting results of vector indicators (e.g. x/mean(x)).
Suppose we have a current and a previous version of a data set. Both
can be inspected by confronting them with a rule set.
The status changes in rule violations can be partitioned as shown in the
following figure.
This function computes the partition for two or more
data sets, comparing the current set to the first (default) or to the
previous one (by setting how='sequential').
The figure is reproduced from MPJ van der Loo and E. De Jonge (2018) Statistical Data Cleaning with applications in R (John Wiley & Sons).
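The difference between the two comparison modes can be sketched in base R as the choice of reference version for each data set. The helper reference_index below is hypothetical and only illustrates the idea; in particular, treating the first data set as its own reference is an assumption, not documented behaviour.

```r
# Hypothetical helper, not part of the package: for n versions, return the
# position of the version that each one is compared against.
reference_index <- function(n, how = c("to_first", "sequential")) {
  how <- match.arg(how)
  idx <- seq_len(n)
  if (how == "to_first") {
    rep(1L, n)           # every version is compared to the first one
  } else {
    pmax(idx - 1L, 1L)   # every version is compared to its predecessor
  }
}

ref_first <- reference_index(3, "to_first")
ref_seq   <- reference_index(3, "sequential")
```

With three versions, "to_first" compares versions 2 and 3 to version 1, while "sequential" compares version 2 to version 1 and version 3 to version 2.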
Other validation-methods:
aggregate,validation-method
all,validation-method
any,validation-method
barplot,validation-method
check_that()
confront()
event()
names<-,rule,character-method
plot,validation-method
sort,validation-method
summary()
validation-class
values()
Other comparing:
as.data.frame,cellComparison-method
as.data.frame,validatorComparison-method
barplot,cellComparison-method
barplot,validatorComparison-method
cells()
match_cells()
plot,cellComparison-method
plot,validatorComparison-method
data(retailers)
rules <- validator(turnover >=0, staff>=0, other.rev>=0)
# start with raw data
step0 <- retailers
# impute turnovers
step1 <- step0
step1$turnover[is.na(step1$turnover)] <- mean(step1$turnover,na.rm=TRUE)
# flip sign of negative revenues
step2 <- step1
step2$other.rev <- abs(step2$other.rev)
# create an overview of differences, comparing to the previous step
compare(rules, raw = step0, imputed = step1, flipped = step2, how="sequential")
# create an overview of differences compared to raw data
out <- compare(rules, raw = step0, imputed = step1, flipped = step2)
out
# graphical overview
plot(out)
barplot(out)
# transform data to data.frame (easy for use with ggplot)
as.data.frame(out)