Localize errors on records in a data.frame.
Description
For each record in a data.frame, the least (weighted) number of fields is
determined that can be adapted or imputed so that no edit in E
is violated anymore.
Usage
Arguments
E 
an object of class 
dat 
a 
verbose 
print progress to screen? 
weight 
Vector of positive weights for every variable in 
maxduration 
maximum time for 
method 
Should errorLocalizer ("bb") or mixed-integer programming ("mip") be used? 
useBlocks 

retrieve 
Return the first found solution or the best solution? ("bb" method only). 
... 
Further options to be passed to 
Details
For performance purposes, the edits are split into independent blocks, which are processed
separately. Also, a quick vectorized check with checkDatamodel is performed first to
exclude variables violating their one-dimensional bounds from further calculations.
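To make the idea of independent blocks concrete, the following is a minimal sketch using the blocks function from editrules. The object name E2 is chosen here for illustration only. Edits that share no variables end up in separate blocks, so the edit on w is separated from the edits on x, y and z.

```r
library(editrules)

# Five edits: four involve x, y and z; one involves only w.
E2 <- editmatrix(c(
  "x + y == z",
  "x > 0",
  "y > 0",
  "z > 0",
  "w > 10"))

# Split the edits into blocks that share no variables.
b <- blocks(E2)
length(b)  # number of independent blocks
b
```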
By default, all weights are set equal to one (each variable is considered equally reliable). If a vector
of weights is passed, the weights are assumed to be in the same order as the columns of dat. By passing
an array of weights (of the same dimensions as dat), separate weights can be specified for each record.
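As a small sketch of the vector form, the weights below are given in the column order of dat (x, y, z). The values are illustrative assumptions: z gets the lowest weight, so it is treated as the least reliable variable and is the cheapest one to adapt.

```r
library(editrules)

E <- editmatrix("x + y == z")
dat <- data.frame(
  x = c(1, 1),
  y = c(1, 1),
  z = c(1, 2))  # record 1 violates x + y == z, record 2 does not

# One weight per column of dat, in column order: x, y, z.
err <- localizeErrors(E, dat, weight = c(2, 2, 1))

# Logical matrix: TRUE marks the fields to adapt.
err$adapt
```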
In general, the solution to an error localization problem need not be unique, especially when no weights
are defined. In such cases, localizeErrors chooses a solution randomly. See errorLocalizer
for more control options.
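When a reproducible result is needed despite the random choice, one option is to fix the seed before each call. This is a sketch only: it assumes that the random choice among equivalent solutions is drawn from R's own random number generator, which this page does not state.

```r
library(editrules)

E <- editmatrix("x + y == z")
dat <- data.frame(
  x = c(1, 1),
  y = c(1, 1),
  z = c(1, 1))  # at equal weights, adapting x, y or z is equally good

# Assumption: the tie is broken with R's RNG, so fixing the seed
# should give the same choice on repeated runs.
set.seed(42)
first <- localizeErrors(E, dat)$adapt
set.seed(42)
second <- localizeErrors(E, dat)$adapt

identical(first, second)
```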
Error localization can be performed by the branch-and-bound method of De Waal (2003) (option method="bb", the default)
or by rewriting the problem as a mixed-integer programming (MIP) problem (method="mip"), which is passed to
the lpsolve library. The former uses errorLocalizer and is very reliable in terms
of numerical stability, but may be slower in some cases (see the note below). The MIP approach is much faster,
but requires that upper and lower bounds are set on each numerical variable. Sensible bounds are derived
automatically (see the vignette on error localization as MIP), but could cause instabilities in very rare cases.
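The two methods can be compared on a small problem, as in the sketch below. It reuses the edits and data from the first example in the Examples section. Only the second record violates the edits, and adapting z is its only single-field solution, so both methods are expected to flag the same field.

```r
library(editrules)

E <- editmatrix(c(
  "x + y == z",
  "x > 0",
  "y > 0",
  "z > 0"))
dat <- data.frame(
  x = c(1, 1, 1),
  y = c(1, 1, 1),
  z = c(2, 0, 2))

# Branch and bound versus mixed-integer programming.
err_bb  <- localizeErrors(E, dat, method = "bb")
err_mip <- localizeErrors(E, dat, method = "mip")

err_bb$adapt
err_mip$adapt
```

On larger data sets, comparing the timings reported in the status element of both results gives an impression of the speed difference on the problem at hand.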
Value
an object of class errorLocation
Note
As of version 2.8.1, method 'bb' is not available for conditional numeric edits (e.g. if (x>0) y>0)
or conditional edits of mixed type (e.g. if (A=='a') x>0).
References
T. De Waal (2003) Processing of Erroneous and Unsafe Data. PhD thesis, University of Rotterdam.
E. De Jonge and M. Van der Loo (2012) Error localization as a mixed-integer program in editrules (included with the package).
lp_solve and Kjell Konis (2011) lpSolveAPI: R Interface for lp_solve version 5.5.2.0. R package version 5.5.2.0-5. http://CRAN.R-project.org/package=lpSolveAPI
See Also
errorLocalizer
Examples
# an editmatrix and some data:
E <- editmatrix(c(
"x + y == z",
"x > 0",
"y > 0",
"z > 0"))
dat <- data.frame(
x = c(1,1,1),
y = c(1,1,1),
z = c(2,0,2))
# localize all errors in the data
err <- localizeErrors(E,dat)
summary(err)
# what has to be adapted:
err$adapt
# weight, number of equivalent solutions, timings,
err$status
## Not run
# Demonstration of verbose processing
# construct 2-block editmatrix
F <- editmatrix(c(
"x + y == z",
"x > 0",
"y > 0",
"z > 0",
"w > 10"))
# Using 'dat' as defined above, generate some extra records
dd <- dat
for ( i in 1:5 ) dd <- rbind(dd,dd)
dd$w <- sample(12,nrow(dd),replace=TRUE)
# localize errors verbosely
(err <- localizeErrors(F,dd,verbose=TRUE))
# printing is cut off, use summary for an overview
summary(err)
# or plot (not very informative in this artificial example)
plot(err)
## End(Not run)
for ( d in dir("../pkg/R",full.names=TRUE)) dmp <- source(d)
# Example with different weights for each record
E <- editmatrix('x + y == z')
dat <- data.frame(
x = c(1,1),
y = c(1,1),
z = c(1,1))
# At equal weights, both records have three solutions (degeneracy): adapt x, y
# or z:
localizeErrors(E,dat)$status
# Set different weights per record (lower weight means lower reliability):
w <- matrix(c(
1,2,2,
2,2,1),nrow=2,byrow=TRUE)
localizeErrors(E,dat,weight=w)
# an example with categorical variables
E <- editarray(expression(
age %in% c('under aged','adult'),
maritalStatus %in% c('unmarried','married','widowed','divorced'),
positionInHousehold %in% c('marriage partner', 'child', 'other'),
if( age == 'under aged' ) maritalStatus == 'unmarried',
if( maritalStatus %in% c('married','widowed','divorced'))
!positionInHousehold %in% c('marriage partner','child')
)
)
E
#
dat <- data.frame(
age = c('under aged','adult','adult' ),
maritalStatus=c('married','unmarried','widowed' ),
positionInHousehold=c('child','other','marriage partner')
)
dat
localizeErrors(E,dat)
# the last record of dat has 2 degenerate solutions. Running the last command
# a few times demonstrates that one of those solutions is chosen at random.
# Increasing the weight of 'positionInHousehold' for example, makes the best
# solution unique again
localizeErrors(E,dat,weight=c(1,1,2))
# an example with mixed data:
E <- editset(expression(
x + y == z,
2*u + 0.5*v == 3*w,
w >= 0,
if ( x > 0 ) y > 0,
x >= 0,
y >= 0,
z >= 0,
A %in% letters[1:4],
B %in% letters[1:4],
C %in% c(TRUE,FALSE),
D %in% letters[5:8],
if ( A %in% c('a','b') ) y > 0,
if ( A == 'c' ) B %in% letters[1:3],
if ( !C == TRUE) D %in% c('e','f')
))
set.seed(1)
dat <- data.frame(
x = sample(-1:8),
y = sample(-1:8),
z = sample(10),
u = sample(-1:8),
v = sample(-1:8),
w = sample(10),
A = sample(letters[1:4],10,replace=TRUE),
B = sample(letters[1:4],10,replace=TRUE),
C = sample(c(TRUE,FALSE),10,replace=TRUE),
D = sample(letters[5:9],10,replace=TRUE),
stringsAsFactors=FALSE
)
(el <- localizeErrors(E,dat,verbose=TRUE))
