This algorithm tries to detect and repair records that violate linear equality constraints by correcting simple typos, as described in Scholtus (2009). The implementation of the detection of typing errors differs in that it uses the restricted Damerau-Levenshtein distance. Furthermore, it solves a broader class of problems: the original paper describes equalities of the form Ex = 0 (balance edits), whereas this implementation allows for Ex = a.
correctTypos(E, dat, ...)
## S3 method for class 'editset'
correctTypos(E, dat, ...)
## S3 method for class 'editmatrix'
correctTypos(E, dat, fixate = NULL, cost = c(1, 1, 1, 1),
  eps = sqrt(.Machine$double.eps), maxdist = 1, ...)

E
an editmatrix or editset object containing the edit rules.
dat
data.frame with the records to be checked and corrected.
... 
arguments to be passed to other methods. 
fixate
character vector with the names of variables that must not be changed.
cost
numeric vector of length 4 giving the cost of a deletion, insertion, substitution or transposition.
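eps
numeric tolerance used when checking the equality edits. The default is sqrt(.Machine$double.eps); a larger value (e.g. 2) allows for rounding errors.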
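maxdist
numeric, the maximum typographical edit distance a correction suggestion may have from the original value. The default is 1.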
For each row in dat, the correction algorithm first detects whether row x
violates the equality constraints of E,
taking possible rounding errors into account.
Mathematically, the row satisfies the constraints when
\left| \sum_{i=1}^{n} E_{ji} x_i - a_j \right| \leq \varepsilon, \quad \forall j
It then generates correction suggestions by deriving alternative values for variables that are involved only in the violated edits. A suggestion is selected only if it lies within the maximum typographical edit distance (default = 1) of the original value. If more than one solution is possible, the algorithm tries to derive a partial solution; otherwise the solution is applied to the data.
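The distance test on a correction suggestion can be illustrated with a small base R sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance. The helper osa_distance is hypothetical and is not the function used inside the package; it also assumes that the four costs are ordered as deletion, insertion, substitution, transposition.

```r
# Hypothetical helper, not part of deducorrect: restricted
# Damerau-Levenshtein (optimal string alignment) distance.
# Assumed cost order: deletion, insertion, substitution, transposition.
osa_distance <- function(a, b, cost = c(1, 1, 1, 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * cost[1]  # delete all i characters
  d[1, ] <- (0:m) * cost[2]  # insert all j characters
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      substitution <- if (x[i] == y[j]) 0 else cost[3]
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + cost[1],   # deletion
        d[i + 1, j] + cost[2],   # insertion
        d[i, j] + substitution   # match or substitution
      )
      # transposition of two adjacent characters
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + cost[4])
      }
    }
  }
  d[n + 1, m + 1]
}

# Values taken from the Scholtus (2009) example data:
# x4 was recorded as 161 while the edit x2 == x4 suggests 116
osa_distance("161", "116")     # one transposition
# x9 was recorded as 19979 while x3 + x8 == x9 suggests 1979
osa_distance("19979", "1979")  # one deletion
```

Both suggestions lie at distance 1 from the recorded value, so they would pass a maxdist = 1 threshold in this sketch.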
correctTypos
returns an object of class deducorrect
describing the status of each record and the corrections that have been applied.
Inequalities in editmatrix E
are ignored by this algorithm. If E contains inequalities, the corrected records
are valid according to the equality restrictions, but may still violate the inequalities.
Please note that if the returned status of a record is "partial", the corrected record is still not valid.
The partially corrected record will contain fewer errors and will violate fewer constraints.
Also note that the statuses "valid" and "corrected" have to be interpreted in combination with eps.
A common scenario is to correct for typos first and then correct for rounding errors. This means that in the first
step the algorithm should tolerate rounding errors (e.g. eps = 2). A record returned as "valid" may therefore still contain
rounding errors.
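How the tolerance changes the meaning of "valid" can be seen in a minimal base R sketch of the equality check. The helper satisfies_edits is hypothetical and only mirrors the inequality given above; it is not the package's internal check.

```r
# Hypothetical helper, not part of deducorrect:
# TRUE when |E x - a| <= eps holds for every edit.
satisfies_edits <- function(E, a, x, eps = sqrt(.Machine$double.eps)) {
  all(abs(as.vector(E %*% x) - a) <= eps)
}

# A single edit x1 + x2 == x3, written as x1 + x2 - x3 == 0
E <- matrix(c(1, 1, -1), nrow = 1)
a <- 0
x <- c(12, 30, 43)  # off by 1: plausibly a rounding error, not a typo

satisfies_edits(E, a, x)           # default tolerance: the record violates the edit
satisfies_edits(E, a, x, eps = 2)  # eps = 2: the same record counts as valid
```

With eps = 2 the record passes the check even though the balance is off by one unit, which is why a separate rounding correction step is still needed afterwards.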
deducorrect
object with corrected data.frame, applied corrections and status of the records.
Scholtus S (2009). Automatic correction of simple typing errors in numerical data with balance edits. Discussion paper 09046, Statistics Netherlands, The Hague/Heerlen.
Damerau F (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171-176.
Levenshtein VI (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10, 707-710.
A good description of the restricted DL distance can be found on Wikipedia: http://en.wikipedia.org/wiki/Damerau
library(editrules)
library(deducorrect)  # provides correctTypos
# example from section 4 in Scholtus (2009)
E <- editmatrix(c("x1 + x2 == x3"
                , "x2 == x4"
                , "x5 + x6 + x7 == x8"
                , "x3 + x8 == x9"
                , "x9 - x10 == x11"
                )
)
dat <- read.csv(txt <- textConnection(
" , x1, x2 , x3 , x4 , x5 , x6, x7, x8 , x9 , x10 , x11
4 , 1452, 116, 1568, 116, 323, 76, 12, 411, 1979, 1842, 137
4.1, 1452, 116, 1568, 161, 323, 76, 12, 411, 1979, 1842, 137
4.2, 1452, 116, 1568, 161, 323, 76, 12, 411, 19979, 1842, 137
4.3, 1452, 116, 1568, 161, 0, 0, 0, 411, 19979, 1842, 137
4.4, 1452, 116, 1568, 161, 323, 76, 12, 0, 19979, 1842, 137"
))
close(txt)
(cor <- correctTypos(E, dat))
# example with editset
E <- editset(expression(
x + y == z,
x >= 0,
y > 0,
y < 2,
z > 1,
z < 3,
A %in% c('a','b'),
B %in% c('c','d'),
if ( A == 'a' ) B == 'b',
if ( B == 'b' ) x > 3
))
x <- data.frame(
x = 10,
y = 1,
z = 2,
A = 'a',
B = 'b'
)
correctTypos(E,x)
