Create a backtracker object for error localization

```
errorLocalizer(E, x, ...)

## S3 method for class 'editset'
errorLocalizer(E, x, ...)

## S3 method for class 'editmatrix'
errorLocalizer(E, x, weight = rep(1, length(x)),
  maxadapt = length(x), maxweight = sum(weight), maxduration = 600,
  tol = sqrt(.Machine$double.eps), ...)

## S3 method for class 'editarray'
errorLocalizer(E, x, weight = rep(1, length(x)),
  maxadapt = length(x), maxweight = sum(weight), maxduration = 600, ...)

## S3 method for class 'editlist'
errorLocalizer(E, x, weight = rep(1, length(x)),
  maxadapt = length(x), maxweight = sum(weight), maxduration = 600, ...)
```

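| Argument | Description |
| --- | --- |
| `E` | an `editset`, `editmatrix`, `editarray`, or `editlist` |
| `x` | a named numerical vector (for an `editmatrix`) or a named character vector (for an `editarray`): the record in which errors are localized |
| `...` | arguments to be passed to other methods (e.g. reliability weights) |
| `weight` | a vector of reliability weights, one per element of `x` (default: all 1) |
| `maxadapt` | maximum number of variables to adapt |
| `maxweight` | maximum weight of a solution; if weights are not given, this equals the maximum number of variables to adapt |
| `maxduration` | maximum time (in seconds) for the search |
| `tol` | numerical tolerance (used by the `editmatrix` method only) |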

An object of class `backtracker`. Each execution of `$searchNext()` yields a solution in the form of a `list` (see details). Executing `$searchBest()` returns the lowest-weight solution. When multiple solutions with the same weight are found, `$searchBest()` picks one at random.

Generate a `backtracker` object for error localization in numerical, categorical, or mixed data. This function generates the workhorse program that is called by `localizeErrors` with `method="localizer"`.

The returned `backtracker` can be used to run a branch-and-bound (B&B) algorithm that finds the least (weighted) number of variables in `x` that need to be adapted so that all restrictions in `E` can be satisfied (the generalized principle of Fellegi and Holt (1976)).

The B&B tree is set up so that in one branch a variable is assumed correct and its value is substituted in `E`, while in the other branch the variable is assumed incorrect and eliminated from `E`. See De Waal (2003), chapter 8, or De Waal, Pannekoek and Scholtus (2011) for a concise description of the B&B algorithm.
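To make the minimization target concrete, here is a minimal base-R sketch that finds the lowest-weight sets of variables to adapt by brute-force enumeration rather than branch-and-bound. It is not the implementation used by this package: the helper `fh_bruteforce` is hypothetical, and it assumes a single linear equality edit `sum(a * x) == b`, for which freeing any variable with a non-zero coefficient is enough to make the edit satisfiable.

```r
# Hypothetical helper (not part of the package): brute-force search for the
# minimum-weight sets of variables to adapt, for ONE equality edit
# sum(a * x) == b.
fh_bruteforce <- function(a, b, x, weight = rep(1, length(x)),
                          tol = sqrt(.Machine$double.eps)) {
  # Each row is one candidate set; 1 means "adapt this variable".
  candidates <- as.matrix(expand.grid(rep(list(c(0, 1)), length(x))))
  colnames(candidates) <- names(x)

  # Does the record already satisfy the edit?
  satisfied <- abs(sum(a * x) - b) <= tol

  # A candidate set makes the edit satisfiable if it frees at least one
  # variable that actually occurs in the edit.
  frees_edit <- drop(candidates %*% as.numeric(abs(a) > tol)) > 0
  feasible <- satisfied | frees_edit

  # Weighted number of adapted variables per candidate set.
  w <- drop(candidates %*% weight)
  w_min <- min(w[feasible])
  best <- feasible & w == w_min

  list(w = w_min, adapt = candidates[best, , drop = FALSE] == 1)
}

# The edit p + c == t written as p + c - t == 0.
a <- c(p = 1, c = 1, t = -1)
x <- c(p = 755, c = 125, t = 200)

res_equal    <- fh_bruteforce(a, 0, x)                       # equal weights
res_weighted <- fh_bruteforce(a, 0, x, weight = c(1, 1, 2))  # t more reliable
res_weighted$adapt
```

With equal weights any single variable can be adapted; giving `t` a higher weight removes it from the set of minimal solutions. Enumeration grows as 2^n in the number of variables, which is why a pruned tree search is used in practice.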

Every call to `<backtracker>$searchNext()` returns one solution `list`, consisting of

- `w`: the solution weight.
- `adapt`: a `logical` vector indicating, for each variable, whether it should be adapted (`TRUE`) or not (`FALSE`).

Every subsequent call returns either `NULL` or a new solution. `NULL` means that either all solutions have been found or `maxduration` was exceeded; the property `<backtracker>$maxdurationExceeded` indicates whether the latter is the case. Otherwise, the new solution has a weight `w` no higher than that of the previously found solution.
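The calling pattern described above can be sketched as a loop that collects solutions until `$searchNext()` returns `NULL`. The sketch assumes the `editrules` package is installed; the variable names `bt`, `solutions` and `weights`, and the choice of `maxduration = 10`, are illustrative only.

```r
library(editrules)

E  <- editmatrix("p + c == t")
bt <- errorLocalizer(E, x = c(p = 755, c = 125, t = 200), maxduration = 10)

# Collect solutions until the backtracker signals that it is done.
solutions <- list()
while (!is.null(sol <- bt$searchNext())) {
  solutions[[length(solutions) + 1]] <- sol
}

# NULL is returned both when the search is exhausted and when it timed out,
# so check which of the two happened.
if (isTRUE(bt$maxdurationExceeded)) {
  message("Search stopped because maxduration was exceeded.")
}

# Solution weights, in the order in which the solutions were found.
weights <- vapply(solutions, function(s) s$w, numeric(1))

# The last solution found has the lowest weight seen so far.
if (length(solutions) > 0) solutions[[length(solutions)]]$adapt
```

Because later solutions never weigh more than earlier ones, the last element of `solutions` is a lowest-weight solution whenever the search ran to completion.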

Alternatively, `<backtracker>$searchBest()` will return the best solution found within `maxduration` seconds. If multiple equivalent solutions are found, a random one is returned.

The backtracker is prepared such that missing values in the input record `x` are already marked for adaptation, and the corresponding variables have already been eliminated.

The backtracker will crash when `E` is an `editarray` and one or more values in `x` are not in the datamodel specified by `E`. The more user-friendly function `localizeErrors` circumvents this. See also `checkDatamodel`.
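As an illustration of the kind of check meant here, the base-R sketch below tests whether every value of a categorical record lies within a set of allowed categories. It does not use `checkDatamodel`; the helper `outside_datamodel` and the plain-list representation of the datamodel are assumptions made for this example, with categories copied from the categorical example on this page. Treating `NA` as acceptable is also an assumption, based on the remark that missing values are marked for adaptation anyway.

```r
# Hypothetical stand-in for a datamodel: allowed categories per variable.
datamodel <- list(
  age                 = c("under aged", "adult"),
  maritalStatus       = c("unmarried", "married", "widowed", "divorced"),
  positionInHousehold = c("marriage partner", "child", "other")
)

# Hypothetical helper: names of the variables whose value is not an allowed
# category. NA is treated as acceptable (assumption).
outside_datamodel <- function(record, datamodel) {
  vars <- intersect(names(record), names(datamodel))
  ok <- vapply(
    vars,
    function(v) is.na(record[[v]]) || record[[v]] %in% datamodel[[v]],
    logical(1)
  )
  vars[!ok]
}

r_ok  <- c(age = "under aged", maritalStatus = "married",
           positionInHousehold = "child")
r_bad <- c(age = "toddler", maritalStatus = "married",
           positionInHousehold = "child")

outside_datamodel(r_ok, datamodel)   # no offending variables
outside_datamodel(r_bad, datamodel)  # "age" is not an allowed category
```

A record that passes such a check can violate the edits in `E` but still lies within the datamodel, which is the situation the backtracker expects.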

For records with a large numerical range (*e.g.* 1-1E9), the error locations represent solutions that will allow repairing the record to within roundoff errors. We highly recommend that you round near-zero values (for example, everything `<= sqrt(.Machine$double.eps)`) and scale a record with values larger than or equal to 1E9 with a constant factor.
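One way to apply this recommendation is sketched below in base R. The helper `precondition_record` is hypothetical; it assumes that "near-zero" is meant in absolute value and that dividing by the largest absolute value is an acceptable constant factor. Note that any edits with constant terms would have to be rescaled by the same factor to remain equivalent.

```r
# Hypothetical helper: round near-zero values to zero and rescale records
# that contain very large values.
precondition_record <- function(x, tol = sqrt(.Machine$double.eps),
                                big = 1e9) {
  # Round near-zero values (in absolute value) to exactly zero.
  x[!is.na(x) & abs(x) <= tol] <- 0

  # Largest absolute value among the non-missing entries.
  vals <- abs(x[!is.na(x)])
  largest <- if (length(vals) > 0) max(vals) else 0

  # Scale only when the record contains values of at least `big`.
  scale <- if (largest >= big) largest else 1

  list(x = x / scale, scale = scale)
}

pre <- precondition_record(c(p = 2e9, c = 1e-12, t = 2e9))
pre$x      # rescaled record, near-zero value set to 0
pre$scale  # factor to multiply by when mapping repaired values back
```

Keeping the factor in the result makes it possible to transform repaired values back to the original units afterwards.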

This method is potentially very slow for objects of class `editset` that contain many conditional restrictions. Consider using `localizeErrors` with the option `method="mip"` in such cases.

Fellegi, I.P. and Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association 71, pp. 17-25.

De Waal, T. (2003). Processing of Erroneous and Unsafe Data. PhD thesis, Erasmus Research Institute of Management, Erasmus University Rotterdam. http://www.cbs.nl/nl-NL/menu/methoden/onderzoek-methoden/onderzoeksrapporten/proefschriften/2008-proefschrift-de-waal.htm

De Waal, T., Pannekoek, J. and Scholtus, S. (2011). Handbook of Statistical Data Editing and Imputation. Wiley Handbooks on Survey Methodology.

`errorLocalizer_mip`, `localizeErrors`, `checkDatamodel`, `violatedEdits`

```
library(editrules)

#### Examples with numerical edits
# Example with a single edit rule
# p = profit, c = cost, t = turnover
E <- editmatrix(c("p + c == t"))
cp <- errorLocalizer(E, x = c(p = 755, c = 125, t = 200))
# x obviously violates E. With all weights equal, changing any variable will do.
# first solution:
cp$searchNext()
# second solution:
cp$searchNext()
# third solution:
cp$searchNext()
# There are no more solutions, since changing more variables would increase
# the weight, so the result of the next statement is NULL:
cp$searchNext()

# Increasing the reliability weight of turnover yields 2 solutions:
cp <- errorLocalizer(E, x = c(p = 755, c = 125, t = 200), weight = c(1, 1, 2))
# first solution:
cp$searchNext()
# second solution:
cp$searchNext()
# no more solutions available:
cp$searchNext()

# A case with two restrictions. The second restriction demands that
# c/t >= 0.6 (cost should be at least 60% of turnover)
E <- editmatrix(c(
  "p + c == t",
  "c - 0.6*t >= 0"))
cp <- errorLocalizer(E, x = c(p = 755, c = 125, t = 200))
# Now there is only one solution, but we need two runs to find it (the first
# one has a higher weight)
cp$searchNext()
cp$searchNext()
# With the searchBest() function, the lowest-weight solution is found at once:
errorLocalizer(E, x = c(p = 755, c = 125, t = 200))$searchBest()

# An example with missing data.
E <- editmatrix(c(
  "p + c1 + c2 == t",
  "c1 - 0.3*t >= 0",
  "p > 0",
  "c1 > 0",
  "c2 > 0",
  "t > 0"))
cp <- errorLocalizer(E, x = c(p = 755, c1 = 50, c2 = NA, t = 200))
# (Note that the second edit, c1 - 0.3*t >= 0, is violated.)
# There are two solutions. Both demand that c2 is adapted:
cp$searchNext()
cp$searchNext()

#### Examples with categorical edits
#
# 3 variables, recording age class, position in household, and marital status.
# We define the datamodel and the rules:
E <- editarray(expression(
  age %in% c('under aged', 'adult'),
  maritalStatus %in% c('unmarried', 'married', 'widowed', 'divorced'),
  positionInHousehold %in% c('marriage partner', 'child', 'other'),
  if (age == 'under aged')
    maritalStatus == 'unmarried',
  if (maritalStatus %in% c('married', 'widowed', 'divorced'))
    !positionInHousehold %in% c('marriage partner', 'child')
  )
)
E
# Let's define a record with an obvious error:
r <- c(
  age = 'under aged',
  maritalStatus = 'married',
  positionInHousehold = 'child')
# The age class and position in household are consistent, while the marital
# status conflicts. Therefore, changing only the marital status (instead of
# both age class and position in household) seems reasonable.
el <- errorLocalizer(E, r)
el$searchNext()
```

