errorLocalizer: Create a backtracker object for error localization


Description

Create a backtracker object for error localization

Usage

errorLocalizer(E, x, ...)

## S3 method for class 'editset'
errorLocalizer(E, x, ...)

## S3 method for class 'editmatrix'
errorLocalizer(E, x, weight = rep(1, length(x)),
  maxadapt = length(x), maxweight = sum(weight), maxduration = 600,
  tol = sqrt(.Machine$double.eps), ...)

## S3 method for class 'editarray'
errorLocalizer(E, x, weight = rep(1, length(x)),
  maxadapt = length(x), maxweight = sum(weight), maxduration = 600, ...)

## S3 method for class 'editlist'
errorLocalizer(E, x, weight = rep(1, length(x)),
  maxadapt = length(x), maxweight = sum(weight), maxduration = 600, ...)

Arguments

E

an editmatrix, editarray, editlist or editset

x

a named numerical vector or list (if E is an editmatrix), a named character vector or list (if E is an editarray), or a named list if E is an editlist or editset. This is the record for which errors will be localized.

...

Arguments to be passed to other methods (e.g. reliability weights)

weight

a positive weight vector of length(x). The weights are assumed to be in the same order as the variables in x.

maxadapt

maximum number of variables to adapt

maxweight

maximum weight of a solution. If weights are not given, this equals the maximum number of variables to adapt.

maxduration

maximum time (in seconds) for $searchNext() and $searchAll(). It does not apply to $searchBest(); use $searchBest(maxduration=<duration>) instead.

tol

tolerance passed to isObviouslyInfeasible (used to check for bound conditions).

Value

An object of class backtracker. Each execution of $searchNext() yields a solution in the form of a list (see Details). Executing $searchBest() returns the lowest-weight solution. When multiple solutions with the same weight are found, $searchBest() picks one at random.

Details

Generate a backtracker object for error localization in numerical, categorical, or mixed data. This function generates the workhorse program called by localizeErrors with method="localizer".

The returned backtracker can be used to run a branch-and-bound algorithm that finds the least (weighted) number of variables in x that need to be adapted so that all restrictions in E can be satisfied (the generalized principle of Fellegi and Holt (1976)).

The branch-and-bound (B&B) tree is set up so that in one branch a variable is assumed correct and its value is substituted in E, while in the other branch the variable is assumed incorrect and eliminated from E. See De Waal (2003), chapter 8, or De Waal, Pannekoek and Scholtus (2011) for a concise description of the B&B algorithm.

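Every call to <backtracker>$searchNext() returns one solution list, consisting of w, the solution weight, and adapt, a logical indicating for each variable whether it should be adapted (TRUE) or not.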

Every subsequent call either returns NULL, in which case all solutions have been found or maxduration was exceeded (the property <backtracker>$maxdurationExceeded indicates which), or returns a new solution with a weight w not higher than the weight of the last found solution.

Alternatively, <backtracker>$searchBest() will return the best solution found within maxduration seconds. If multiple equivalent solutions are found, a random one is returned.
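The search loop described above can be sketched as follows. The helper name collect_solutions and the names of its return fields are illustrative assumptions, not part of editrules; only $searchNext(), $searchBest(maxduration=) and $maxdurationExceeded are taken from this page. The requireNamespace() guard merely keeps the sketch runnable where the package is not installed.

```r
# Hypothetical helper (not part of editrules): call $searchNext() until it
# returns NULL and report whether the time limit was the reason for stopping.
collect_solutions <- function(bt) {
  solutions <- list()
  repeat {
    s <- bt$searchNext()
    if (is.null(s)) break  # all solutions found, or maxduration exceeded
    solutions[[length(solutions) + 1]] <- s
  }
  list(solutions = solutions, timed_out = isTRUE(bt$maxdurationExceeded))
}

if (requireNamespace("editrules", quietly = TRUE)) {
  library(editrules)
  E <- editmatrix(c("p + c == t"))
  x <- c(p = 755, c = 125, t = 200)

  res <- collect_solutions(errorLocalizer(E, x))
  print(length(res$solutions))  # number of solutions found
  print(res$timed_out)          # TRUE if the search stopped on maxduration

  # Or ask directly for a lowest-weight solution. The 10 second limit is an
  # arbitrary choice for this sketch.
  print(errorLocalizer(E, x)$searchBest(maxduration = 10))
}
```

Checking timed_out matters because a NULL from $searchNext() alone does not say whether the list of solutions is complete.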

The backtracker is prepared such that missing data in the input record x is already set to adapt, and missing variables have been eliminated already.

The backtracker will crash when E is an editarray and one or more values are not in the datamodel specified by E. The more user-friendly function localizeErrors circumvents this. See also checkDatamodel.

Numerical stability issues

For records with a large numerical range (e.g. 1 to 1E9), the error locations represent solutions that will allow repairing the record to within roundoff errors. We highly recommend that you round near-zero values to zero (for example, everything <= sqrt(.Machine$double.eps)) and scale a record with values larger than or equal to 1E9 by a constant factor.
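A minimal base R sketch of this preprocessing is given below. The helper prepare_record is hypothetical and not part of editrules. The thresholds tol and big follow the advice above, while the scale factor of 1E6 is an arbitrary assumption, and "near-zero" is read here as near zero in absolute value.

```r
# Hypothetical helper (not part of editrules). tol and big follow the advice
# above; the default scale factor is an assumption made for this sketch.
prepare_record <- function(x, tol = sqrt(.Machine$double.eps),
                           big = 1e9, scale = 1e6) {
  # Set near-zero values to exactly zero; which() skips missing values.
  x[which(abs(x) <= tol)] <- 0
  # Divide the whole record by a constant when any value reaches `big`.
  used <- if (any(abs(x) >= big, na.rm = TRUE)) scale else 1
  list(x = x / used, scale = used)
}

r <- prepare_record(c(p = 7.55e9, c = 1.25e9, t = 2e9, d = 1e-12))
r$x      # p, c and t divided by 1e6; d set to 0
r$scale  # factor to multiply by after the record has been repaired
```

Dividing every variable by the same constant leaves edits without a constant term, such as p + c == t, unchanged. Edits that do contain a constant need that constant scaled by the same factor.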

Note

This method is potentially very slow for objects of class editset that contain many conditional restrictions. Consider using localizeErrors with the option method="mip" in such cases.
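As a hedged illustration of that alternative, the sketch below localizes errors in a one-record data.frame against a small editset with one conditional restriction. The edits and the record are made up for this example, and the requireNamespace() guard only keeps the sketch runnable where the package is not installed.

```r
if (requireNamespace("editrules", quietly = TRUE)) {
  library(editrules)

  # Illustrative edits: one linear restriction and one conditional one.
  E <- editset(expression(
    x + y == z,
    if (x > 0) y > 0
  ))

  # localizeErrors works on a data.frame, one record per row.
  dat <- data.frame(x = 1, y = -1, z = 2)

  loc <- localizeErrors(E, dat, method = "mip")
  print(loc$adapt)  # logical matrix: TRUE marks the variables to adapt
}
```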

References

I.P. Fellegi and D. Holt (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association 71, pp. 17-25.

T. De Waal (2003). Processing of erroneous and unsafe data. PhD thesis, Erasmus Research Institute of Management, Erasmus University Rotterdam. http://www.cbs.nl/nl-NL/menu/methoden/onderzoek-methoden/onderzoeksrapporten/proefschriften/2008-proefschrift-de-waal.htm

T. De Waal, J. Pannekoek and S. Scholtus (2011). Handbook of Statistical Data Editing and Imputation. Wiley Handbooks on Survey Methodology.

See Also

errorLocalizer_mip, localizeErrors, checkDatamodel, violatedEdits

Examples

library(editrules)

#### examples with numerical edits
# example with a single editrule
# p = profit, c = cost, t = turnover
E <- editmatrix(c("p + c == t"))
cp <- errorLocalizer(E, x=c(p=755, c=125, t=200))
# x obviously violates E. With all weights equal, changing any variable will do.
# first solution:
cp$searchNext()
# second solution:
cp$searchNext()
# third solution:
cp$searchNext()
# there are no more solutions, since changing more variables would increase the
# weight, so the result of the next statement is NULL:
cp$searchNext()

# Increasing the reliability weight of turnover yields 2 solutions:
cp <- errorLocalizer(E, x=c(p=755, c=125, t=200), weight=c(1,1,2))
# first solution:
cp$searchNext()
# second solution:
cp$searchNext()
# no more solutions available:
cp$searchNext()


# A case with two restrictions. The second restriction demands that
# c/t >= 0.6 (cost should be at least 60% of turnover)
E <- editmatrix(c(
        "p + c == t",
        "c - 0.6*t >= 0"))
cp <- errorLocalizer(E,x=c(p=755,c=125,t=200))
# Now, there's only one solution, but we need two runs to find it (the 1st one
# has higher weight)
cp$searchNext()
cp$searchNext()

# With the searchBest() function, the lowest-weight solution is found at once:
errorLocalizer(E,x=c(p=755,c=125,t=200))$searchBest()


# An example with missing data.
E <- editmatrix(c(
    "p + c1 + c2 == t",
    "c1 - 0.3*t >= 0",
    "p > 0",
    "c1 > 0",
    "c2 > 0",
    "t > 0"))
cp <- errorLocalizer(E,x=c(p=755, c1=50, c2=NA,t=200))
# (Note that e2 is violated.)
# There are two solutions. Both demand that c2 is adapted:
cp$searchNext()
cp$searchNext()

##### Examples with categorical edits
# 
# 3 variables, recording age class, position in household, and marital status:
# We define the datamodel and the rules
E <- editarray(expression(
    age %in% c('under aged','adult'),
    maritalStatus %in% c('unmarried','married','widowed','divorced'),
    positionInHousehold %in% c('marriage partner', 'child', 'other'),
    if( age == 'under aged' ) 
        maritalStatus == 'unmarried',
    if( maritalStatus %in% c('married','widowed','divorced')) 
        !positionInHousehold %in% c('marriage partner','child')
    )
)
E

# Let's define a record with an obvious error:
r <- c(
  age = 'under aged', 
  maritalStatus='married', 
  positionInHousehold='child')
# The age class and position in household are consistent, while the marital
# status conflicts. Therefore, changing only the marital status (instead of
# both age class and position in household) seems reasonable.
el <- errorLocalizer(E,r)
el$searchNext()

editrules documentation built on May 1, 2019, 6:32 p.m.