locate_errors: Find errors in data

locate_errorsR Documentation

Find errors in data

Description

Find out which fields in a data.frame are "faulty" using validation rules This method returns found errors, according to the specified method x. Use method replace_errors(), to automatically remove these errors. '

Usage

locate_errors(
  data,
  x,
  ...,
  cl = NULL,
  Ncpus = getOption("Ncpus", 1),
  timeout = 60
)

## S4 method for signature 'data.frame,validator'
locate_errors(
  data,
  x,
  weight = NULL,
  ref = NULL,
  ...,
  cl = NULL,
  Ncpus = getOption("Ncpus", 1),
  timeout = 60
)

## S4 method for signature 'data.frame,ErrorLocalizer'
locate_errors(
  data,
  x,
  weight = NULL,
  ref = NULL,
  ...,
  cl = NULL,
  Ncpus = getOption("Ncpus", 1),
  timeout = 60
)

Arguments

data

data to be checked

x

validation rules or errorlocalizer object to be used for finding possible errors.

...

optional parameters that are passed to lpSolveAPI::lp.control() (see details)

cl

optional parallel / cluster.

Ncpus

number of nodes to use. See details

timeout

maximum number of seconds that the localizer should use per record.

weight

numeric optional weight specification to be used in the error localization (see expand_weights()).

ref

data.frame optional reference data to be used in the rules checking

Details

Use an Inf weight specification to fixate variables that can not be changed. See expand_weights() for more details.

locate_errors uses lpSolveAPI to formulate and solves a mixed integer problem. For details see the vignettes. This solver has many options: lpSolveAPI::lp.control.options. Noteworthy options to be used are:

  • timeout: restricts the time the solver spends on a record (seconds)

  • break.at.value: set this to minimum weight + 1 to improve speed.

  • presolve: default for errorlocate is "rows". Set to "none" when you have solutions where all variables are deemed wrong.

locate_errors can be run on multiple cores using R package parallel.

  • The easiest way to use the parallel option is to set Ncpus to the number of desired cores, @seealso parallel::detectCores().

  • Alternatively one can create a cluster object (parallel::makeCluster()) and use cl to pass the cluster object.

  • Or set cl to an integer which results in parallel::mclapply(), which only works on non-windows.

Value

errorlocation-class() object describing the errors found.

See Also

Other error finding: errorlocation-class, errors_removed(), expand_weights(), replace_errors()

Examples

rules <- validator( profit + cost == turnover
                  , cost >= 0.6 * turnover # cost should be at least 60% of turnover
                  , turnover >= 0 # can not be negative.
                  )
data <- data.frame( profit   = 755
                  , cost     = 125
                  , turnover = 200
                  )
le <- locate_errors(data, rules)

print(le)
summary(le)

v_categorical <- validator( branch %in% c("government", "industry")
                          , tax %in% c("none", "VAT")
                          , if (tax == "VAT") branch == "industry"
)

data <- read.csv(text=
"   branch, tax
government, VAT
industry  , VAT
", strip.white = TRUE)
locate_errors(data, v_categorical)$errors

v_logical <- validator( citizen %in% c(TRUE, FALSE)
                      , voted %in% c(TRUE, FALSE)
                      ,  if (voted == TRUE) citizen == TRUE
                      )

data <- data.frame(voted = TRUE, citizen = FALSE)
locate_errors(data, v_logical, weight=c(2,1))$errors

# try a condinational rule
v <- validator( married %in% c(TRUE, FALSE)
              , if (married==TRUE) age >= 17
              )
data <- data.frame( married = TRUE, age = 16)
locate_errors(data, v, weight=c(married=1, age=2))$errors


# different weights per row
data <- read.csv(text=
"married, age
    TRUE,  16
    TRUE,  14
", strip.white = TRUE)

weight <- read.csv(text=
"married, age
       1,   2
       2,   1
", strip.white = TRUE)

locate_errors(data, v, weight = weight)$errors

# fixate / exclude a variable from error localiziation
# using an Inf weight
weight <- c(age = Inf)
locate_errors(data, v, weight = weight)$errors

errorlocate documentation built on Oct. 1, 2023, 1:08 a.m.