check_numeric: Produce a dictionary of non-valid numeric values within a...

View source: R/check_numeric.R

check_numeric    R Documentation

Produce a dictionary of non-valid numeric values within a dataset, for use in subsequent data cleaning

Description

The resulting cleaning dictionary can then be manually reviewed to fill in appropriate replacement values for each non-valid numeric value, or a missing-value keyword indicating that the value should be converted to NA.

Usage

check_numeric(
  x,
  vars,
  vars_id = NULL,
  queries = list(),
  dict_clean = NULL,
  fn = as.numeric,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)

Arguments

x

A data frame with one or more columns to check

vars

Names of columns within x to check

vars_id

Optional vector of one or more ID columns within x on which corrections should be conditional.

If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable(s).
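To illustrate the difference in granularity, here is a minimal base-R sketch. It does not call check_numeric; the data frame `dat` and the objects `without_id` and `with_id` are hypothetical, and the validity rule (a non-missing value that as.numeric cannot convert) is an assumption made for illustration.

```r
# Hypothetical data: the same non-valid value appears for two different IDs
dat <- data.frame(
  id  = c("A1", "A2", "A3"),
  age = c("unknown", "unknown", "34"),
  stringsAsFactors = FALSE
)

# Assumed rule: non-missing values that fail numeric conversion are non-valid
bad <- !is.na(dat$age) & is.na(suppressWarnings(as.numeric(dat$age)))

# Without an ID column: one entry per unique (variable, value)
without_id <- unique(data.frame(variable = "age", value = dat$age[bad]))

# With an ID column: one entry per unique (id, variable, value)
with_id <- unique(data.frame(id = dat$id[bad], variable = "age", value = dat$age[bad]))

nrow(without_id)  # 1
nrow(with_id)     # 2
```

Conditioning on an ID is useful when the same non-valid value needs a different correction in different records.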

queries

Optional list of expressions to check for non-valid values. May include a .x selector, which stands in for any of the numeric variables specified in argument vars. For example:

list(
  age > 110,  # age greater than 110
  .x < 0      # any numeric value less than 0
)

dict_clean

Optional dictionary of value-replacement pairs (e.g. from a previous run of this function). Must include columns "variable", "value", "replacement", and, if specified as an argument, all columns specified by vars_id.

fn

Function to convert values to numeric. Defaults to as.numeric.
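As a rough sketch of what "non-valid" could mean in terms of fn, the base-R snippet below flags values that are present in the raw data but become NA after conversion. This rule is an assumption for illustration, not a statement of how check_numeric is implemented; the vector `raw` is hypothetical.

```r
# Hypothetical raw values, as they might appear in a character column
raw <- c("23", "45", "4O", "unknown", NA, "12.5")

# Convert with the default fn; coercion warnings are suppressed for brevity
converted <- suppressWarnings(as.numeric(raw))

# Assumed rule: non-valid = not missing originally, but NA after conversion
non_valid <- raw[!is.na(raw) & is.na(converted)]
non_valid
```

Under this rule, "4O" (letter O instead of zero) and "unknown" are flagged, while the original NA is not. A custom fn could be stricter or more lenient, for example by stripping whitespace or thousands separators before converting.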

na

Keyword to use within column "replacement" for values that should be converted to NA. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to NA.

populate_na

Logical indicating whether to pre-populate column "replacement" with values specified by keyword na. If most non-valid values in x are non-correctable, pre-populating the keyword na can save time during the manual verification/correction phase. Defaults to FALSE.

return_all

Logical indicating whether to return all non-valid values, including those already present in dict_clean if that argument is given (TRUE), or only the new non-valid entries not already present in dict_clean (FALSE). Defaults to FALSE.

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

  • columns specified in vars_id, if given

  • variable: column name of variable within x

  • value: non-valid value

  • replacement: correct value that should replace a given non-valid value

  • new: logical indicating whether the entry is new (TRUE) or already specified in argument dict_clean (<NA>)
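The later cleaning step can be pictured with the base-R sketch below, which applies a manually completed dictionary to a single column. It is an illustration under stated assumptions rather than the package's own cleaning routine: the objects `dat` and `dict` are hypothetical, and the dictionary has no ID columns.

```r
# Hypothetical raw data and a manually completed cleaning dictionary
dat <- data.frame(age = c("23", "4O", "unknown"), stringsAsFactors = FALSE)
dict <- data.frame(
  variable    = c("age", "age"),
  value       = c("4O", "unknown"),
  replacement = c("40", ".na"),   # ".na" is the missing-value keyword
  stringsAsFactors = FALSE
)

# Look up each raw value among the dictionary entries for this variable
d   <- dict[dict$variable == "age", ]
idx <- match(dat$age, d$value)
hit <- !is.na(idx)

# Substitute replacements, then turn the keyword into a real NA
cleaned <- dat$age
cleaned[hit] <- d$replacement[idx[hit]]
cleaned[cleaned %in% ".na"] <- NA

age_clean <- as.numeric(cleaned)
age_clean
```

Here "4O" becomes 40 and "unknown" becomes NA. An entry whose replacement is still empty would be left for further manual review, which is the distinction the na keyword is meant to preserve.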

Examples

# load example dataset
data(ll1)
data(clean_num1)

# basic output
check_numeric(ll1, c("age", "contacts"))

# include id var "id"
check_numeric(ll1, c("age", "contacts"), vars_id = "id")

# add custom query
check_numeric(ll1, c("age", "contacts"), vars_id = "id", queries = list(age > 90))

# prepopulate column 'replacement'
check_numeric(ll1, c("age", "contacts"), vars_id = "id", populate_na = TRUE)

# use dictionary of pre-specified corrections
check_numeric(ll1, c("age", "contacts"), dict_clean = clean_num1)


epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.