In epicentre-msf/dbc: Dictionary-Based Cleaning

check_categorical

R Documentation

Produce a dictionary of non-valid categorical values within a dataset, for use in subsequent data cleaning

Description

Values are compared against a user-provided dictionary specifying the allowed values of each categorical variable, after text standardization to account for minor differences in character case, spacing, and punctuation.

The resulting cleaning dictionary can then be manually reviewed to fill in appropriate replacement values for each non-valid categorical value, or a missing-value keyword indicating that the value should be converted to NA.
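
The comparison-after-standardization idea can be sketched in base R. This is only an illustration: `std_sketch()` is a hypothetical stand-in written for this example and is not the package's `std_text()`, whose exact rules may differ.

```r
# Hypothetical standardizer for illustration only; NOT dbc's std_text()
std_sketch <- function(x) {
  x <- tolower(as.character(x))
  x <- gsub("[[:punct:][:space:]]+", "_", x)  # collapse punctuation/whitespace runs
  gsub("^_+|_+$", "", x)                      # trim leading/trailing separators
}

allowed <- c("Male", "Female")                # allowed values for one variable
raw     <- c("male", " FEMALE ", "Fem.", "M", NA)

# A value is valid if it is missing or matches an allowed value after standardization
is_valid  <- is.na(raw) | std_sketch(raw) %in% std_sketch(allowed)
non_valid <- unique(raw[!is_valid])
non_valid
```

Here "male" and " FEMALE " match after standardization, while "Fem." and "M" remain non-valid and would need a manually chosen replacement.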

Usage

check_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  fn = std_text,
  allow_na = TRUE,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)

Arguments

`x`	A data frame with one or more columns to check
`dict_allowed`	Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args `col_allowed_var` and `col_allowed_value`).
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. from a previous run of this function). Must include columns "variable", "value", and "replacement", plus every column named in `vars_id` if that argument is specified.
`vars_id`	Optional vector of one or more ID columns within `x` on which corrections should be conditional. If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
`col_allowed_var`	Name of column in `dict_allowed` giving variable name (defaults to "variable")
`col_allowed_value`	Name of column in `dict_allowed` giving allowed values (defaults to "value")
`fn`	Function to standardize raw values in both the dataset and dictionary prior to comparing, to account for minor variation in character case, spacing, punctuation, etc. Defaults to `std_text`. To omit the standardization step, use e.g. `as.character` or an identity function such as `function(x) x`.
`allow_na`	Logical indicating whether missing values should always be treated as 'allowed' even if not explicitly specified in `dict_allowed`. Defaults to `TRUE`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.
`populate_na`	Logical indicating whether to pre-populate column "replacement" with values specified by keyword `na`. If most non-valid values in `x` are non-correctable, pre-populating the keyword `na` can save time during the manual verification/correction phase. Defaults to `FALSE`.
`return_all`	Logical indicating whether to return all non-valid values, including those already specified in `dict_clean` if given (`TRUE`), or only the new non-valid entries not already in `dict_clean` (`FALSE`). Defaults to `FALSE`.
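
As a rough illustration of how some of these arguments might be combined, the sketch below uses a small made-up dataset. The objects `x_toy` and `dict_toy` are hypothetical and exist only for this example; the calls use only the arguments documented above, and they run only if the dbc package is installed.

```r
# Hypothetical toy inputs (not part of the package)
x_toy <- data.frame(
  id  = c("a1", "a2", "a3"),
  sex = c("male", "femal", "Unk"),
  stringsAsFactors = FALSE
)
dict_toy <- data.frame(
  variable = "sex",
  value    = c("Male", "Female"),
  stringsAsFactors = FALSE
)

if (requireNamespace("dbc", quietly = TRUE)) {
  # One dictionary entry per combination of variable, non-valid value, and id
  by_id <- dbc::check_categorical(x_toy, dict_allowed = dict_toy, vars_id = "id")

  # Skip standardization, so differences in case are also treated as non-valid
  strict <- dbc::check_categorical(x_toy, dict_allowed = dict_toy, fn = function(x) x)
}
```

With the identity function passed as `fn`, a value such as "male" would presumably be reported as non-valid because it no longer matches "Male" exactly.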

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

columns specified in vars_id, if given
variable: column name of variable within x
value: non-valid value
replacement: correct value that should replace a given non-valid value
new: logical indicating whether the entry is new (TRUE) or already specified in argument dict_clean (<NA>)
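
The manual review step can be sketched in base R on a hand-built data frame that mimics the columns listed above. The values are invented for illustration, and the real output may carry additional columns (e.g. those given in `vars_id`).

```r
# Hypothetical cleaning dictionary shaped like the documented return value
dict_review <- data.frame(
  variable    = c("sex", "sex", "sex"),
  value       = c("femal", "Unk", "m?"),
  replacement = NA_character_,   # NA means "not yet reviewed"
  new         = TRUE,
  stringsAsFactors = FALSE
)

# Reviewed and correctable: supply the allowed value
dict_review$replacement[dict_review$value == "femal"] <- "Female"

# Reviewed and not correctable: use the missing-value keyword (default ".na")
dict_review$replacement[dict_review$value == "Unk"] <- ".na"

# Entries still awaiting review keep a true NA in "replacement"
pending <- dict_review[is.na(dict_review$replacement), ]
pending$value
```

Keeping the keyword distinct from a true `NA` is what lets the remaining unreviewed entries be picked out, as `pending` does here.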

Examples

# load example dataset, and dictionary of allowed categorical values
data(ll1)
data(dict_categ1)

# basic output
check_categorical(ll1, dict_allowed = dict_categ1)
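
A possible two-pass workflow, sketched using only the arguments documented above and the example data loaded earlier. It assumes the output of a first run can be passed back as `dict_clean`, as the argument description suggests; the object names `first_pass` and `second_pass` are illustrative. The calls run only if dbc is installed.

```r
has_dbc <- requireNamespace("dbc", quietly = TRUE)

if (has_dbc) {
  library(dbc)
  data(ll1)
  data(dict_categ1)

  # First pass: pre-populate "replacement" with the missing-value keyword
  first_pass <- check_categorical(
    ll1,
    dict_allowed = dict_categ1,
    populate_na = TRUE
  )

  # Later pass: supply the existing dictionary and return all entries,
  # both those already in dict_clean and any new ones
  second_pass <- check_categorical(
    ll1,
    dict_allowed = dict_categ1,
    dict_clean = first_pass,
    return_all = TRUE
  )
}
```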

epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.