In epicentre-msf/dbc: Dictionary-Based Cleaning

clean_categorical

R Documentation

Clean categorical variables within a dataset based on a dictionary of value-replacement pairs

Description

Applies a dictionary of value-replacement pairs to clean and standardize the values of categorical variables. Includes options for text standardization, to account for minor differences in character case, spacing, and punctuation.

Usage

clean_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  non_allowed_to_missing = TRUE,
  fn = std_text,
  na = ".na"
)

Arguments

`x`	A data frame with one or more columns to clean
`dict_allowed`	Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args `col_allowed_var` and `col_allowed_value`).
`dict_clean`	Optional dictionary of value-replacement pairs (e.g. produced by `check_categorical`). Must include columns "variable", "value", "replacement", and, if `vars_id` is specified, every column named in `vars_id`.
`vars_id`	Optional vector of one or more ID columns within `x` on which corrections should be conditional. If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
`col_allowed_var`	Name of column in `dict_allowed` giving variable name (defaults to "variable")
`col_allowed_value`	Name of column in `dict_allowed` giving allowed values (defaults to "value")
`non_allowed_to_missing`	Logical indicating whether to replace values that remain non-allowed, even after cleaning and standardization, with `NA`. Defaults to `TRUE`. If no cleaning dictionary (`dict_clean`) is provided, values are simply standardized to match the allowed values specified in `dict_allowed`.
`fn`	Function used to standardize raw values in both the dataset and the dictionary before comparing them, to account for minor variation in character case, spacing, punctuation, etc. Defaults to `std_text`. To omit the standardization step, use e.g. `as.character` or an identity function `function(x) x`.
`na`	Keyword to use within column "replacement" for values that should be converted to `NA`. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to `NA`.
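
The shapes of the two dictionaries, and the role of the `na` keyword, can be illustrated with a small base R sketch. The data and the helper `apply_dict()` below are hypothetical and are not part of dbc; the sketch also assumes that entries whose "replacement" is still missing are simply skipped, which may differ from what the package does internally.

```r
# Hypothetical dictionary of allowed values (columns "variable" and "value")
dict_allowed <- data.frame(
  variable = c("sex", "sex", "outcome", "outcome"),
  value    = c("Male", "Female", "Recovered", "Died"),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning dictionary (columns "variable", "value", "replacement")
dict_clean <- data.frame(
  variable    = c("sex", "sex", "outcome"),
  value       = c("m", "unknown", "cured"),
  replacement = c("Male", ".na", NA),  # ".na" = verified missing; NA = not yet verified
  stringsAsFactors = FALSE
)

# Illustrative helper (not a dbc function): apply the verified entries of a
# cleaning dictionary to the values of a single variable
apply_dict <- function(x, variable, dict, na = ".na") {
  # keep only entries for this variable that have a replacement filled in
  d <- dict[dict$variable == variable & !is.na(dict$replacement), ]
  i <- match(x, d$value)
  out <- ifelse(is.na(i), x, d$replacement[i])
  # the keyword marks values that were verified and should become NA
  out[out %in% na] <- NA
  out
}

sex_clean     <- apply_dict(c("m", "unknown", "Female"), "sex", dict_clean)
outcome_clean <- apply_dict(c("cured", "Died"), "outcome", dict_clean)
```

In this sketch "unknown" becomes `NA` because its replacement is the keyword, whereas "cured" is left unchanged because its replacement has not yet been filled in.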

Value

The original data frame `x`, but with cleaned versions of the categorical variables specified in argument `dict_allowed`.
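
As a rough illustration of what this return value looks like, the base R sketch below cleans only the columns named in a dictionary of allowed values and leaves every other column untouched. The function `clean_sketch()`, its simplified default `fn`, and the example data are all hypothetical; this is not the dbc implementation, and it ignores `dict_clean` and `vars_id` entirely.

```r
# Illustrative only: standardize-and-match for each variable in the dictionary
clean_sketch <- function(x, dict_allowed,
                         fn = function(v) tolower(trimws(v))) {
  for (v in intersect(unique(dict_allowed$variable), names(x))) {
    allowed <- dict_allowed$value[dict_allowed$variable == v]
    # compare standardized values, return the allowed spelling,
    # and set anything that still does not match to NA
    x[[v]] <- allowed[match(fn(x[[v]]), fn(allowed))]
  }
  x
}

# hypothetical example data
dat <- data.frame(
  id  = c("A1", "A2", "A3"),
  sex = c("male", "FEMALE ", "x"),
  age = c(34, 51, 8),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  variable = c("sex", "sex"),
  value    = c("Male", "Female"),
  stringsAsFactors = FALSE
)

out <- clean_sketch(dat, dict)
```

Here `out` has the same columns and rows as `dat`; only `sex` has changed.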

Examples

# load example dataset, dictionary of allowed categorical values, and
# cleaning dictionary
data(ll1)
data(dict_categ1)
data(clean_categ1)

# dictionary-based corrections to categorical vars
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1
)

# require exact matching, including character case
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1,
  fn = identity
)

# standardize values to match dict_allowed, with no additional dictionary-based cleaning
clean_categorical(
  ll1,
  dict_allowed = dict_categ1
)
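
Between the default `std_text` and `identity` there is room for a custom standardization function. The following is a minimal sketch of one in base R; `std_simple()` is a hypothetical name, and it is not claimed to reproduce what `std_text` does.

```r
# Hypothetical standardization function: lower-case, treat punctuation as
# spaces, collapse runs of whitespace, and trim the ends
std_simple <- function(x) {
  x <- tolower(as.character(x))
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

std_values <- std_simple(c(" Health  Centre", "health-centre", "HEALTH CENTRE.", NA))
```

Since the `fn` argument accepts a function, a helper like this could be supplied as `fn = std_simple` so that values differing only in case, spacing, or punctuation are treated as equal.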

epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.