
View source: R/check_categorical.R

check_categorical    R Documentation

Produce a dictionary of non-valid categorical values within a dataset, for use in subsequent data cleaning

Description

Values are compared against a user-provided dictionary specifying the allowed values of each categorical variable, after text standardization to account for minor differences in character case, spacing, and punctuation.
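The standardize-then-compare idea can be sketched in base R. This is only an illustration: std_simple is a hypothetical stand-in written for this example, not the package's std_text, whose exact rules are not described here, and the allowed and raw values are invented.

```r
# Hypothetical stand-in for a text standardizer (NOT the package's std_text):
# lowercase, collapse punctuation and whitespace to "_", trim the edges.
std_simple <- function(x) {
  x <- tolower(as.character(x))
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  gsub("^_+|_+$", "", x)
}

allowed <- c("Confirmed", "Suspected", "Not a case")
raw     <- c("confirmed", "SUSPECTED ", "not-a-case", "probable", NA)

# A raw value is valid if its standardized form matches a standardized
# allowed value.
is_valid <- std_simple(raw) %in% std_simple(allowed)

# Treat missing values as allowed in this sketch
is_valid[is.na(raw)] <- TRUE

# Values that would need an entry in a cleaning dictionary
raw[!is_valid]
```

Differences in case, trailing spaces, and hyphens disappear after standardization, so only the genuinely unrecognized value is flagged.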

The resulting cleaning dictionary can then be manually reviewed to fill in appropriate replacement values for each non-valid categorical value, or a missing-value keyword indicating that the value should be converted to NA.

Usage

check_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  fn = std_text,
  allow_na = TRUE,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)

Arguments

x

A data frame with one or more columns to check

dict_allowed

Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args col_allowed_var and col_allowed_value).
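As a minimal sketch, a dictionary of allowed values can be built as a long-format data frame with one row per variable/value pair. The variable names and values below are invented for illustration, as are the alternative column names var_name and allowed.

```r
# Hypothetical dictionary of allowed values: one row per variable/value pair
dict_allowed <- data.frame(
  variable = c("sex", "sex", "outcome", "outcome", "outcome"),
  value    = c("Male", "Female", "Cured", "Died", "Transferred"),
  stringsAsFactors = FALSE
)

# The same dictionary with differently named columns. Per the argument
# descriptions, these names would then be supplied through
# col_allowed_var and col_allowed_value.
dict_renamed <- setNames(dict_allowed, c("var_name", "allowed"))

# Allowed values per variable, for a quick visual check
split(dict_allowed$value, dict_allowed$variable)
```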

dict_clean

Optional dictionary of value-replacement pairs (e.g. from a previous run of this function). Must include columns "variable", "value", "replacement", and, if vars_id is specified, all columns named in vars_id.

vars_id

Optional vector of one or more ID columns within x on which corrections should be conditional.

If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.
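The effect of ID columns on the number of dictionary entries can be illustrated with base R. The data frame below is hypothetical and only mimics the described behaviour; it does not call the package.

```r
# Hypothetical non-valid values found in a dataset, with an ID column
bad <- data.frame(
  id       = c("P01", "P02", "P03"),
  variable = c("sex", "sex", "outcome"),
  value    = c("M", "M", "Unknown"),
  stringsAsFactors = FALSE
)

# Without ID columns: one entry per unique variable/value combination
entries_no_id <- unique(bad[, c("variable", "value")])
nrow(entries_no_id)    # 2

# With an ID column: one entry per unique id/variable/value combination
entries_with_id <- unique(bad[, c("id", "variable", "value")])
nrow(entries_with_id)  # 3
```

Conditioning on an ID lets the same non-valid value receive a different replacement for different records, at the cost of more entries to review.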

col_allowed_var

Name of column in dict_allowed giving variable name (defaults to "variable")

col_allowed_value

Name of column in dict_allowed giving allowed values (defaults to "value")

fn

Function to standardize raw values in both the dataset and dictionary prior to comparing, to account for minor variation in character case, spacing, punctuation, etc. Defaults to std_text. To omit the standardization step, use e.g. as.character or an identity function such as function(x) x.

allow_na

Logical indicating whether missing values should always be treated as 'allowed' even if not explicitly specified in dict_allowed. Defaults to TRUE.

na

Keyword to use within column "replacement" for values that should be converted to NA. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to NA.
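The distinction between an unreviewed entry and a verified conversion to NA can be sketched as follows. The dictionary contents and the status column are hypothetical and exist only to make the three cases visible.

```r
# Hypothetical cleaning dictionary after partial manual review
dict_clean <- data.frame(
  variable    = c("sex", "sex", "sex"),
  value       = c("M", "unk", "Fem"),
  replacement = c("Male", ".na", NA),
  stringsAsFactors = FALSE
)

na_keyword <- ".na"

# A missing replacement means the entry has not been reviewed yet
reviewed <- !is.na(dict_clean$replacement)

# The keyword marks entries verified as "should become NA"
convert_na <- reviewed & dict_clean$replacement == na_keyword

dict_clean$status <- ifelse(
  !reviewed, "not yet reviewed",
  ifelse(convert_na, "convert to NA", "replace")
)

dict_clean
```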

populate_na

Logical indicating whether to pre-populate column "replacement" with values specified by keyword na. If most non-valid values in x are non-correctable, pre-populating the keyword na can save time during the manual verification/correction phase. Defaults to FALSE.

return_all

Logical indicating whether to return all non-valid values, including those already present in dict_clean (TRUE), or only new non-valid entries not already present in dict_clean (FALSE). Defaults to FALSE.

Value

Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:

  • columns specified in vars_id, if given

  • variable: column name of variable within x

  • value: non-valid value

  • replacement: correct value that should replace a given non-valid value

  • new: logical indicating whether the entry is new (TRUE) or already specified in argument dict_clean (<NA>)
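The meaning of the new column, and of returning all entries versus only new ones, can be mimicked in base R. This is a sketch of the documented semantics under invented data, not the package's implementation; the helper key is hypothetical.

```r
# Hypothetical existing cleaning dictionary (already reviewed)
existing <- data.frame(
  variable    = "sex",
  value       = "M",
  replacement = "Male",
  stringsAsFactors = FALSE
)

# Hypothetical non-valid values found in the current dataset
found <- data.frame(
  variable = c("sex", "sex"),
  value    = c("M", "Fem"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical) to build a comparable variable/value key
key <- function(d) paste(d$variable, d$value, sep = "\r")
is_new <- !(key(found) %in% key(existing))

# Only the new entries, flagged TRUE, with replacement still to be filled in
new_entries <- found[is_new, ]
new_entries$replacement <- NA_character_
new_entries$new <- TRUE

# All entries: those already in the existing dictionary carry new = NA
existing$new <- NA
all_entries <- rbind(existing, new_entries[, names(existing)])

all_entries
```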

Examples

# load example dataset, and dictionary of allowed categorical values
data(ll1)
data(dict_categ1)

# basic output
check_categorical(ll1, dict_allowed = dict_categ1)
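A further sketch, using the arguments listed under Usage with a small hand-built data frame instead of the bundled example data. The toy data are invented, and the call is guarded so that it runs only when the dbc package is installed; no output is shown because it has not been verified here.

```r
# Hypothetical toy data; column names and values are illustrative only
x <- data.frame(
  id  = c("P01", "P02", "P03"),
  sex = c("Male", "M", "female"),
  stringsAsFactors = FALSE
)

dict <- data.frame(
  variable = c("sex", "sex"),
  value    = c("Male", "Female"),
  stringsAsFactors = FALSE
)

# Run only if the dbc package is available
if (requireNamespace("dbc", quietly = TRUE)) {
  # Condition entries on the ID column and pre-populate "replacement"
  # with the na keyword
  res <- dbc::check_categorical(
    x,
    dict_allowed = dict,
    vars_id = "id",
    populate_na = TRUE
  )
  print(res)
}
```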


epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.