clean_categorical: Clean categorical variables within a dataset based on a...

View source: R/clean_categorical.R

clean_categorical    R Documentation

Clean categorical variables within a dataset based on a dictionary of value-replacement pairs

Description

Applies a dictionary of value-replacement pairs to clean and standardize the values of categorical variables. Includes an optional text-standardization step to account for minor differences in character case, spacing, and punctuation.

Usage

clean_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  non_allowed_to_missing = TRUE,
  fn = std_text,
  na = ".na"
)

Arguments

x

A data frame with one or more columns to clean

dict_allowed

Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args col_allowed_var and col_allowed_value).

dict_clean

Optional dictionary of value-replacement pairs (e.g. produced by check_categorical). Must include columns "variable", "value", and "replacement" and, if vars_id is specified, every column named in vars_id.

vars_id

Optional vector of one or more ID columns within x on which corrections should be conditional.

If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable.

col_allowed_var

Name of column in dict_allowed giving variable name (defaults to "variable")

col_allowed_value

Name of column in dict_allowed giving allowed values (defaults to "value")

non_allowed_to_missing

Logical indicating whether values that remain non-allowed, even after cleaning and standardization, should be converted to NA. Defaults to TRUE.

If no cleaning dictionary (dict_clean) is provided, the function simply standardizes columns to match the allowed values specified in dict_allowed.

fn

Function used to standardize raw values in both the dataset and the dictionary before they are compared, to account for minor variation in character case, spacing, punctuation, etc. Defaults to std_text. To omit the standardization step, use e.g. as.character or an identity function such as function(x) x.

na

Keyword to use within column "replacement" for values that should be converted to NA. Defaults to ".na". The keyword is used to distinguish between "replacement" values that are missing because they have yet to be manually verified, and values that have been verified and really should be converted to NA.
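As an illustration of the dictionary layouts described in the arguments above, the following sketch builds both dictionaries by hand in base R. The column names ("variable", "value", "replacement") and the ".na" keyword come from the argument descriptions; the variables (sex, status) and all of the values are hypothetical and chosen only for illustration.

```r
# Hypothetical dictionary of allowed values: one row per variable/value pair
dict_allowed <- data.frame(
  variable = c("sex", "sex", "status", "status"),
  value    = c("Male", "Female", "Confirmed", "Suspected"),
  stringsAsFactors = FALSE
)

# Hypothetical cleaning dictionary: one row per variable/non-valid value pair
dict_clean <- data.frame(
  variable    = c("sex", "sex", "status"),
  value       = c("M", "unknown", "conf."),
  # "Male" : verified correction
  # ".na"  : verified, value should be converted to NA
  # NA     : replacement not yet manually verified
  replacement = c("Male", ".na", NA),
  stringsAsFactors = FALSE
)
```

In this sketch the second row uses the ".na" keyword to record a deliberate conversion to NA, while the third row's missing replacement marks an entry that still needs review.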

Value

The original data frame x but with cleaned versions of the categorical variables specified in argument dict_allowed

Examples

# load example dataset, dictionary of allowed categorical values, and
# cleaning dictionary
data(ll1)
data(dict_categ1)
data(clean_categ1)

# dictionary-based corrections to categorical vars
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1
)

# require exact matching, including character case
clean_categorical(
  ll1,
  dict_allowed = dict_categ1,
  dict_clean = clean_categ1,
  fn = identity
)

# standardize values to match dict_allowed, with no additional dict-based cleaning
clean_categorical(
  ll1,
  dict_allowed = dict_categ1
)
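
To illustrate the kind of function that could be supplied as fn, here is a minimal base R sketch of comparing values after standardization. The helper std_simple is hypothetical (it is not the package's std_text, and may behave differently), and the matching logic below is only an analogy for the documented behaviour, not the package's implementation.

```r
# Hypothetical custom standardizer (an assumption, not the package's std_text):
# coerce to character, trim outer whitespace, convert to lower case
std_simple <- function(x) tolower(trimws(as.character(x)))

raw     <- c("Male", " male", "MALE ", "Femal")
allowed <- c("Male", "Female")

# compare standardized raw values against standardized allowed values
is_allowed <- std_simple(raw) %in% std_simple(allowed)

# map each matched raw value back to its allowed spelling;
# unmatched values become NA (similar in spirit to non_allowed_to_missing = TRUE)
cleaned <- allowed[match(std_simple(raw), std_simple(allowed))]

# a function of this form could then be passed as the fn argument, e.g.
# clean_categorical(ll1, dict_allowed = dict_categ1, fn = std_simple)
```

Here the first three raw values differ from "Male" only in case and spacing, so they match after standardization, while "Femal" is a genuine misspelling that would need an entry in the cleaning dictionary.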


epicentre-msf/dbc documentation built on Oct. 24, 2023, 9:25 p.m.