View source: R/check_categorical.R
check_categorical | R Documentation |
Values are compared against a user-provided dictionary specifying the allowed values of each categorical variable, after text standardization to account for minor differences in character case, spacing, and punctuation.
The resulting cleaning dictionary can then be manually reviewed to fill in appropriate replacement values for each non-valid categorical value, or a missing-value keyword indicating that the value should be converted to NA.
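The standardize-then-compare idea described above can be sketched in base R. This is only an illustration, not the package's implementation: `std_simple` is a hypothetical stand-in for the default standardizer, and the allowed and raw values are made up.

```r
# Hypothetical stand-in for the package's text standardizer (NOT std_text itself):
# lower-case, collapse runs of non-alphanumeric characters to "_", trim edge "_".
std_simple <- function(x) {
  x <- tolower(trimws(as.character(x)))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

# Made-up allowed values and raw data values
allowed <- c("Yes", "No", "Unknown")
raw <- c("yes", " NO ", "Unknown.", "Oui", NA)

# A value counts as valid if its standardized form matches a standardized allowed value
is_valid <- std_simple(raw) %in% std_simple(allowed)

# Treat missing values as allowed (the idea behind allow_na = TRUE)
is_valid[is.na(raw)] <- TRUE

# Values that would end up in the cleaning dictionary
raw[!is_valid]
```

Only "Oui" is flagged here; the differences in case, spacing, and punctuation in the other values disappear after standardization.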
check_categorical(
  x,
  dict_allowed,
  dict_clean = NULL,
  vars_id = NULL,
  col_allowed_var = "variable",
  col_allowed_value = "value",
  fn = std_text,
  allow_na = TRUE,
  na = ".na",
  populate_na = FALSE,
  return_all = FALSE
)
x |
A data frame with one or more columns to check |
dict_allowed |
Dictionary of allowed values for each variable of interest. Must include columns for "variable" and "value" (the names of which can be modified with args col_allowed_var and col_allowed_value). |
dict_clean |
Optional dictionary of value-replacement pairs (e.g. from a previous run of this function). Must include columns "variable", "value", "replacement", and, if specified as an argument, all columns specified by vars_id. |
vars_id |
Optional vector of one or more ID columns within x. If not specified, the cleaning dictionary contains one entry for each unique combination of variable and non-valid value. If specified, the cleaning dictionary contains one entry for each unique combination of variable, non-valid value, and ID variable. |
col_allowed_var |
Name of column in dict_allowed giving the variable name. Defaults to "variable". |
col_allowed_value |
Name of column in dict_allowed giving the allowed values. Defaults to "value". |
fn |
Function to standardize raw values in both the dataset and dictionary prior to comparing, to account for minor variation in character case, spacing, punctuation, etc. Defaults to std_text. |
allow_na |
Logical indicating whether missing values should always be treated as 'allowed' even if not explicitly specified in dict_allowed. Defaults to TRUE. |
na |
Keyword to use within column "replacement" for values that should be converted to NA. Defaults to ".na". |
populate_na |
Logical indicating whether to pre-populate column "replacement" with values specified by keyword na. Defaults to FALSE. |
return_all |
Logical indicating whether to return all non-valid values including those already specified in argument dict_clean. Defaults to FALSE. |
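To make the expected dictionary layout concrete, here is a minimal sketch of a data frame in the shape described for dict_allowed: one row per variable/value pair, using the default column names "variable" and "value". The variable names and values are hypothetical, not taken from the package's example data.

```r
# Hypothetical dictionary of allowed values: one row per variable/value pair
dict_allowed_example <- data.frame(
  variable = c("sex", "sex", "status", "status", "status"),
  value    = c("Male", "Female", "Confirmed", "Probable", "Suspected"),
  stringsAsFactors = FALSE
)

# Allowed values for a single variable
allowed_status <- dict_allowed_example$value[dict_allowed_example$variable == "status"]
allowed_status
```

If the dictionary used other column names, those names would be passed through col_allowed_var and col_allowed_value instead.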
Data frame representing a dictionary of non-valid values, to be used in a future data cleaning step (after specifying the corresponding replacement values). Columns include:
columns specified in vars_id, if given
variable: column name of variable within x
value: non-valid value
replacement: correct value that should replace a given non-valid value
new: logical indicating whether the entry is new (TRUE) or already specified in argument dict_clean (<NA>)
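Once the replacement column has been filled in, the dictionary can drive a later cleaning step. The following base R sketch shows one way that step might look; it is an assumption for illustration, not the package's own cleaning function. The data, the dictionary contents, and the object names are hypothetical, and matching is done on exact raw values without any text standardization.

```r
# Hypothetical dataset with one categorical column
dat <- data.frame(
  id = 1:4,
  status = c("Confirmed", "confirmd", "n/a", "Probable"),
  stringsAsFactors = FALSE
)

# Hypothetical completed cleaning dictionary (columns as described above)
dict_clean_example <- data.frame(
  variable    = c("status", "status"),
  value       = c("confirmd", "n/a"),
  replacement = c("Confirmed", ".na"),
  stringsAsFactors = FALSE
)

# Keyword marking values that should become NA (same default as argument na)
na_keyword <- ".na"

# Apply each dictionary entry to the matching column of the dataset
for (i in seq_len(nrow(dict_clean_example))) {
  v <- dict_clean_example$variable[i]
  hit <- !is.na(dat[[v]]) & dat[[v]] == dict_clean_example$value[i]
  repl <- dict_clean_example$replacement[i]
  dat[[v]][hit] <- if (identical(repl, na_keyword)) NA_character_ else repl
}

dat$status
```

After the loop, the misspelled entry is corrected and the "n/a" entry is converted to a true missing value.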
# load example dataset, and dictionary of allowed categorical values
data(ll1)
data(dict_categ1)
# basic output
check_categorical(ll1, dict_allowed = dict_categ1)