This function cleans your data according to pre-defined rules encapsulated in either a data frame or a list of data frames. It is useful for correcting misspellings and recoding variables (e.g. from electronic survey data).
a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in the data set.
a column name or position defining words or keys to be replaced
a column name or position defining replacement values
character or integer. If
a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the levels of the resulting factors will be ordered.
a vector of class definitions for each of the columns. If this
is not provided, the classes will be read from the columns themselves.
Practically, this is used in
By default, this applies the function clean_spelling() to all columns specified by the column names listed in spelling_vars, or, if a global dictionary is used, this includes all columns as well.
Spelling variables within wordlists represent keys that you want to match to column names in x (the data set). These are expected to match exactly, with the exception of two reserved keywords that start with a full stop:
.regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
.global: any column (see Section Global wordlists)
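As a rough illustration of how these keys relate to column names, the base R sketch below resolves each key to the columns it would target. The column names and the resolve_key() helper are hypothetical, and the sketch ignores column types; it is not the package's implementation, only a picture of the matching rules described above.

```r
# Hypothetical column names of a data set `x`
x_names <- c("lab_result_01", "lab_result_02", "region", "age")

# Keys as they might appear in the spelling_vars column of a wordlist
keys <- c("region", ".regex ^lab_result_", ".global")

# Illustrative helper (an assumption, not part of the package):
# resolve one key to the column names it targets
resolve_key <- function(key, column_names) {
  if (key == ".global") {
    # the global keyword targets every column
    return(column_names)
  }
  if (startsWith(key, ".regex ")) {
    # strip the keyword and treat the remainder as a PERL-flavored pattern
    pattern <- sub("^\\.regex ", "", key)
    return(column_names[grepl(pattern, column_names, perl = TRUE)])
  }
  # any other key must match a column name exactly
  column_names[column_names == key]
}

targets <- lapply(keys, resolve_key, column_names = x_names)
names(targets) <- keys
targets
```

Note that the pattern is written unquoted after the keyword, so ".regex ^lab_result_" selects both lab_result columns here.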
A global wordlist is a set of definitions applied to all valid columns of the data set.
.global spelling_vars: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the spelling_vars column of your wordlists data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global definitions. For example: if you have a column for cardinal directions and a definition for N = North, then the global definition N = no will not override that. See Example.
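The precedence rule can be sketched in base R as follows. The wordlist contents and the recode_column() helper are hypothetical and only mimic the described behaviour (specific definitions win over global ones); the column names options, values and grp follow the example data used on this page.

```r
# Hypothetical wordlist: global rules for "N" and "Y", plus a
# column-specific rule for "N" in the `direction` column
wordlist <- data.frame(
  options = c("N", "Y", "N"),
  values  = c("no", "yes", "North"),
  grp     = c(".global", ".global", "direction"),
  stringsAsFactors = FALSE
)

# Illustrative helper (an assumption, not the package's code)
recode_column <- function(values, column_name, wordlist) {
  specific <- wordlist[wordlist$grp == column_name, ]
  global   <- wordlist[wordlist$grp == ".global", ]
  # drop global rules whose key is already defined for this column
  global   <- global[!global$options %in% specific$options, ]
  rules    <- rbind(specific, global)
  hit      <- match(values, rules$options)
  # values without a matching rule are left unchanged
  ifelse(is.na(hit), values, rules$values[hit])
}

direction <- recode_column(c("N", "S", "Y"), "direction", wordlist)
consent   <- recode_column(c("N", "Y"), "consent", wordlist)
direction  # "North" "S" "yes"
consent    # "no" "yes"
```

In the direction column, N becomes North because the specific rule takes priority, while Y still picks up the global rule.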
spelling_vars = NULL: If you want your data frame to be applied to all character/factor columns indiscriminately, then setting spelling_vars = NULL will use that wordlist globally.
a data frame with re-defined data based on the dictionary
This function will only parse character and factor columns to protect numeric and Date columns from conversion to character.
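A minimal base R sketch of that selection rule, using a hypothetical data frame: only character and factor columns are picked out as candidates for recoding, so the Date and numeric columns are never touched. This shows the idea only and is not taken from the package source.

```r
# Hypothetical data with mixed column types
dat <- data.frame(
  status = c("Y", "N"),
  onset  = as.Date(c("2020-01-01", "2020-01-02")),
  age    = c(31, 45),
  stringsAsFactors = FALSE
)
dat$outcome <- factor(c("N", "Y"))

# Flag columns that are character or factor; everything else is skipped
is_recodable <- vapply(
  dat,
  function(col) is.character(col) || is.factor(col),
  logical(1)
)

recodable_names <- names(dat)[is_recodable]
recodable_names  # "status" "outcome"
```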
Zhian N. Kamvar
matchmaker::match_df(), which this function wraps.
# Load the package providing clean_variable_spelling() and linelist_example()
# (assumed here to be linelist)
library(linelist)

# Read in dictionary and coded data examples --------------------

wordlist <- read.csv(linelist_example("spelling-dictionary.csv"),
                     stringsAsFactors = FALSE)
dat      <- read.csv(linelist_example("coded-data.csv"),
                     stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Clean spelling based on wordlist ------------------------------

wordlist  # show the wordlist
head(dat) # show the data

res1 <- clean_variable_spelling(dat,
                                wordlists = wordlist,
                                from = "options",
                                to = "values",
                                spelling_vars = "grp")
head(res1)

# You can ensure the order of the factors is correct by specifying
# a column that defines order.

# dat[] keeps dat a data frame while converting each column to a factor
dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- clean_variable_spelling(dat,
                                wordlists = wordlist,
                                from = "options",
                                to = "values",
                                spelling_vars = "grp",
                                sort_by = "orders")
head(res2)
as.list(head(res2))
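The effect of sort_by on factor columns can be pictured with base R alone. The sketch below uses a hypothetical wordlist section with an orders column and shows how recoded values might be turned into a factor whose levels follow that column rather than alphabetical order. It is an approximation of the described behaviour, not the package's implementation.

```r
# Hypothetical wordlist section for a single column, with an explicit order
wordlist <- data.frame(
  options = c("l", "m", "h"),
  values  = c("Low", "Medium", "High"),
  orders  = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Incoming coded values stored as a factor
raw <- factor(c("h", "l", "m", "l"))

# Recode each value by looking it up in the wordlist
recoded <- wordlist$values[match(as.character(raw), wordlist$options)]

# Order the levels by the `orders` column instead of alphabetically
level_order <- unique(wordlist$values[order(wordlist$orders)])
result <- factor(recoded, levels = level_order)

levels(result)           # "Low" "Medium" "High"
levels(factor(recoded))  # alphabetical default: "High" "Low" "Medium"
```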