This function cleans your data according to pre-defined rules encapsulated in either a data frame or a list of data frames. It is useful for correcting misspellings and recoding variables (e.g. from electronic survey data).
x | a data frame (the data set) containing the character or factor columns to clean
dictionary | a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the dictionary by column in x (see by).
from | a column name or position defining words or keys to be replaced
to | a column name or position defining replacement values
by | character or integer. If dictionary is a data frame, the name or position of the column that assigns each definition to a column in x.
order | a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.
warn | if TRUE, warnings from match_vec() are printed to the console; by default they are kept quiet.
By default, this applies the function match_vec() to all columns specified by the column names listed in by, or, if a global dictionary is used, this includes all character and factor columns as well.
by column
Spelling variables within dictionary represent keys that you want to match to column names in x (the data set). These are expected to match exactly, with the exception of two reserved keywords that start with a full stop:
.regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
.global: any column (see Section Global dictionary)
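To make the keys concrete, the base-R sketch below shows which columns of a small data set an exact name, a .regex key, and .global would each select under the rules described here. It is an illustration only, not matchmaker's implementation: the data set, its column names, and the helper columns_for_key() are hypothetical.

```r
# Hypothetical data set: column names are invented for illustration.
dat <- data.frame(
  lab_result_01 = c("pos", "neg"),
  lab_result_02 = c("neg", "neg"),
  region        = c("N", "S"),
  age           = c(31L, 45L),
  stringsAsFactors = FALSE
)

# Keys as they might appear in the `by` column of a dictionary.
keys <- c("region", ".regex ^lab_result_", ".global")

# Hypothetical helper (not part of matchmaker): resolve one key to the
# column names of `data` that it refers to.
columns_for_key <- function(key, data) {
  if (key == ".global") {
    # Global keys apply to the valid columns: character or factor only.
    is_valid <- vapply(
      data,
      function(col) is.character(col) || is.factor(col),
      logical(1)
    )
    return(names(data)[is_valid])
  }
  if (startsWith(key, ".regex ")) {
    # Everything after the keyword is an unquoted PERL-flavored pattern.
    pattern <- sub("^\\.regex ", "", key)
    return(grep(pattern, names(data), perl = TRUE, value = TRUE))
  }
  # Any other key must match a column name exactly.
  intersect(key, names(data))
}

selected <- lapply(setNames(keys, keys), columns_for_key, data = dat)
str(selected)
```

In this sketch the integer column age is never selected by .global because it is neither character nor factor.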
Global dictionary
A global dictionary is a set of definitions applied to all valid columns of x indiscriminately.
.global keyword in by: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the by column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global definitions. For example: if you have a column for cardinal directions and a definition for N = North, then the global variable N = no will not override that. See Examples.
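The precedence rule can be illustrated without matchmaker. The base-R sketch below is an assumption about the behaviour described above, not the package's code: for each column it stacks that column's own definitions on top of the .global ones and keeps the first definition found for each key, so N = North wins in the direction column while N = no still applies elsewhere. The data set, the column names direction and answer, and the helper recode_column() are hypothetical; the dictionary layout (options, values, grp) follows the Examples.

```r
# Dictionary with column-specific and global definitions for the key "N".
dict <- data.frame(
  options = c("N", "S", "N", "Y"),
  values  = c("North", "South", "no", "yes"),
  grp     = c("direction", "direction", ".global", ".global"),
  stringsAsFactors = FALSE
)

# Hypothetical data set.
dat <- data.frame(
  direction = c("N", "S", "N"),
  answer    = c("Y", "N", "N"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode one column, letting column-specific
# definitions take precedence over global ones.
recode_column <- function(x, column, dictionary) {
  specific <- dictionary[dictionary$grp == column, ]
  global   <- dictionary[dictionary$grp == ".global", ]
  defs     <- rbind(specific, global)
  defs     <- defs[!duplicated(defs$options), ]  # first (specific) entry wins
  hit      <- match(x, defs$options)
  ifelse(is.na(hit), x, defs$values[hit])        # unmatched values are kept
}

res <- dat
for (column in names(dat)) {
  res[[column]] <- recode_column(dat[[column]], column, dict)
}
res
```

Judging by the call shape used in the Examples, the same dictionary would be passed to the package as match_df(dat, dictionary = dict, from = "options", to = "values", by = "grp").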
by = NULL: If you want your data frame to be applied to all character/factor columns indiscriminately, then setting by = NULL will use that dictionary globally.
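As a rough picture of what a fully global dictionary means, the base-R sketch below applies a two-column dictionary to every character or factor column and leaves other columns untouched. It is not matchmaker's implementation; the data set and its column names are hypothetical, and for simplicity the sketch returns recoded factor columns as character vectors.

```r
# Two-column dictionary: no grouping column is needed when it is global.
dict <- data.frame(
  options = c("y", "n", "unk"),
  values  = c("yes", "no", "unknown"),
  stringsAsFactors = FALSE
)

# Hypothetical data set with a character, a factor, and an integer column.
dat <- data.frame(
  fever = c("y", "n", "unk"),
  cough = factor(c("n", "n", "y")),
  age   = c(12L, 40L, 7L),
  stringsAsFactors = FALSE
)

# Columns that count as valid targets: character or factor only.
is_valid <- vapply(
  dat,
  function(col) is.character(col) || is.factor(col),
  logical(1)
)

res <- dat
for (column in names(dat)[is_valid]) {
  old <- as.character(dat[[column]])
  hit <- match(old, dict$options)
  res[[column]] <- ifelse(is.na(hit), old, dict$values[hit])
}
res
```

With the package itself, the corresponding call would presumably be match_df(dat, dictionary = dict, from = "options", to = "values", by = NULL).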
a data frame with re-defined data based on the dictionary
Zhian N. Kamvar
Patrick Barks
match_vec(), which this function wraps.
library(matchmaker)

# Read in dictionary and coded data examples --------------------
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
                 stringsAsFactors = FALSE)
dat <- read.csv(matchmaker_example("coded-data.csv"),
                stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)
# Clean spelling based on dictionary -----------------------------
dict # show the dict
head(dat) # show the data
res1 <- match_df(dat,
                 dictionary = dict,
                 from = "options",
                 to = "values",
                 by = "grp")
head(res1)
# Show warnings/errors from each column --------------------------
# Internally, the `match_vec()` function can be quite noisy with warnings for
# various reasons. Thus, by default, the `match_df()` function will keep
# these quiet, but you can have them printed to your console if you use the
# warn = TRUE option:
res1 <- match_df(dat,
                 dictionary = dict,
                 from = "options",
                 to = "values",
                 by = "grp",
                 warn = TRUE)
head(res1)
# You can ensure the order of the factor levels is correct by specifying
# a column that defines order.
dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- match_df(dat,
                 dictionary = dict,
                 from = "options",
                 to = "values",
                 by = "grp",
                 order = "orders")
head(res2)
as.list(head(res2))
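The effect of an ordering column on factor levels can also be pictured in base R alone. The sketch below is an assumption about what sorting by such a column amounts to, not matchmaker's code: after recoding, the levels are taken from the replacement values arranged by the orders column instead of alphabetically. The severity data and dictionary are hypothetical.

```r
# Hypothetical dictionary with an ordering column, as in order = "orders".
dict <- data.frame(
  options = c("l", "m", "h"),
  values  = c("low", "medium", "high"),
  grp     = "severity",
  orders  = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Incoming factor: its levels are sorted alphabetically (h, l, m).
x <- factor(c("h", "l", "m", "l"))

# Recode the values, then build the levels from the ordering column.
recoded <- dict$values[match(as.character(x), dict$options)]
lvls    <- unique(dict$values[order(dict$orders)])
res     <- factor(recoded, levels = lvls)

levels(res)
```

Here the levels come out as low, medium, high, whereas a plain factor(recoded) would sort them alphabetically.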