match_df: Check and clean spelling or codes of multiple variables in a...
In matchmaker: Flexible Dictionary-Based Cleaning



This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).

match_df(
  x = data.frame(),
  dictionary = list(),
  from = 1,
  to = 2,
  by = 3,
  order = NULL,
  warn = FALSE
)

`x`	a data frame
`dictionary`	a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the dictionary by column in `x` (see `by`).
`from`	a column name or position defining words or keys to be replaced
`to`	a column name or position defining replacement values
`by`	character or integer. If `dictionary` is a data frame, then this column defines the columns in `x` corresponding to each section of the `dictionary` data frame. This defaults to `3`, indicating the third column is to be used.
`order`	a character string naming the dictionary column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the levels of the resulting factors will be sorted.
`warn`	if `TRUE`, warnings and errors from `match_vec()` will be shown as a single warning. Defaults to `FALSE`, which shows nothing.

By default, this applies the function match_vec() to every column of x whose name is listed in the by column of the dictionary. If a global dictionary is used, it is applied to all character and factor columns as well.

`by` column

Spelling variables within dictionary represent keys that you want to match to column names in x (the data set). These are expected to match exactly, with the exception of two reserved keywords that start with a full stop:

.regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
.global: any column (see Section Global dictionary)
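
As a minimal sketch of how the `.regex` key might be used, the following builds a small made-up data set and dictionary. The column names (`lab_result_01`, `lab_result_02`, `direction`) and all dictionary entries are assumptions for illustration and are not part of the package's example files.

```r
library(matchmaker)

# Hypothetical data: two lab result columns that share a prefix,
# plus one column that is matched by its exact name.
toy_dat <- data.frame(
  lab_result_01 = c("pos", "neg", "pos"),
  lab_result_02 = c("neg", "neg", "pos"),
  direction     = c("N", "S", "N"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary: the `grp` column plays the role of `by`.
# ".regex ^lab_result_" targets every column whose name starts with
# "lab_result_"; "direction" targets that single column.
toy_dict <- data.frame(
  options = c("pos", "neg", "N", "S"),
  values  = c("Positive", "Negative", "North", "South"),
  grp     = c(".regex ^lab_result_", ".regex ^lab_result_",
              "direction", "direction"),
  stringsAsFactors = FALSE
)

toy_res <- match_df(toy_dat,
  dictionary = toy_dict,
  from = "options",
  to = "values",
  by = "grp")
toy_res
```

Writing the pattern once in the `by` column avoids repeating the same definitions for every column that shares a naming scheme.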

Global dictionary

A global dictionary is a set of definitions applied to all valid columns of x indiscriminately.

.global keyword in by: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the by column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global definitions. For example: if you have a column for cardinal directions and a definition for N = North, then the global definition N = no will not override that. See Example.
by = NULL: If you want your data frame to be applied to all character/factor columns indiscriminately, then setting by = NULL will use that dictionary globally.
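
The two ways of supplying a global dictionary can be sketched as follows. The data set and dictionary entries here are invented for illustration (they are not the package's example files), and the sketch assumes the override behaviour described in the NOTE above.

```r
library(matchmaker)

# Hypothetical survey data with a Y/N coding in two columns and a
# cardinal direction column that also uses "N".
toy_dat <- data.frame(
  consent    = c("Y", "N", "Y"),
  vaccinated = c("N", "N", "Y"),
  direction  = c("N", "S", "N"),
  stringsAsFactors = FALSE
)

# 1. `.global` group mixed with a column-specific group.
#    The "direction" definitions take precedence over the global N = no.
toy_dict <- data.frame(
  options = c("Y", "N", "N", "S"),
  values  = c("yes", "no", "North", "South"),
  grp     = c(".global", ".global", "direction", "direction"),
  stringsAsFactors = FALSE
)

res_global <- match_df(toy_dat,
  dictionary = toy_dict,
  from = "options",
  to = "values",
  by = "grp")
res_global

# 2. `by = NULL`: the whole dictionary is treated as global.
#    Only the Y/N columns are passed in, because a global N = no
#    would otherwise be applied to `direction` as well.
yn_dict <- data.frame(
  options = c("Y", "N"),
  values  = c("yes", "no"),
  stringsAsFactors = FALSE
)

res_null <- match_df(toy_dat[c("consent", "vaccinated")],
  dictionary = yn_dict,
  from = "options",
  to = "values",
  by = NULL)
res_null
```

The `.global` group is the safer choice when some columns reuse a code with a different meaning, since column-specific definitions can then sit alongside the shared ones.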

Returns a data frame with values redefined according to the dictionary.

Zhian N. Kamvar

Patrick Barks

match_vec(), which this function wraps.

# Read in dictionary and coded data examples --------------------

dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE)
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Clean spelling based on dictionary -----------------------------

dict # show the dict
head(dat) # show the data

res1 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp")
head(res1)

# Show warnings/errors from each column --------------------------
# Internally, the `match_vec()` function can be quite noisy with warnings for
# various reasons. Thus, by default, the `match_df()` function will keep
# these quiet, but you can have them printed to your console if you use the
# warn = TRUE option:

res1 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp",
  warn = TRUE)
head(res1)


# You can ensure the order of the factor levels is correct by specifying
# a dictionary column that defines the order.

dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp",
  order = "orders")
head(res2)
as.list(head(res2))