Home

/

CRAN

/

matchmaker

/

In matchmaker: Flexible Dictionary-Based Cleaning

Special Keywords

In addition to strict one-to-one matching, there are a few reserved keywords that will help with correcting data. There are reserved keywords for the 'from', 'to', and 'by' columns in the dictionary.

Keys (`from` column)

The from column of the dictionary will contain the keys that you want to match in your current data set. These are expected to match exactly with the exception of three reserved keywords that start with a full stop:

.regex [pattern]: will replace anything matching [pattern]. This is executed before any other replacements are made. The [pattern] should be an unquoted, valid, PERL-flavored regular expression. Any whitespace padding the regular expression is discarded.
.missing: replaces any blank cells or NA values.
.default: replaces ALL values that are not defined in the dictionary and are not missing.

* Any NA values in the keys will be interpreted as "NA" because it's a common mistake to import the value "NA" to missing in R. If you intend for NA to indicate missing data, replace it with: dictionary[[1]][is.na(dictionary[[1]])] <- ".missing"

Values (`to` column)

The values will replace their respective keys exactly as they are presented with one exception. There is currently one recognised keyword that can be placed in the to column of your dictionary:

.na: Replace keys with missing data. When used in combination with the .missing keyword (in column 1), it can allow you to differentiate between explicit and implicit missing data.

Keyword demonstration

For example, let's say you have the following data set of people asked if they like ice cream:

who <- c("Anakin", "Darth", "R2-D2", "Leia", "C-3PO", "Rey", "Obi-Wan", "Luke", "Chewy", "Owen", "Lando")
icecream <- c(letters[1:3], "NO", "N", "yes", "Y", "n", "n", NA, "")
names(icecream) <- who
icecream

Missing data and default values

You could contstruct a dictionary that has a 1:1 relationship between the keys that looks like this:

my_dict1 <- data.frame(
  keys = c("yes", "Y", "n", "N", "NO", ".missing", ".default"),
  values = c("Yes", "Yes", "No", "No", "No", ".na", "(invalid)"),
  stringsAsFactors = FALSE
)
knitr::kable(my_dict1, caption = "my_dict1")

Once you read in the file (either via read.csv() or readxl::read_excel() if you use excel), you can use it as a dictionary. This dictionary will do three things:

convert iterations of yes/no into Yes and No
convert blank or NA values to explicit missing data.
convert all other values to "(invalid)"

match_vec(icecream, dictionary = my_dict1, from = "keys", to = "values")

Fuzzy matching

Now we have nice, predictable values, but let's say Luke really didn't like ice cream. If he responded "NOOOOOOO" instead of "n", then the dictionary we specified would convert it to "(invalid)":

icecream["Luke"] <- "NOOOOOOO"
match_vec(icecream, dictionary = my_dict1, from = "keys", to = "values")

We can fix this if we use pattern matching. Here we are selecting from any valid spelling of yes/no with trailing letters so that it capture's Luke's extreme objection to ice cream. To do this we add the .regex prefix (note the space after .regex):

my_dict2 <- data.frame(
  keys = c(".regex \\^[Yy][Ee]?[Ss]*$", ".regex \\^[Nn][Oo]*$", ".missing", ".default"),
  values = c("Yes", "No", ".na", "(invalid)"),
  stringsAsFactors = FALSE
)

knitr::kable(my_dict2, caption = "my_dict2", escape = TRUE)
my_dict2$keys <- c(".regex ^[Yy][Ee]?[Ss]*$", ".regex ^[Nn][Oo]*$", ".missing", ".default")

match_vec(icecream, dictionary = my_dict2, from = "keys", to = "values")

The drawback to fuzzy matching is that it will convert things that match the pattern, so be very careful when constructing your keys.

Any scripts or data that you put into this service are public.

matchmaker documentation built on Feb. 22, 2020, 1:11 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

matchmaker
Flexible Dictionary-Based Cleaning

In matchmaker: Flexible Dictionary-Based Cleaning

Special Keywords

Keys (`from` column)

Values (`to` column)

Keyword demonstration

Missing data and default values

Fuzzy matching

Try the matchmaker package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

matchmaker Flexible Dictionary-Based Cleaning

In matchmaker: Flexible Dictionary-Based Cleaning

Special Keywords

Keys (from column)

Values (to column)

Keyword demonstration

Missing data and default values

Fuzzy matching

Try the matchmaker package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

matchmaker
Flexible Dictionary-Based Cleaning

Keys (`from` column)

Values (`to` column)