The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner, built on the {forcats} package.
You can install {matchmaker} from CRAN:
install.packages("matchmaker")
The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

- `match_vec()` translates the values in a single vector
- `match_df()` translates values in all specified columns of a data frame

Each of these functions has four mandatory arguments:

- `x`: your data. This will be a vector or data frame depending on the function.
- `dictionary`: a data frame with at least two columns specifying the keys and values to modify
- `from`: a character or number specifying which column contains the keys
- `to`: a character or number specifying which column contains the values

Mostly, users will be working with `match_df()` to transform values across specific columns. A typical workflow would be to read in the data set and the dictionary, then clean the data with `match_df()`:
library("matchmaker")
# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)
# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
stringsAsFactors = FALSE
)
This is the top of our data set, generated for example purposes:

| id     | date       | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
| :----- | :--------- | :---------- | ------: | :------- | --------: | :------------ | :------------ | :------------ | :----------- | :------- |
| ef267c | 2019-07-08 | NA          |       0 | C        |        10 | unk           | high          | inc           | NA           | u        |
| e80a37 | 2019-07-07 | y           |       0 | 3        |        10 | inc           | unk           | norm          | y            | oui      |
| b72883 | 2019-07-07 | y           |       1 | 8        |        30 | inc           | norm          | inc           |              | oui      |
| c9ee86 | 2019-07-09 | n           |       1 | 4        |        40 | inc           | inc           | unk           | y            | oui      |
| 40bc7a | 2019-07-12 | n           |       1 | 6        |         0 | norm          | unk           | norm          | NA           | n        |
| 46566e | 2019-07-14 | y           |      NA | B        |        50 | unk           | unk           | inc           | NA           | NA       |
The dictionary looks like this:
| options  | values       | grp                 | orders |
| :------- | :----------- | :------------------ | -----: |
| y        | Yes          | readmission         |      1 |
| n        | No           | readmission         |      2 |
| u        | Unknown      | readmission         |      3 |
| .missing | Missing      | readmission         |      4 |
| 0        | Yes          | treated             |      1 |
| 1        | No           | treated             |      2 |
| .missing | Missing      | treated             |      3 |
| 1        | Facility 1   | facility            |      1 |
| 2        | Facility 2   | facility            |      2 |
| 3        | Facility 3   | facility            |      3 |
| 4        | Facility 4   | facility            |      4 |
| 5        | Facility 5   | facility            |      5 |
| 6        | Facility 6   | facility            |      6 |
| 7        | Facility 7   | facility            |      7 |
| 8        | Facility 8   | facility            |      8 |
| 9        | Facility 9   | facility            |      9 |
| 10       | Facility 10  | facility            |     10 |
| .default | Unknown      | facility            |     11 |
| 0        | 0-9          | age_group           |      1 |
| 10       | 10-19        | age_group           |      2 |
| 20       | 20-29        | age_group           |      3 |
| 30       | 30-39        | age_group           |      4 |
| 40       | 40-49        | age_group           |      5 |
| 50       | 50+          | age_group           |      6 |
| high     | High         | .regex ^lab_result_ |      1 |
| norm     | Normal       | .regex ^lab_result_ |      2 |
| inc      | Inconclusive | .regex ^lab_result_ |      3 |
| y        | yes          | .global             |    Inf |
| n        | no           | .global             |    Inf |
| u        | unknown      | .global             |    Inf |
| unk      | unknown      | .global             |    Inf |
| oui      | yes          | .global             |    Inf |
| .missing | missing      | .global             |    Inf |
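Some entries in the dictionary start with a dot and act as keywords rather than literal values. This README does not define them, so the following reading is an assumption based on the package documentation and on the output shown further down: `.missing` stands for missing values, `.default` catches any value not otherwise listed for that column, a `grp` entry of `.regex <pattern>` applies to every column whose name matches the pattern, and `.global` applies across all columns. The sketch below only separates the keyword entries from the ordinary column groups in the bundled example dictionary; the variable names `keyword_keys`, `keyword_groups`, and `column_groups` are hypothetical.

```r
library("matchmaker")

# Load the example dictionary that ships with the package
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

# Keys and groups that start with a dot are treated here as keywords
# (assumption: this mirrors how the package interprets them)
keyword_keys   <- unique(dict$options[startsWith(dict$options, ".")])
keyword_groups <- unique(dict$grp[startsWith(dict$grp, ".")])

# Groups without a leading dot each name a single column of the data
column_groups <- setdiff(unique(dict$grp), keyword_groups)

print(keyword_keys)
print(keyword_groups)
print(column_groups)
```

Listing the groups this way is a quick check that every column you expect to clean is either named directly in `grp` or covered by a keyword group.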
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp"
)
head(cleaned)
#> id date readmission treated facility age_group
#> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19
#> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19
#> 3 b72883 2019-07-07 Yes No Facility 8 30-39
#> 4 c9ee86 2019-07-09 No No Facility 4 40-49
#> 5 40bc7a 2019-07-12 No No Facility 6 0-9
#> 6 46566e 2019-07-14 Yes Missing Unknown 50+
#> lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1 unknown High Inconclusive missing unknown
#> 2 Inconclusive unknown Normal yes yes
#> 3 Inconclusive Normal Inconclusive missing yes
#> 4 Inconclusive Inconclusive unknown yes yes
#> 5 Normal unknown Normal missing no
#> 6 unknown unknown Inconclusive missing missing
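The example above uses `match_df()`. For a single vector, `match_vec()` takes the same `dictionary`, `from`, and `to` arguments. The following is a minimal sketch with a small made-up dictionary: the `answers` vector and the `yn_dict` data frame are hypothetical and not part of the package's example data. It assumes, as in the dictionary above, that the `.missing` key stands for missing values.

```r
library("matchmaker")

# Hypothetical dictionary: keys in `options`, replacements in `values`
yn_dict <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values  = c("Yes", "No", "Unknown", "Missing"),
  stringsAsFactors = FALSE
)

# Hypothetical coded answers, including one missing value
answers <- c("y", "n", "u", NA, "y")

# Translate each key in `answers` to its value in the dictionary
cleaned_answers <- match_vec(answers,
  dictionary = yn_dict,
  from = "options",
  to = "values"
)

print(cleaned_answers)
```

Because `match_vec()` works on one vector at a time, no `by` column is needed; the whole dictionary applies to that vector.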