matchmaker R package

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

Each of these functions has four mandatory options:

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)

Data

This is the top of our data set, generated for example purposes:

| id     | date       | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
| :----- | :--------- | :---------- | ------: | :------- | --------: | :------------ | :------------ | :------------ | :----------- | :------- |
| ef267c | 2019-07-08 | NA          |       0 | C        |        10 | unk           | high          | inc           | NA           | u        |
| e80a37 | 2019-07-07 | y           |       0 | 3        |        10 | inc           | unk           | norm          | y            | oui      |
| b72883 | 2019-07-07 | y           |       1 | 8        |        30 | inc           | norm          | inc           |              | oui      |
| c9ee86 | 2019-07-09 | n           |       1 | 4        |        40 | inc           | inc           | unk           | y            | oui      |
| 40bc7a | 2019-07-12 | n           |       1 | 6        |         0 | norm          | unk           | norm          | NA           | n        |
| 46566e | 2019-07-14 | y           |      NA | B        |        50 | unk           | unk           | inc           | NA           | NA       |

Dictionary

The dictionary looks like this:

| options  | values       | grp                 | orders |
| :------- | :----------- | :------------------ | -----: |
| y        | Yes          | readmission         |      1 |
| n        | No           | readmission         |      2 |
| u        | Unknown      | readmission         |      3 |
| .missing | Missing      | readmission         |      4 |
| 0        | Yes          | treated             |      1 |
| 1        | No           | treated             |      2 |
| .missing | Missing      | treated             |      3 |
| 1        | Facility 1   | facility            |      1 |
| 2        | Facility 2   | facility            |      2 |
| 3        | Facility 3   | facility            |      3 |
| 4        | Facility 4   | facility            |      4 |
| 5        | Facility 5   | facility            |      5 |
| 6        | Facility 6   | facility            |      6 |
| 7        | Facility 7   | facility            |      7 |
| 8        | Facility 8   | facility            |      8 |
| 9        | Facility 9   | facility            |      9 |
| 10       | Facility 10  | facility            |     10 |
| .default | Unknown      | facility            |     11 |
| 0        | 0-9          | age_group           |      1 |
| 10       | 10-19        | age_group           |      2 |
| 20       | 20-29        | age_group           |      3 |
| 30       | 30-39        | age_group           |      4 |
| 40       | 40-49        | age_group           |      5 |
| 50       | 50+          | age_group           |      6 |
| high     | High         | .regex ^lab_result_ |      1 |
| norm     | Normal       | .regex ^lab_result_ |      2 |
| inc      | Inconclusive | .regex ^lab_result_ |      3 |
| y        | yes          | .global             |    Inf |
| n        | no           | .global             |    Inf |
| u        | unknown      | .global             |    Inf |
| unk      | unknown      | .global             |    Inf |
| oui      | yes          | .global             |    Inf |
| .missing | missing      | .global             |    Inf |
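
The grp column decides which columns of the data each dictionary row applies to. As a rough illustration of how those entries can be read, the base-R sketch below resolves a grp value to column names: a plain name targets that one column, a `.regex ` prefix targets every column matching the pattern, and `.global` targets all columns. This is not matchmaker's internal code; `resolve_grp()` is a hypothetical helper, and treating `.global` as "every column" is an assumption (the package may restrict it further, for example to character or factor columns).

```r
# Column names of the example data set
cols <- c(
  "id", "date", "readmission", "treated", "facility", "age_group",
  "lab_result_01", "lab_result_02", "lab_result_03",
  "has_symptoms", "followup"
)

# Hypothetical helper: map one grp entry to the columns it would target
resolve_grp <- function(grp, cols) {
  if (grp == ".global") {
    return(cols) # assumption: a global rule applies to every column
  }
  if (startsWith(grp, ".regex ")) {
    # Strip the ".regex " keyword and use the rest as a pattern
    pattern <- sub("^\\.regex\\s+", "", grp)
    return(grep(pattern, cols, value = TRUE))
  }
  intersect(grp, cols) # plain name: exact match only
}

resolve_grp("facility", cols)
resolve_grp(".regex ^lab_result_", cols)
```

Under these assumptions the first call returns only `facility`, while the second returns the three `lab_result_` columns, which is why a single set of dictionary rows can cover all of them.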

Matching

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
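
To see where the readmission column in this output comes from, the lookup can be mimicked in base R with `match()`. This is only a sketch of the idea, using the dictionary rows whose grp is readmission; it is not how matchmaker is implemented, and mapping `NA` to the `.missing` key is an inference from the output above.

```r
# The six readmission values from the example data
readmission <- c(NA, "y", "y", "n", "n", "y")

# Dictionary rows whose grp is "readmission"
keys   <- c("y", "n", "u", ".missing")
values <- c("Yes", "No", "Unknown", "Missing")

# Treat NA as the special ".missing" key, then look each key up
lookup  <- ifelse(is.na(readmission), ".missing", readmission)
recoded <- values[match(lookup, keys)]
recoded
#> [1] "Missing" "Yes"     "Yes"     "No"      "No"      "Yes"
```

The result agrees with the readmission column printed by `head(cleaned)`.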


matchmaker documentation built on Feb. 22, 2020, 1:11 a.m.