pattern_join | R Documentation |
pattern_join and similarity_join join two data.frame objects based on regex patterns or similarities to a reference, respectively. The first data.frame contains a dirty column (i.e. "real-world" data originating from a free text field) that needs grouping, categorizing, or classifying based on its content. The second data.frame maps its rows to the dirty column of the first. It achieves this using the unique patterns (pattern_join) or references (similarity_join) given in one of its own columns (specified by parameter by, see below). common_join is a "common trunk" used by both pattern_join and similarity_join, and it can be used to create custom join functions (provided a custom matcher function is given, see below).
common_join(
x,
y,
by,
nomatch_cutoff = 0.2,
x_split_cutoff = 500,
multicore = TRUE,
matcher = NULL
)
pattern_join(
x,
y,
by,
nomatch_cutoff = 0.2,
x_split_cutoff = 500,
multicore = TRUE
)
similarity_join(
x,
y,
by,
nomatch_cutoff = 0.4,
x_split_cutoff = 500,
multicore = TRUE
)
x |
the first |
y |
the second |
by |
character of length 1, specifying either names of corresponding field names in a
named (e.g. |
nomatch_cutoff |
used by |
x_split_cutoff |
integer specifying number of rows above which |
multicore |
logical specifying if multiple cores should be used or not; it defaults to |
matcher |
to create a custom join function using |
a tibble of merged x and y, based on similarities found in the columns specified by argument by.
pattern_join is similar to regex_join
# pattern_join 'airplanes' with 'model_type' by columns 'model' and 'pattern'
airplanes_model_type <- pattern_join(airplanes, model_type, c("model" = "pattern"), multicore = FALSE)
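
To illustrate the idea behind a pattern-based join, here is a minimal base R sketch. It is not the package's implementation: the toy data frames `x_demo` and `y_demo` are hypothetical, and picking the first matching pattern is an assumption made only for this illustration.

```r
# Hypothetical toy data (not the package's 'airplanes' / 'model_type' datasets)
x_demo <- data.frame(
  model = c("Boeing 737-800", "A320-214", "B747", "Cessna 172"),
  stringsAsFactors = FALSE
)
y_demo <- data.frame(
  pattern = c("737", "A320", "747"),
  type = c("narrow-body", "narrow-body", "wide-body"),
  stringsAsFactors = FALSE
)

# For each dirty value, find the index of the first regex pattern that matches
# (NA if no pattern matches). Taking the first hit is an assumption of this sketch.
match_idx <- vapply(x_demo$model, function(value) {
  hits <- which(vapply(y_demo$pattern, function(p) grepl(p, value),
                       logical(1), USE.NAMES = FALSE))
  if (length(hits) > 0) hits[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

# Attach the matching rows of y_demo; unmatched rows of x_demo get NA
demo_result <- cbind(x_demo, y_demo[match_idx, , drop = FALSE])
rownames(demo_result) <- NULL
print(demo_result)
```

In this sketch "Cessna 172" matches no pattern, so its `pattern` and `type` fields are NA.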
# test data for similarity_join
dirty <- data.frame(sample = 1:6, description = c("Bergerx", "Mueler", "Horsst", "Kinga", "Mannn", "Schneemann"))
reference <- data.frame(reference = c("Berger", "Mueller", "Horst", "King", "Mann", "Mustermann"))
# similarity_join with default nomatch_cutoff
library(magrittr)  # provides %>% (only needed if the pipe is not already attached)
dirty %>% similarity_join(reference, by = c("description" = "reference"), multicore = FALSE)
# to avoid mapping "Schneemann" to "Mustermann", increase nomatch_cutoff (default 0.4) to at least 0.51
dirty %>% similarity_join(reference, by = c("description" = "reference"), nomatch_cutoff = 0.51, multicore = FALSE)
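
The effect of nomatch_cutoff can be sketched in base R with a normalized Levenshtein similarity. This is an illustration under stated assumptions, not the package's actual metric: the helper `best_match_demo`, the use of `utils::adist`, the normalization by the longer string, and the `>=` comparison are all assumptions of this sketch.

```r
dirty_values <- c("Bergerx", "Mueler", "Horsst", "Kinga", "Mannn", "Schneemann")
reference_values <- c("Berger", "Mueller", "Horst", "King", "Mann", "Mustermann")

# Hypothetical helper: returns the most similar reference per value,
# or NA when the best similarity falls below 'cutoff'.
best_match_demo <- function(x, ref, cutoff) {
  d <- utils::adist(x, ref)                     # Levenshtein distance matrix (case-sensitive)
  max_len <- outer(nchar(x), nchar(ref), pmax)  # length of the longer string in each pair
  sim <- 1 - d / max_len                        # similarity in [0, 1]
  best <- apply(sim, 1, which.max)              # column index of the best reference per row
  best_sim <- sim[cbind(seq_along(x), best)]
  ifelse(best_sim >= cutoff, ref[best], NA_character_)
}

best_match_demo(dirty_values, reference_values, cutoff = 0.4)
best_match_demo(dirty_values, reference_values, cutoff = 0.51)
```

Under this particular measure, "Schneemann" and "Mustermann" differ by 5 edits over 10 characters, giving a similarity of 0.5. The pair is therefore kept at a cutoff of 0.4 and dropped at 0.51, which is consistent with the example above, although the package may compute similarity differently.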