View source: R/stringdist_join.R
stringdist_join | R Documentation |
Join two tables based on fuzzy string matching of their columns. This is useful, for example, in matching free-form inputs in a survey or online form, where it can catch misspellings and small personal changes.
stringdist_join(
x,
y,
by = NULL,
max_dist = 2,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
mode = "inner",
ignore_case = FALSE,
distance_col = NULL,
...
)
stringdist_inner_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_left_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_right_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_full_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_semi_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_anti_join(x, y, by = NULL, distance_col = NULL, ...)
x | A tbl |
y | A tbl |
by | Columns by which to join the two tables |
max_dist | Maximum distance to use for joining |
method | Method for computing string distance, see |
mode | One of "inner", "left", "right", "full", "semi", or "anti" |
ignore_case | Whether to be case insensitive (default FALSE, i.e. case sensitive) |
distance_col | If given, will add a column with this name containing the string distance between the two matched values |
... | Arguments passed on to |
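The arguments above can be combined in a single call. The following is a minimal sketch, not taken from the package documentation: the tables `responses` and `catalog` and the column name `dist` are invented for illustration. It assumes that `ignore_case = TRUE` makes the comparison case insensitive and that a left join leaves the distance column missing for unmatched rows.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy tables, invented for this illustration
responses <- tibble(answer = c("apple", "Aple", "banana", "cherry"))
catalog <- tibble(fruit = c("apple", "banana"))

# Keep every response, attach catalog entries within one edit,
# compare without regard to letter case, and store the computed
# distance in a column named "dist"
matched <- responses %>%
  stringdist_left_join(
    catalog,
    by = c(answer = "fruit"),
    max_dist = 1,
    ignore_case = TRUE,
    distance_col = "dist"
  )

matched
```

Lowering `max_dist` makes matching stricter, and swapping `stringdist_left_join` for `stringdist_anti_join` would instead return only the responses with no catalog entry within the threshold.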
If method = "soundex", the max_dist is automatically set to 0.5, since soundex returns either a 0 (match) or a 1 (no match).
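The binary behaviour of soundex can be checked directly. This short sketch assumes the distances are computed with the `stringdist()` function from the stringdist package, which the method names suggest but which this page does not state explicitly. The example names are arbitrary.

```r
library(stringdist)

# Soundex distance is binary: 0 when both strings share a soundex
# code, 1 when the codes differ
same_code <- stringdist("Robert", "Rupert", method = "soundex")
different_code <- stringdist("Robert", "Rubin", method = "soundex")

same_code
different_code
```

Because only the values 0 and 1 can occur, any threshold strictly between them, such as 0.5, keeps exactly the pairs whose soundex codes agree.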
library(dplyr)
library(ggplot2)
library(fuzzyjoin)
data(diamonds)
# lookup table of misspelled cut names (tibble() replaces the deprecated data_frame())
d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)
# no matches when they are inner-joined:
diamonds %>%
  inner_join(d, by = c(cut = "approximate_name"))
# but we can match when they're fuzzy joined
diamonds %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"))
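To see why each misspelling matches, it can help to join against the distinct cut values only and record the distance. The sketch below is an illustration under assumptions rather than part of the original example: the names `cuts`, `matches`, and the column name `distance` are chosen here, and the cut factor is converted to character to keep the comparison straightforward.

```r
library(dplyr)
library(ggplot2)
library(fuzzyjoin)

data(diamonds)

# Same misspelled names as in the example above
d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

# One row per distinct cut, stored as character
cuts <- diamonds %>%
  distinct(cut) %>%
  mutate(cut = as.character(cut))

# Fuzzy join with the default method and max_dist,
# keeping the computed distance for inspection
matches <- cuts %>%
  stringdist_inner_join(d, by = c(cut = "approximate_name"),
                        distance_col = "distance")

matches
```

Joining against five distinct values instead of the full table also avoids repeating the same string comparisons for every diamond.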