hmatch_settle | R Documentation |
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, sequentially over each hierarchical level. Specifically, implements hmatch at each successive hierarchical level, starting with only the first level (lowest resolution), then the first and second, then the first, second, and third, etc.
After the initial matching over all levels, users can optionally use a resolve join to 'settle' for the highest match possible for each row of raw data, even if that match is below the highest-resolution level specified.
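The sequence of levels described above can be sketched in base R. This is only an illustration of which column sets are considered at each step, not the package's internal code; the column names adm0, adm1, and adm2 are hypothetical.

```r
# Hypothetical hierarchical column names, ordered from lowest to highest resolution
by <- c("adm0", "adm1", "adm2")

# Column sets considered at each successive step:
# first level only, then first and second, then first, second, and third
level_sets <- lapply(seq_along(by), function(i) by[seq_len(i)])

for (cols in level_sets) {
  cat("match on:", paste(cols, collapse = " / "), "\n")
}
```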
hmatch_settle(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
raw | data frame containing hierarchical columns with raw data |
ref | data frame containing hierarchical columns with reference data |
pattern | regex pattern to match the hierarchical columns in raw. Note: hierarchical column names can be matched using either the pattern or by arguments. |
pattern_ref | regex pattern to match the hierarchical columns in ref. Defaults to pattern. |
by | vector giving the names of the hierarchical columns in raw |
by_ref | vector giving the names of the hierarchical columns in ref. Defaults to by. |
type | type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps | logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE. |
fuzzy | logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE. |
fuzzy_method | if fuzzy = TRUE, the string-distance method to use. Defaults to "osa". |
fuzzy_dist | if fuzzy = TRUE, the maximum string distance at which two values are still treated as a match. Defaults to 1L. |
dict | optional dictionary for recoding values within the hierarchical columns of raw |
ref_prefix | prefix to add to names of returned columns from ref. Defaults to "ref_". |
std_fn | function to standardize strings during matching. Defaults to string_std. |
... | additional arguments passed to std_fn |
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
In a resolve type join with hmatch_settle, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is non-conflicting among all matches (or no match if there is a conflict at the very first level). E.g.
raw:
1. | United States | <NA> | Jefferson |
Relevant rows from ref:
1. | United States | <NA> | <NA> |
2. | United States | New York | Jefferson |
3. | United States | Pennsylvania | Jefferson |
In a regular join, the single row from raw (above) will match all three rows from ref. However, in a resolve join the multiple matches will be resolved to the first row from ref, because only the first hierarchical level ("United States") is non-conflicting among all possible matches.
Note that there's a distinction between "common" values at a given hierarchical level (i.e. a single unique value in each row) and "non-conflicting" values (i.e. a single unique value or a missing value). E.g.
raw:
1. | United States | New York | New York |
Relevant rows from ref:
1. | United States | <NA> | <NA> |
2. | United States | New York | <NA> |
3. | United States | New York | New York |
In the example above, only the 1st hierarchical level ("United States") is "common" to all matches, but all hierarchical levels are "non-conflicting" (i.e. because row 2 is a hierarchical child of row 1, and row 3 a child of row 2), and so a resolve-type match will be made to the 3rd row in ref.
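The distinction between "common" and "non-conflicting" levels can be checked with a short base R sketch applied to the candidate ref rows from the two examples above. The helpers is_common, is_non_conflicting, and resolve_level, and the column names adm0 to adm2, are hypothetical and written for illustration only; they are not part of the hmatch API and may differ from how the package implements resolve joins.

```r
# Hypothetical helpers for illustration only; not part of the hmatch API.

# A level is "common" if every matching row holds the same non-missing value.
is_common <- function(x) !anyNA(x) && length(unique(x)) == 1L

# A level is "non-conflicting" if its non-missing values are all identical
# (missing values are ignored; at least one non-missing value is required).
is_non_conflicting <- function(x) length(unique(x[!is.na(x)])) == 1L

# Deepest level reachable from the first level without hitting a conflict.
# Returns 0 if there is a conflict at the very first level (i.e. no match).
resolve_level <- function(matches) {
  ok <- vapply(matches, is_non_conflicting, logical(1))
  if (all(ok)) length(ok) else which(!ok)[1] - 1L
}

# Candidate rows of ref from the first example (levels conflict at adm1)
m1 <- data.frame(
  adm0 = c("United States", "United States", "United States"),
  adm1 = c(NA, "New York", "Pennsylvania"),
  adm2 = c(NA, "Jefferson", "Jefferson"),
  stringsAsFactors = FALSE
)

# Candidate rows of ref from the second example (each row a child of the previous)
m2 <- data.frame(
  adm0 = c("United States", "United States", "United States"),
  adm1 = c(NA, "New York", "New York"),
  adm2 = c(NA, NA, "New York"),
  stringsAsFactors = FALSE
)

level_m1 <- resolve_level(m1)                    # first example
level_m2 <- resolve_level(m2)                    # second example
common_m2 <- vapply(m2, is_common, logical(1))   # which levels of m2 are "common"
```

Under these assumptions, level_m1 is 1 (only "United States" is non-conflicting) and level_m2 is 3, while common_m2 is TRUE only for the first level, mirroring the two worked examples.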
data(ne_raw)
data(ne_ref)
# return matches at all levels
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "inner")
# use a resolve join to settle for the best possible match for each row
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner")