hmatch | R Documentation |
Match sets of hierarchical values (e.g. province, county, township) in a raw, messy dataset to corresponding values within a reference dataset, optionally accounting for discrepancies between the datasets such as:
variation in character case, use of accents, or spelling
variation in hierarchical resolution (e.g. some entries specified to municipality but others only to region)
missing values at one or more hierarchical levels
hmatch(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in raw. Note: hierarchical column names can be matched using either the pattern or by arguments. |
pattern_ref |
regex pattern to match the hierarchical columns in ref. Defaults to pattern. |
by |
vector giving the names of the hierarchical columns in raw |
by_ref |
vector giving the names of the hierarchical columns in ref. Defaults to by. |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE. |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE. |
fuzzy_method |
if fuzzy = TRUE, the string distance method to use. Defaults to "osa". |
fuzzy_dist |
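if fuzzy = TRUE, the maximum string distance at which two values are still treated as a match. Defaults to 1L. |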
dict |
optional dictionary for recoding values within the hierarchical columns of raw. Defaults to NULL. |
ref_prefix |
prefix to add to names of returned columns from ref. Defaults to "ref_". |
std_fn |
function to standardize strings during matching. Defaults to string_std. |
... |
additional arguments passed to std_fn |
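As a rough illustration of the std_fn argument, the sketch below defines a custom standardization function in base R. The name std_simple and its exact rules (trim, lowercase, collapse non-alphanumeric runs to an underscore) are hypothetical and are not the behaviour of string_std; the hmatch() call is left as a comment because it assumes the package is installed.

```r
# Hypothetical standardization function (NOT the package's string_std):
# trims whitespace, lowercases, and collapses runs of characters other
# than a-z and 0-9 into a single underscore. NA values stay NA.
std_simple <- function(x) {
  x <- trimws(x)
  x <- tolower(x)
  gsub("[^a-z0-9]+", "_", x)
}

std_simple(c("  New  York ", "PENNSYLVANIA", NA))

# Assumed usage, if hmatch is installed and the data are loaded:
# hmatch(ne_raw, ne_ref, pattern = "adm", std_fn = std_simple)
```

A function like this only makes sense if it is applied identically to both datasets, which is why it is passed as a single argument rather than applied to raw beforehand.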
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
In hmatch, if argument type corresponds to a resolve join, rows of raw with multiple matches to ref are always resolved to 'no match'. This is because hmatch does not accept matches below the highest non-missing level within a given row of raw. E.g.
raw:
1. | United States | <NA> | Jefferson |
Relevant rows from ref:
1. | United States | New York | Jefferson |
2. | United States | Pennsylvania | Jefferson |
In a regular join with hmatch, the single row from raw (above) will match both rows of ref. However, in a resolve join the multiple conflicting matches (i.e. conflicting values at the 2nd hierarchical level) will result in the row from raw being treated as non-matching to ref.
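The Jefferson case above can be reproduced with two small hand-built data frames. This is a minimal sketch: the column names adm0, adm1 and adm2 are assumptions chosen so that pattern = "adm" picks them up, and the printed output is not shown here because it depends on the installed version of the package.

```r
library(hmatch)

# One raw row with a missing value at the 2nd hierarchical level
raw <- data.frame(
  adm0 = "United States",
  adm1 = NA_character_,
  adm2 = "Jefferson",
  stringsAsFactors = FALSE
)

# Two reference rows that differ only at the 2nd level
ref <- data.frame(
  adm0 = c("United States", "United States"),
  adm1 = c("New York", "Pennsylvania"),
  adm2 = c("Jefferson", "Jefferson"),
  stringsAsFactors = FALSE
)

# Regular join: the raw row is expected to match both rows of ref
res_left <- hmatch(raw, ref, pattern = "adm", type = "left")

# Resolve join: the conflicting matches are expected to collapse to 'no match'
res_resolve <- hmatch(raw, ref, pattern = "adm", type = "resolve_left")

res_left
res_resolve
```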
data(ne_raw)
data(ne_ref)
hmatch(ne_raw, ne_ref, pattern = "adm", type = "inner")
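Two further calls that follow from the arguments documented above, offered as a sketch rather than as part of the package's own examples: an anti join to list the rows of ne_raw that find no match, and an inner join with fuzzy matching enabled. The value fuzzy_dist = 2L is an arbitrary choice for illustration.

```r
library(hmatch)

data(ne_raw)
data(ne_ref)

# Rows of ne_raw with no match in ne_ref when fuzzy matching is off
unmatched <- hmatch(ne_raw, ne_ref, pattern = "adm", type = "anti")

# Same data with fuzzy matching on, tolerating a string distance of up to 2
fuzzy_matched <- hmatch(
  ne_raw,
  ne_ref,
  pattern = "adm",
  type = "inner",
  fuzzy = TRUE,
  fuzzy_dist = 2L
)

unmatched
fuzzy_matched
```

Inspecting the anti join first gives a sense of how many rows might be recoverable before loosening the match criteria.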