View source: R/hmatch_permute.R
hmatch_permute | R Documentation |
Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using sequential permutation of the hierarchical columns to allow for values entered at the wrong hierarchical level.
The function calls hmatch on each possible permutation of the hierarchical columns, and then combines the results. Rows of raw yielding multiple matches to ref can optionally be resolved using a resolve-type join (see section Resolve joins below).
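To make the permutation idea concrete, here is a minimal base-R sketch that enumerates every ordering of three hierarchical columns and rearranges one raw row under each ordering. The helper permutations() and the column names adm0, adm1, adm2 are assumptions made for illustration; this is not the package's internal code, which may generate and combine permutations differently.

```r
# Hypothetical helper (not part of hmatch): all orderings of a vector.
permutations <- function(x) {
  if (length(x) <= 1L) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    # Fix element i in first position, then permute the remainder
    for (rest in permutations(x[-i])) {
      out[[length(out) + 1L]] <- c(x[i], rest)
    }
  }
  out
}

# Illustrative column names (assumed)
cols <- c("adm0", "adm1", "adm2")
perms <- permutations(cols)

# One raw row with a value entered at the wrong level
raw_row <- c(adm0 = "United States", adm1 = NA, adm2 = "New York")

# Rearrange the row's values under each ordering, keeping the original
# column labels, so that each candidate row could be matched against ref
candidates <- unique(do.call(rbind, lapply(perms, function(p) {
  setNames(raw_row[p], cols)
})))

length(perms)     # number of orderings tried
nrow(candidates)  # number of distinct rearranged rows
```

Because the number of orderings grows factorially with the number of hierarchical columns, an approach like this is practical mainly for hierarchies with only a few levels.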
hmatch_permute(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
raw |
data frame containing hierarchical columns with raw data |
ref |
data frame containing hierarchical columns with reference data |
pattern |
regex pattern to match the hierarchical columns in raw. Note: hierarchical column names can be matched using either the pattern or by arguments. |
pattern_ref |
regex pattern to match the hierarchical columns in ref. Defaults to pattern. |
by |
vector giving the names of the hierarchical columns in raw |
by_ref |
vector giving the names of the hierarchical columns in ref. Defaults to by. |
type |
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. |
allow_gaps |
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE. |
fuzzy |
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE. |
fuzzy_method |
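if fuzzy = TRUE, the string distance method to use. Defaults to "osa". |
fuzzy_dist |
if fuzzy = TRUE, the maximum string distance at which two values are still treated as a match. Defaults to 1L. |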
dict |
optional dictionary for recoding values within the hierarchical columns of raw |
ref_prefix |
prefix to add to names of returned columns from ref. Defaults to "ref_". |
std_fn |
function to standardize strings during matching. Defaults to string_std. |
... |
additional arguments passed to std_fn |
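As a rough illustration of how a distance threshold such as fuzzy_dist separates matches from non-matches, the sketch below uses utils::adist() from base R. Note the substitution: adist() computes generalized Levenshtein distance, not the "osa" (optimal string alignment) method named in the default above, so a transposition of adjacent characters costs 2 here rather than 1. The input strings and the variable max_dist are invented for the example.

```r
# Base-R stand-in for a string distance (Levenshtein, NOT "osa")
raw_vals <- c("New Yrk", "Nwe York")
ref_val  <- "New York"

# Distance from each raw value to the reference value
d <- as.vector(utils::adist(raw_vals, ref_val))

# Hypothetical threshold playing the role of fuzzy_dist
max_dist <- 1L
is_match <- d <= max_dist

data.frame(raw = raw_vals, distance = d, match = is_match)
```

Under this stand-in metric, "New Yrk" (one missing letter) falls within the threshold while "Nwe York" (two swapped letters) does not; a metric that counts a transposition as a single edit would treat the second case more leniently.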
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
Resolve joins
In hmatch_permute, if argument type corresponds to a resolve join, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is common among all matches (or no match if there is a conflict at the very first level). E.g.
raw:
1. | United States | <NA> | New York |
Relevant rows from ref:
1. | United States | New York | <NA> |
2. | United States | New York | New York |
In a regular join with hmatch_permute, the single row from raw (above) will match both of the depicted rows from ref. However, in a resolve join the two matches will resolve to the first row from ref, because it reflects the highest hierarchical level that is common to all matches.
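The resolution rule described above can be sketched in a few lines of base R. The function resolve_common() and the column names adm0, adm1, adm2 are hypothetical, introduced only to illustrate the rule; the package's own implementation is not shown in this page and may differ.

```r
# Hypothetical helper: collapse several matched ref rows to the deepest
# level on which they all agree. Returns NULL if they conflict at level 1.
resolve_common <- function(matches) {
  resolved <- rep(NA_character_, ncol(matches))
  names(resolved) <- names(matches)
  for (j in seq_along(matches)) {
    vals <- unique(matches[[j]])
    # Stop at the first level where the matches disagree or are all missing
    if (length(vals) != 1L || is.na(vals)) break
    resolved[j] <- vals
  }
  if (is.na(resolved[1])) NULL else resolved
}

# The two ref rows from the example above
matches <- data.frame(
  adm0 = c("United States", "United States"),
  adm1 = c("New York", "New York"),
  adm2 = c(NA, "New York"),
  stringsAsFactors = FALSE
)
resolved <- resolve_common(matches)

# A conflict at the very first level yields no match
conflict <- data.frame(
  adm0 = c("Canada", "United States"),
  adm1 = c("Ontario", "New York"),
  stringsAsFactors = FALSE
)
no_match <- resolve_common(conflict)
```

Here resolved holds United States, New York, and a missing third level, which corresponds to the first of the two ref rows, while no_match is NULL.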
data(ne_raw)
data(ne_ref)
hmatch_permute(ne_raw, ne_ref, pattern = "^adm", type = "inner")