View source: R/hmatch_parents.R
hmatch_parents | R Documentation |
Match a hierarchical column (e.g. region, province, or county) within a raw, potentially messy dataset against a corresponding column within a reference dataset, by searching for similar sets of 'offspring' (i.e. values at the next hierarchical level).
For example, if the raw dataset uses admin1 level "NY" whereas the reference dataset uses "New York", fuzzy-matching alone is unlikely to match these values automatically. However, we might nonetheless be able to match "NY" to "New York" if they share a common and unique set of 'offspring' (i.e. admin2 values) across both datasets (e.g. "Kings", "Queens", "New York", "Suffolk", "Bronx", etc.).
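The idea of matching parents through shared offspring can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy data frames and the object names shared, counts, and parent_matches are hypothetical, and the threshold simply mirrors the role of the min_matches argument.

```r
# Toy data (hypothetical): the raw data abbreviates adm1, the reference does not
raw <- data.frame(
  adm1 = c("NY", "NY", "NY", "NJ", "NJ"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  adm1 = c("New York", "New York", "New York", "New Jersey", "New Jersey"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)

# Pair up raw and reference rows that share a child (adm2) value
shared <- merge(unique(raw), unique(ref), by = "adm2",
                suffixes = c("_raw", "_ref"))

# Count the shared children for each candidate (raw parent, ref parent) pair
counts <- aggregate(adm2 ~ adm1_raw + adm1_ref, data = shared, FUN = length)
names(counts)[names(counts) == "adm2"] <- "n_child_match"

# Keep only pairs with at least `min_matches` shared children
min_matches <- 2L
parent_matches <- counts[counts$n_child_match >= min_matches, ]
print(parent_matches)
```

Here "NY" pairs with "New York" and "NJ" with "New Jersey" because each pair shares at least two children, even though the parent names themselves are not similar strings.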
Unlike other hmatch functions, the data frame returned by hmatch_parents only includes unique hierarchical combinations and only relevant hierarchical levels (i.e. the parent level and above), along with additional columns giving the number of matching children and total number of children for a given parent.
hmatch_parents(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  level,
  min_matches = 1L,
  type = "left",
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
raw | data frame containing hierarchical columns with raw data
ref | data frame containing hierarchical columns with reference data
pattern | regex pattern to match the hierarchical columns in raw. Note: hierarchical column names can be matched using either the pattern or by arguments.
pattern_ref | regex pattern to match the hierarchical columns in ref. Defaults to pattern.
by | vector giving the names of the hierarchical columns in raw
by_ref | vector giving the names of the hierarchical columns in ref. Defaults to by.
level | name or integer index of the hierarchical level to match at (i.e. the 'parent' level). If a name, must correspond to a hierarchical column within raw.
min_matches | minimum number of matching offspring required for parents to be considered a match. Defaults to 1L.
type | type of join ("left", "inner" or "anti"). Defaults to "left".
fuzzy | logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
fuzzy_method | if fuzzy = TRUE, the string-distance method to use. Defaults to "osa".
fuzzy_dist | if fuzzy = TRUE, the maximum string distance at which two values are still considered a match. Defaults to 1L.
ref_prefix | prefix to add to names of returned columns from ref. Defaults to "ref_".
std_fn | function to standardize strings during matching. Defaults to string_std.
... | additional arguments passed to std_fn
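To make the meaning of a distance threshold such as fuzzy_dist = 1L concrete, the sketch below uses base R's utils::adist(), which computes Levenshtein edit distance. This is a stand-in for illustration only and not what hmatch_parents does internally: assuming "osa" refers to optimal string alignment distance, that method also counts a swap of two adjacent characters as a single edit, so the two measures can differ. The child names and the objects d, max_dist, and matched are hypothetical.

```r
# Hypothetical child names with small spelling differences
raw_children <- c("Quens", "Kings", "Bronxx", "Sufolk County")
ref_children <- c("Queens", "Kings", "Bronx", "Suffolk")

# Pairwise Levenshtein distances (rows = raw values, columns = reference values)
d <- utils::adist(raw_children, ref_children)
dimnames(d) <- list(raw_children, ref_children)

# A raw child counts as matched if any reference child is within max_dist edits
max_dist <- 1L
matched <- apply(d <= max_dist, 1, any)
print(matched)
```

With a threshold of 1, "Quens" and "Bronxx" still match their reference spellings, whereas "Sufolk County" does not, because it is several edits away from "Suffolk".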
a data frame obtained by matching the hierarchical columns in raw and ref (at the parent level and above), using the join type specified by argument type (see join_types for more details). Note that unlike other hmatch_ functions, hmatch_parents returns only unique rows and relevant hierarchical columns (i.e. the parent level and above), along with additional columns describing the number of matching children and total number of children for a given parent.
... | hierarchical columns from raw
... | hierarchical columns from ref
n_child_raw | total number of unique children belonging to the parent within raw
n_child_ref | total number of unique children belonging to the parent within ref
n_child_match | number of children in raw with a match in ref
library(hmatch) # provides hmatch_parents() and the ne_ref example dataset

# e.g. match abbreviated adm1 names to full names based on common offspring
raw <- ne_ref
raw$adm1[raw$adm1 == "Ontario"] <- "ON"
raw$adm1[raw$adm1 == "New York"] <- "NY"
raw$adm1[raw$adm1 == "New Jersey"] <- "NJ"
raw$adm1[raw$adm1 == "Pennsylvania"] <- "PA"

# require at least 2 shared children before two parents are considered a match
hmatch_parents(
  raw,
  ne_ref,
  pattern = "adm",
  level = "adm1",
  min_matches = 2,
  type = "left"
)