View source: R/hmatch_tokens.R
hmatch_tokens | R Documentation
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, using tokenization to help match multi-term values that might otherwise be difficult to match (e.g. "New York City" vs. "New York").
Includes options for ignoring matches from frequently-occurring tokens (e.g. "North", "South", "City"), small tokens (e.g. "El", "San", "New"), or any other set of tokens specified by the user.
Usage

hmatch_tokens(
raw,
ref,
pattern,
pattern_ref = pattern,
by,
by_ref = by,
type = "left",
allow_gaps = TRUE,
always_tokenize = FALSE,
token_split = "_",
token_min = 1,
exclude_freq = 3,
exclude_nchar = 3,
exclude_values = NULL,
fuzzy = FALSE,
fuzzy_method = "osa",
fuzzy_dist = 1L,
dict = NULL,
ref_prefix = "ref_",
std_fn = string_std,
...
)
Arguments

raw: data frame containing hierarchical columns with raw data

ref: data frame containing hierarchical columns with reference data

pattern: regex pattern to match the hierarchical columns in raw. Note: hierarchical columns can be specified with either the pattern or by arguments, or neither, in which case all columns common to raw and ref are assumed to be hierarchical.

pattern_ref: regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in ref.

by: vector giving the names of the hierarchical columns in raw

by_ref: vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in ref.
type: type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.

allow_gaps: logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

always_tokenize: logical indicating whether to tokenize all values prior to matching (TRUE), or to tokenize only the values that do not match in an initial non-tokenized pass (FALSE). Defaults to FALSE.
token_split: regex pattern used to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn, whose default replaces whitespace and punctuation with "_". Defaults to "_".

token_min: minimum number of tokens that must match for a term to be considered matching overall. Defaults to 1.

exclude_freq: exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens. Defaults to 3.

exclude_nchar: exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.
exclude_values: character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn prior to comparison.

fuzzy: logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method: if fuzzy = TRUE, the method to use for string distance calculation. Defaults to "osa".

fuzzy_dist: if fuzzy = TRUE, the maximum string distance used to classify matches. Defaults to 1L.

dict: optional dictionary for recoding values within the hierarchical columns of raw

ref_prefix: prefix to add to the names of returned columns from ref, used when ref contains columns with the same names as raw. Defaults to "ref_".

std_fn: function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization.

...: additional arguments passed to std_fn
Value

A data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details).
Details

Uses the same approach to resolve joins as hmatch.
Examples

data(ne_raw)
data(ne_ref)
# add tokens to some values within ref to illustrate tokenized matching
ne_ref$adm0[ne_ref$adm0 == "United States"] <- "United States of America"
ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"
hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 1)
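The calls below sketch further uses of the exclusion and fuzzy-matching arguments documented above; they are illustrative variations on the example call, with output not shown.

```r
# exclude the token "state" explicitly, in addition to the default
# frequency-based (exclude_freq) and length-based (exclude_nchar) exclusions
hmatch_tokens(ne_raw, ne_ref, type = "inner", exclude_values = "state")

# require at least 2 matching tokens per term, and allow fuzzy
# token matching up to a string distance of 1
hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 2,
              fuzzy = TRUE, fuzzy_dist = 1L)
```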