View source: R/hmatch_tokens.R
hmatch_tokens | R Documentation
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, using tokenization to help match multi-term values that might otherwise be difficult to match (e.g. "New York City" vs. "New York").
Includes options for ignoring matches from frequently-occurring tokens (e.g. "North", "South", "City"), small tokens (e.g. "El", "San", "New"), or any other set of tokens specified by the user.
Usage

hmatch_tokens(
raw,
ref,
pattern,
pattern_ref = pattern,
by,
by_ref = by,
type = "left",
allow_gaps = TRUE,
always_tokenize = FALSE,
token_split = "_",
token_min = 1,
exclude_freq = 3,
exclude_nchar = 3,
exclude_values = NULL,
fuzzy = FALSE,
fuzzy_method = "osa",
fuzzy_dist = 1L,
dict = NULL,
ref_prefix = "ref_",
std_fn = string_std,
...
)
Arguments

raw: data frame containing hierarchical columns with raw data

ref: data frame containing hierarchical columns with reference data

pattern: regex pattern to match the hierarchical columns in raw. Note: hierarchical columns can be specified with either the pattern or by arguments, or neither, in which case all columns common to raw and ref are assumed to be hierarchical.

pattern_ref: regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in ref.

by: vector giving the names of the hierarchical columns in raw

by_ref: vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in ref.
type: type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.

allow_gaps: logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

always_tokenize: logical indicating whether to tokenize all values prior to matching (TRUE), or to tokenize only the values that do not match in an initial non-tokenized pass (FALSE). Defaults to FALSE.
token_split: regex pattern used to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn, whose default replaces whitespace and punctuation with "_". Defaults to "_".

token_min: minimum number of tokens that must match for a term to be considered matching overall. Defaults to 1.

exclude_freq: exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens. Defaults to 3.

exclude_nchar: exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.
exclude_values: character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn prior to comparison.

fuzzy: logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method: if fuzzy = TRUE, the method to use for string distance calculation. Defaults to "osa".

fuzzy_dist: if fuzzy = TRUE, the maximum string distance used to classify matches. Defaults to 1L.

dict: optional dictionary for recoding values within the hierarchical columns of raw

ref_prefix: prefix to add to the names of returned columns from ref, used when ref contains columns with the same names as raw. Defaults to "ref_".

std_fn: function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization.

...: additional arguments passed to std_fn
Value

A data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details).
Details

Uses the same approach to resolve joins as hmatch.
Examples

data(ne_raw)
data(ne_ref)
# add tokens to some values within ref to illustrate tokenized matching
ne_ref$adm0[ne_ref$adm0 == "United States"] <- "United States of America"
ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"
hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 1)
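The calls below sketch further uses of the exclusion and fuzzy-matching arguments documented above; they are illustrative variations on the example call, with output not shown.

```r
# exclude the token "state" explicitly, in addition to the default
# frequency-based (exclude_freq) and length-based (exclude_nchar) exclusions
hmatch_tokens(ne_raw, ne_ref, type = "inner", exclude_values = "state")

# require at least 2 matching tokens per term, and allow fuzzy
# token matching up to a string distance of 1
hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 2,
              fuzzy = TRUE, fuzzy_dist = 1L)
```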