hmatch_settle: Sequential hierarchical matching at each hierarchical level,...
In epicentre-msf/hmatch: Tools for Cleaning and Matching Hierarchically-Structured Data

hmatch_settle

R Documentation

Sequential hierarchical matching at each hierarchical level, settling for the highest resolution match that is possible for each row

Description

Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, sequentially over each hierarchical level. Specifically, implements hmatch at each successive hierarchical level, starting with only the first level (lowest resolution), then the first and second, then the first, second, and third, and so on.

After the initial matching over all levels, users can optionally use a resolve join to 'settle' for the highest-resolution match possible for each row of raw data, even if that match is below the highest-resolution level specified.
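The sequential idea can be sketched in base R. This is a simplified illustration, not the package's implementation: it uses exact matching only, ignores `allow_gaps`, `dict`, and string standardization, and all data values and helper names (`lvl`, `key_k`, `settled`) are hypothetical.

```r
lvl <- c("adm0", "adm1", "adm2")  # hypothetical hierarchical columns

# Hypothetical reference data: one row per hierarchical level
ref <- data.frame(
  adm0 = c("Canada", "Canada", "Canada"),
  adm1 = c(NA, "Ontario", "Ontario"),
  adm2 = c(NA, NA, "Toronto"),
  stringsAsFactors = FALSE
)

# Hypothetical raw data; "Otawa" has no counterpart in ref
raw <- data.frame(
  adm0 = c("Canada", "Canada", "Mexico"),
  adm1 = c("Ontario", "Ontario", NA),
  adm2 = c("Toronto", "Otawa", NA),
  stringsAsFactors = FALSE
)

# Build one key per row from the first k levels
key_k <- function(df, k) {
  do.call(paste, c(unname(as.list(df[lvl[seq_len(k)]])), sep = "|"))
}

# Depth of each ref row (assumes ref has no gaps between levels)
ref_depth <- rowSums(!is.na(ref[lvl]))

# Match on level 1, then levels 1-2, then levels 1-3,
# remembering the deepest level at which each raw row matched
settled <- integer(nrow(raw))
for (k in seq_along(lvl)) {
  ref_k <- ref[ref_depth == k, , drop = FALSE]
  complete_k <- stats::complete.cases(raw[lvl[seq_len(k)]])
  hit <- complete_k & key_k(raw, k) %in% key_k(ref_k, k)
  settled[hit] <- k
}

settled
```

In this sketch the first raw row matches at all three levels, the second settles at the second level because its third-level value has no match, and the third has no match at any level.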

Usage

hmatch_settle(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

`raw`	data frame containing hierarchical columns with raw data
`ref`	data frame containing hierarchical columns with reference data
`pattern`	regex pattern to match the hierarchical columns in `raw`. Note: hierarchical column names can be matched using either the `pattern` or `by` arguments. If neither `pattern` nor `by` is specified, the hierarchical columns are assumed to be all column names that are common to both `raw` and `ref`. See specifying_columns.
`pattern_ref`	regex pattern to match the hierarchical columns in `ref`. Defaults to `pattern`, so it only needs to be specified if the hierarchical columns have different names in `raw` and `ref`.
`by`	vector giving the names of the hierarchical columns in `raw`
`by_ref`	vector giving the names of the hierarchical columns in `ref`. Defaults to `by`, so it only needs to be specified if the hierarchical columns have different names in `raw` and `ref`.
`type`	type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.
`allow_gaps`	logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of `raw`. Defaults to `TRUE`.
`fuzzy`	logical indicating whether to use fuzzy-matching (based on the `stringdist` package). Defaults to `FALSE`.
`fuzzy_method`	if `fuzzy = TRUE`, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
`fuzzy_dist`	if `fuzzy = TRUE`, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to `fuzzy_dist` will be considered matching). Defaults to `1L`.
`dict`	optional dictionary for recoding values within the hierarchical columns of `raw` (see dictionary_recoding)
`ref_prefix`	prefix to add to names of returned columns from `ref` if they are otherwise identical to names within `raw`. Defaults to "ref_".
`std_fn`	function to standardize strings during matching. Defaults to `string_std`. Set to `NULL` to omit standardization. See also string_standardization.
`...`	additional arguments passed to `std_fn()`
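Two of the arguments above can be illustrated with base R alone. The sketch below shows how a regex such as "^adm" would pick out hierarchical columns, and how a distance threshold like `fuzzy_dist` separates matches from non-matches. The column names and strings are hypothetical, and `utils::adist()` (generalized Levenshtein distance) is used only as a stand-in: the package relies on `stringdist`, whose default "osa" method also counts adjacent transpositions.

```r
# Hypothetical column names of a raw dataset
raw_cols <- c("id", "adm0", "adm1", "adm2", "date")

# Columns that a pattern such as "^adm" would select
hier_cols <- grep("^adm", raw_cols, value = TRUE)
hier_cols

# Edit distance between a misspelled value and two candidate values
# (stand-in for the stringdist-based distance used when fuzzy = TRUE)
d <- utils::adist("Jeferson", c("Jefferson", "Jackson"))

# With a threshold of 1 (the default fuzzy_dist), only distances <= 1 count as matches
fuzzy_dist <- 1L
is_match <- as.vector(d <= fuzzy_dist)
is_match
```

Here only "Jefferson" is within one edit of "Jeferson", so only it would be treated as a fuzzy match under the default threshold.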

Value

a data frame obtained by matching the hierarchical columns in `raw` and `ref`, using the join type specified by argument `type` (see join_types for more details)

Resolve joins

In a resolve type join with hmatch_settle, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is non-conflicting among all matches (or no match if there is a conflict at the very first level). E.g.

raw:
1. | United States | <NA> | Jefferson |

In a regular join, the single row from raw (above) will match all three rows from ref. However, in a resolve join the multiple matches will be resolved to the first row from ref, because only the first hierarchical level ("United States") is non-conflicting among all possible matches.
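The resolution rule can be sketched in base R. The three candidate rows below are hypothetical, chosen only to be consistent with the description above (a conflict at the second level); the column names and the helper logic are likewise assumptions, not the package's internal code.

```r
# Hypothetical matches returned by a regular join for the raw row
# | United States | <NA> | Jefferson |  (values are an assumption)
matches <- data.frame(
  adm0 = c("United States", "United States", "United States"),
  adm1 = c(NA, "New York", "Pennsylvania"),
  adm2 = c(NA, "Jefferson", "Jefferson"),
  stringsAsFactors = FALSE
)

# A level is non-conflicting if it holds at most one distinct non-missing value
non_conflicting <- vapply(
  matches,
  function(x) length(unique(x[!is.na(x)])) <= 1L,
  logical(1)
)

# Walk down the hierarchy and stop at the first conflicting level
resolved_level <- sum(cumprod(non_conflicting))

# Depth of each candidate row = position of its deepest non-missing level
row_depth <- apply(!is.na(matches), 1, function(z) max(which(z)))

# Keep the candidate whose depth equals the resolved level
resolved <- matches[row_depth == resolved_level, , drop = FALSE]
resolved
```

With these rows the second level conflicts ("New York" vs. "Pennsylvania"), so the sketch resolves to the first-level row. If the first level itself conflicted, `resolved_level` would be 0 and no row would be kept.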

Note that there's a distinction between "common" values at a given hierarchical level (i.e. a single unique value in each row) and "non-conflicting" values (i.e. a single unique value or a missing value). E.g.

raw:
1. | United States | New York | New York |

Relevant rows from ref:
1. | United States | <NA>     | <NA>     |
2. | United States | New York | <NA>     |
3. | United States | New York | New York |

In the example above, only the 1st hierarchical level ("United States") is "common" to all matches, but all hierarchical levels are "non-conflicting" (i.e. because row 2 is a hierarchical child of row 1, and row 3 a child of row 2), and so a resolve-type match will be made to the 3rd row in ref.
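The difference between "common" and "non-conflicting" can be checked directly on the three ref rows above. This is a minimal base R sketch; the column names and the two predicate definitions are assumptions written to mirror the wording of this section.

```r
# The three relevant ref rows from the example above
ref_matches <- data.frame(
  adm0 = c("United States", "United States", "United States"),
  adm1 = c(NA, "New York", "New York"),
  adm2 = c(NA, NA, "New York"),
  stringsAsFactors = FALSE
)

# "Common": a single unique value present in every row (no missing values)
is_common <- vapply(
  ref_matches,
  function(x) !anyNA(x) && length(unique(x)) == 1L,
  logical(1)
)

# "Non-conflicting": a single unique value once missing values are ignored
is_non_conflicting <- vapply(
  ref_matches,
  function(x) length(unique(x[!is.na(x)])) <= 1L,
  logical(1)
)

rbind(is_common, is_non_conflicting)
```

Only the first level is common, while all three levels are non-conflicting, which is why a resolve-type join can settle on the deepest row.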

Examples

data(ne_raw)
data(ne_ref)

# return matches at all levels
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "inner")

# use a resolve join to settle for the best possible match for each row
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner")

epicentre-msf/hmatch documentation built on Nov. 15, 2023, 1:47 a.m.
