In epicentre-msf/hmatch: Tools for Cleaning and Matching Hierarchically-Structured Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
options(digits = 4, width = 160)

hmatch: Tools for cleaning and matching hierarchically-structured data

An R package for cleaning and matching messy hierarchically-structured data (e.g. country / region / district / municipality). The general goal is to match sets of hierarchical values in a raw dataset to corresponding values within a reference dataset, while accounting for potential discrepancies such as:

variation in character case, punctuation, spacing, use of accents, or spelling
variation in hierarchical resolution (e.g. some entries specified to municipality-level but others only to region)
missing values at one or more hierarchical levels
values entered at the wrong hierarchical level

Installation

Install from GitHub with:

# install.packages("remotes")
remotes::install_github("epicentre-msf/hmatch")

Matching strategies

Low-level

hmatch: match hierarchical sequences up to the highest-resolution level specified within a given row of raw data, optionally allowing for missing values below the match level, and fuzzy matches (using the stringdist package)

Higher-level

hmatch_tokens: match tokens rather than entire strings to allow for variation in multi-term names
hmatch_permute: sequentially permute hierarchical columns to allow for values entered at the wrong level
hmatch_parents: match values at a given hierarchical level based on shared sets of 'offspring'
hmatch_settle: try matching at every level and settle for the highest-resolution match possible
hmatch_manual: match using a user-supplied dictionary
hmatch_split: implement any other hmatch_ function separately at each hierarchical level, only on unique sequences
hmatch_composite: implement a variety of matching strategies in sequence, from most to least strict

String standardization

Independent of optional fuzzy matching with stringdist, hmatch functions use behind-the-scenes string standardization to help account for variation in character case, punctuation, spacing, or use of accents between the raw and reference data. E.g.

              raw_value       reference_value  match
----------------------------------------------------
original:     ILE DE  FRANCE  Île-de-France    FALSE
standardized: ile_de_france   ile_de_france    TRUE

Users can choose default standardization (illustrated above), no standardization, or supply their own preferred function to standardize strings (e.g. tolower).

Usage

Example dataset

The hmatch package contains example datasets ne_raw (messy geographical data) and ne_ref (reference data derived from a shapefile), based on a small subset of northeastern North America.

library(hmatch)

head(ne_raw) # raw messy data

head(ne_ref) # reference data derived from shapefile

Example workflow

Basic hierarchical matching with `hmatch()`

We'll start with a simple call to hmatch to see which rows can be matched with no extra magic.

hmatch(ne_raw, ne_ref, pattern = "^adm")

There are still quite a few unmatched rows, and entry 'PID14' actually matches two different rows within ref, so we'll press on. We can separate the matched and unmatched rows using inner- and anti-joins respectively, specifically using the "resolve_" join type here to only consider matches that are unique.

(raw_match1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner"))

(raw_remain1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_anti"))

Fuzzy matching

Next we'll add in fuzzy-matching, using the default maximum string-distance of 1.

hmatch(raw_remain1, ne_ref, pattern = "^adm", fuzzy = TRUE, type = "inner")

Only one additional unique match, so we'll again split and move on. Note that we've been using the pattern argument above to specify the hierarchical columns in raw and ref, but because the hierarchical columns have the same names in raw and ref (and are the only matching column names), we can drop the pattern argument for brevity.

(raw_match2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_inner"))

(raw_remain2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_anti"))

Tokenized matching

Next let's try hmatch_tokens, which matches based on components of strings (i.e. tokens) rather than entire strings.

(raw_match3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_inner"))

(raw_remain3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_anti"))

Permutation matching

If there are any values entered at the wrong hierarchical level, we can try systematically permuting the hierarchical columns before matching.

(raw_match4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_inner"))

(raw_remain4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_anti"))

The toughest cases

For the remaining rows that we haven't yet matched, there a few options. We could use hmatch_settle() to settle for matches below the highest-resolution level specified within a given row of raw. We could also do some 'manual' comparison of the raw and reference datasets and create a dictionary to recode values within raw to match corresponding entries in ref. Here we'll do both.

ne_dict <- data.frame(
  value = "NJ",
  replacement = "New Jersey",
  variable = "adm1"
)

(raw_match5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
                             fuzzy = TRUE, type = "resolve_inner"))

(raw_remain5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
                              fuzzy = TRUE, type = "resolve_anti"))

epicentre-msf/hmatch documentation built on Nov. 15, 2023, 1:47 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

epicentre-msf/hmatch
Tools for Cleaning and Matching Hierarchically-Structured Data

In epicentre-msf/hmatch: Tools for Cleaning and Matching Hierarchically-Structured Data

hmatch: Tools for cleaning and matching hierarchically-structured data

Installation

Matching strategies

Low-level

Higher-level

String standardization

Usage

Example dataset

Example workflow

Basic hierarchical matching with `hmatch()`

Fuzzy matching

Tokenized matching

Permutation matching

The toughest cases

R Package Documentation

Browse R Packages

We want your feedback!

epicentre-msf/hmatch Tools for Cleaning and Matching Hierarchically-Structured Data

In epicentre-msf/hmatch: Tools for Cleaning and Matching Hierarchically-Structured Data

hmatch: Tools for cleaning and matching hierarchically-structured data

Installation

Matching strategies

Low-level

Higher-level

String standardization

Usage

Example dataset

Example workflow

Basic hierarchical matching with hmatch()

Fuzzy matching

Tokenized matching

Permutation matching

The toughest cases

R Package Documentation

Browse R Packages

We want your feedback!

epicentre-msf/hmatch
Tools for Cleaning and Matching Hierarchically-Structured Data

Basic hierarchical matching with `hmatch()`