knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
options(digits = 4, width = 160)

hmatch: Tools for cleaning and matching hierarchically-structured data

Lifecycle: maturing R-CMD-check Codecov test coverage

An R package for cleaning and matching messy hierarchically-structured data (e.g. country / region / district / municipality). The general goal is to match sets of hierarchical values in a raw dataset to corresponding values within a reference dataset, while accounting for potential discrepancies such as:

Installation

Install from GitHub with:

# install.packages("remotes")
remotes::install_github("epicentre-msf/hmatch")

Matching strategies

Low-level
Higher-level

String standardization

Independent of optional fuzzy matching with stringdist, hmatch functions use behind-the-scenes string standardization to help account for variation in character case, punctuation, spacing, or use of accents between the raw and reference data. E.g.

              raw_value       reference_value  match
----------------------------------------------------
original:     ILE DE  FRANCE  Île-de-France    FALSE
standardized: ile_de_france   ile_de_france    TRUE

Users can choose default standardization (illustrated above), no standardization, or supply their own preferred function to standardize strings (e.g. tolower).

Usage

Example dataset

The hmatch package contains example datasets ne_raw (messy geographical data) and ne_ref (reference data derived from a shapefile), based on a small subset of northeastern North America.

library(hmatch)

head(ne_raw) # raw messy data

head(ne_ref) # reference data derived from shapefile

Example workflow

Basic hierarchical matching with hmatch()

We'll start with a simple call to hmatch to see which rows can be matched with no extra magic.

hmatch(ne_raw, ne_ref, pattern = "^adm")

There are still quite a few unmatched rows, and entry 'PID14' actually matches two different rows within ref, so we'll press on. We can separate the matched and unmatched rows using inner- and anti-joins respectively, specifically using the "resolve_" join type here to only consider matches that are unique.

(raw_match1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner"))

(raw_remain1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_anti"))
Fuzzy matching

Next we'll add in fuzzy-matching, using the default maximum string-distance of 1.

hmatch(raw_remain1, ne_ref, pattern = "^adm", fuzzy = TRUE, type = "inner")

Only one additional unique match, so we'll again split and move on. Note that we've been using the pattern argument above to specify the hierarchical columns in raw and ref, but because the hierarchical columns have the same names in raw and ref (and are the only matching column names), we can drop the pattern argument for brevity.

(raw_match2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_inner"))

(raw_remain2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_anti"))
Tokenized matching

Next let's try hmatch_tokens, which matches based on components of strings (i.e. tokens) rather than entire strings.

(raw_match3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_inner"))

(raw_remain3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_anti"))
Permutation matching

If there are any values entered at the wrong hierarchical level, we can try systematically permuting the hierarchical columns before matching.

(raw_match4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_inner"))

(raw_remain4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_anti"))
The toughest cases

For the remaining rows that we haven't yet matched, there a few options. We could use hmatch_settle() to settle for matches below the highest-resolution level specified within a given row of raw. We could also do some 'manual' comparison of the raw and reference datasets and create a dictionary to recode values within raw to match corresponding entries in ref. Here we'll do both.

ne_dict <- data.frame(
  value = "NJ",
  replacement = "New Jersey",
  variable = "adm1"
)

(raw_match5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
                             fuzzy = TRUE, type = "resolve_inner"))

(raw_remain5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
                              fuzzy = TRUE, type = "resolve_anti"))


epicentre-msf/hmatch documentation built on Nov. 15, 2023, 1:47 a.m.