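An R package for cleaning and matching messy hierarchically structured data (e.g. country / region / district / municipality). The general goal is to match sets of hierarchical values in a raw dataset to corresponding values within a reference dataset, while accounting for potential discrepancies such as missing values at some levels, values entered at the wrong level, variation in multi-term names, and differences in character case, punctuation, spacing, or use of accents.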
Install from GitHub with:
# install.packages("remotes")
remotes::install_github("epicentre-msf/hmatch")
hmatch: match hierarchical sequences up to the highest-resolution level specified within a given row of raw data, optionally allowing for missing values below the match level, and fuzzy matches (using the stringdist package)
hmatch_tokens: match tokens rather than entire strings to allow for variation in multi-term names
hmatch_permute: sequentially permute hierarchical columns to allow for values entered at the wrong level
hmatch_parents: match values at a given hierarchical level based on shared sets of 'offspring'
hmatch_settle: try matching at every level and settle for the highest-resolution match possible
hmatch_manual: match using a user-supplied dictionary
hmatch_split: implement any other hmatch_ function separately at each hierarchical level, only on unique sequences
hmatch_composite: implement a variety of matching strategies in sequence, from most to least strict

Independent of optional fuzzy matching with stringdist, hmatch functions use behind-the-scenes string standardization to help account for variation in character case, punctuation, spacing, or use of accents between the raw and reference data. E.g.
              raw_value       reference_value   match
-----------------------------------------------------
original:     ILE DE FRANCE   Île-de-France     FALSE
standardized: ile_de_france   ile_de_france     TRUE
Users can choose default standardization (illustrated above), no standardization, or supply their own preferred function to standardize strings (e.g. tolower).
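To make the idea of string standardization concrete, here is a minimal base R sketch that reproduces the "Île-de-France" example above. The function name standardize_sketch and its small accent table are hypothetical and are not hmatch's internal implementation; they only illustrate the kind of transformation involved.

```r
# Hypothetical helper (NOT hmatch's internal code): replace a few accented
# characters, lower-case, and collapse runs of non-alphanumerics to "_".
standardize_sketch <- function(x) {
  accented <- c("\u00ce", "\u00ee", "\u00c9", "\u00e9")  # Î, î, É, é
  plain    <- c("I", "i", "E", "e")
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)   # spaces, hyphens, punctuation -> "_"
  gsub("^_+|_+$", "", x)            # trim leading/trailing underscores
}

raw_value <- "ILE DE FRANCE"
ref_value <- "\u00cele-de-France"   # "Île-de-France"

raw_value == ref_value                                          # FALSE
standardize_sketch(raw_value) == standardize_sketch(ref_value)  # TRUE
```

A real standardization function would need a far more complete treatment of accents than this four-character table.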
The hmatch package contains example datasets ne_raw (messy geographical data) and ne_ref (reference data derived from a shapefile), based on a small subset of northeastern North America.
library(hmatch)

head(ne_raw) # raw messy data
head(ne_ref) # reference data derived from shapefile
hmatch()
We'll start with a simple call to hmatch to see which rows can be matched with no extra magic.
hmatch(ne_raw, ne_ref, pattern = "^adm")
There are still quite a few unmatched rows, and entry 'PID14' actually matches two different rows within ref, so we'll press on. We can separate the matched and unmatched rows using inner- and anti-joins respectively, specifically using the "resolve_" join types here to only consider matches that are unique.
(raw_match1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner"))
(raw_remain1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_anti"))
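For readers unfamiliar with the join terminology, the difference between an inner join (keep matched rows) and an anti join (keep unmatched rows) can be sketched in base R on toy data. The data frames toy_raw and toy_ref below are made up for illustration, and the sketch does not attempt to reproduce the extra uniqueness handling of the "resolve_" join types.

```r
# Toy data, invented for illustration only
toy_raw <- data.frame(
  id = c("A", "B", "C"),
  adm1 = c("Ontario", "Quebec", "Narnia"),
  stringsAsFactors = FALSE
)
toy_ref <- data.frame(
  adm1 = c("Ontario", "Quebec"),
  code = c("on", "qc"),
  stringsAsFactors = FALSE
)

# Inner join: rows of toy_raw that have a match in toy_ref
matched <- merge(toy_raw, toy_ref, by = "adm1")

# Anti join: rows of toy_raw that have no match in toy_ref
unmatched <- toy_raw[!toy_raw$adm1 %in% toy_ref$adm1, , drop = FALSE]
```

Splitting the data this way at each step means that every later, looser strategy only has to deal with the rows that stricter strategies could not match.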
Next we'll add in fuzzy-matching, using the default maximum string-distance of 1.
hmatch(raw_remain1, ne_ref, pattern = "^adm", fuzzy = TRUE, type = "inner")
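As a rough illustration of what a maximum string distance of 1 means, the sketch below uses base R's utils::adist() (generalized Levenshtein distance) as a stand-in. hmatch itself relies on the stringdist package, whose distance method may differ, and the toy values here are invented.

```r
# Toy values, invented for illustration only
raw_values <- c("Onterio", "Quebec", "Nw York", "Pensylvnia")
ref_values <- c("Ontario", "Quebec", "New York", "Pennsylvania")

# Pairwise edit distances (rows = raw values, columns = reference values)
d <- utils::adist(tolower(raw_values), tolower(ref_values))

max_dist <- 1                         # maximum distance accepted as a match
best <- apply(d, 1, which.min)        # closest reference value for each raw value
best_dist <- d[cbind(seq_along(raw_values), best)]

# Accept the closest reference value only if it is within max_dist
fuzzy_match <- ifelse(best_dist <= max_dist, ref_values[best], NA_character_)
fuzzy_match
```

Here "Pensylvnia" is two edits away from "Pennsylvania", so it stays unmatched under a maximum distance of 1.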
Only one additional unique match, so we'll again split and move on. Note that we've been using the pattern argument above to specify the hierarchical columns in raw and ref, but because the hierarchical columns have the same names in raw and ref (and are the only matching column names), we can drop the pattern argument for brevity.
(raw_match2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_inner"))
(raw_remain2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_anti"))
Next let's try hmatch_tokens, which matches based on components of strings (i.e. tokens) rather than entire strings.
(raw_match3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_inner"))
(raw_remain3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_anti"))
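The idea behind token-based matching can be sketched in a few lines of base R. The helpers tokens_sketch and shares_token, and the minimum token length of 3, are assumptions made for this illustration. They are not the rules hmatch_tokens actually applies.

```r
# Hypothetical helpers (NOT hmatch_tokens internals)

# Split a string into lower-case tokens, dropping very short ones like "of"
tokens_sketch <- function(x, min_nchar = 3) {
  tk <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  tk[nchar(tk) >= min_nchar]
}

# Two strings "match" if they share at least one token
shares_token <- function(a, b) {
  length(intersect(tokens_sketch(a), tokens_sketch(b))) > 0
}

shares_token("City of Toronto", "Toronto")        # TRUE: shared token "toronto"
shares_token("Isle of Wight", "City of Toronto")  # FALSE: "of" is too short to count
```

Token matching is deliberately permissive, which is presumably why it comes after exact and fuzzy matching in the workflow above.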
If there are any values entered at the wrong hierarchical level, we can try systematically permuting the hierarchical columns before matching.
(raw_match4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_inner"))
(raw_remain4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_anti"))
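To illustrate what permuting hierarchical columns means, the base R sketch below tries every ordering of the columns in a single toy row and checks which ordering lines up with a reference row. The function permutations and the toy rows are hypothetical, and hmatch_permute may well go about this differently.

```r
# Hypothetical helper: all orderings of the elements of a vector
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Toy rows, invented for illustration: adm1 and adm2 are swapped in the raw data
raw_row <- data.frame(adm0 = "Canada", adm1 = "Toronto", adm2 = "Ontario",
                      stringsAsFactors = FALSE)
ref_row <- data.frame(adm0 = "Canada", adm1 = "Ontario", adm2 = "Toronto",
                      stringsAsFactors = FALSE)

cols <- names(raw_row)

# Keep the column orderings of raw_row whose values equal ref_row's, level by level
hits <- Filter(
  function(p) all(unlist(raw_row[p]) == unlist(ref_row[cols])),
  permutations(cols)
)
hits  # the ordering adm0, adm2, adm1 recovers the reference row
```

The number of orderings grows factorially with the number of levels, so this brute-force approach is only practical for a handful of hierarchical columns.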
For the remaining rows that we haven't yet matched, there are a few options. We could use hmatch_settle() to settle for matches below the highest-resolution level specified within a given row of raw. We could also do some 'manual' comparison of the raw and reference datasets and create a dictionary to recode values within raw to match corresponding entries in ref. Here we'll do both.
ne_dict <- data.frame(
  value = "NJ",
  replacement = "New Jersey",
  variable = "adm1"
)

(raw_match5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict, fuzzy = TRUE, type = "resolve_inner"))
(raw_remain5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict, fuzzy = TRUE, type = "resolve_anti"))
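Conceptually, a dictionary with value, replacement, and variable columns describes a set of recodings to apply to the raw data before matching. The base R sketch below shows one way such a dictionary could be applied. The function apply_dict_sketch and the toy data are hypothetical, and the way hmatch functions use the dict argument internally may differ.

```r
# Hypothetical helper (NOT hmatch's internal code): recode values in `dat`
# according to a dictionary with columns value, replacement, variable
apply_dict_sketch <- function(dat, dict) {
  for (i in seq_len(nrow(dict))) {
    col <- dict$variable[i]
    hit <- !is.na(dat[[col]]) & dat[[col]] == dict$value[i]
    dat[[col]][hit] <- dict$replacement[i]
  }
  dat
}

# Toy data, invented for illustration only
toy_raw <- data.frame(
  id = c("P1", "P2"),
  adm1 = c("NJ", "New York"),
  stringsAsFactors = FALSE
)
toy_dict <- data.frame(
  value = "NJ",
  replacement = "New Jersey",
  variable = "adm1",
  stringsAsFactors = FALSE
)

recoded <- apply_dict_sketch(toy_raw, toy_dict)
recoded$adm1  # "New Jersey" "New York"
```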