n_gram_merge: Value merging based on ngram fingerprints

View source: R/n_gram_merge.R

n_gram_merge    R Documentation

Value merging based on ngram fingerprints

Description

This function takes a character vector and edits and merges values that are approximately equivalent yet not identical. It uses a two-step process. The first step clusters values based on their ngram fingerprint (described at https://openrefine.org/docs/technical-reference/clustering-in-depth). The second step merges values based on approximate string matching of the ngram fingerprints, using the sd_lower_tri() C function from the package stringdist.
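To make the first step concrete, here is a minimal base R sketch of an ngram fingerprint along the lines of the OpenRefine description: lowercase the string, drop everything that is not a letter or digit, collect the ngrams, then sort, de-duplicate and join them. The helper name ngram_fingerprint is hypothetical and the normalization is deliberately simplified (for example, it discards non-ASCII characters instead of transliterating them), so this should not be read as the code refinr runs internally.

```r
# Hypothetical helper, not part of refinr: a simplified ngram fingerprint.
ngram_fingerprint <- function(x, n = 2) {
  # Normalize: lowercase, then keep only ASCII letters and digits.
  x <- gsub("[^a-z0-9]", "", tolower(x))
  vapply(x, function(s) {
    # Strings shorter than n have no complete ngram; return them as-is.
    if (nchar(s) < n) return(s)
    starts <- seq_len(nchar(s) - n + 1)
    grams <- substring(s, starts, starts + n - 1)
    # Sort (C-locale order via radix), de-duplicate, and join the tokens.
    paste(sort(unique(grams), method = "radix"), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

x <- c("Acme Pizza", "ACME PIZZA", "acme-pizza", "Acme Piza")
fp <- ngram_fingerprint(x)

# Step one: values that share a fingerprint fall into the same cluster.
clusters <- split(x, fp)
clusters
```

In this sketch the first three values share one fingerprint, while "Acme Piza" gets a different one because it lacks the "zz" bigram. Near misses of that kind are what the second, approximate matching step is meant to catch.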

Usage

n_gram_merge(
  vect,
  numgram = 2,
  ignore_strings = NULL,
  bus_suffix = TRUE,
  edit_threshold = 1,
  weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5),
  ...
)

Arguments

vect

Character vector, items to be potentially clustered and merged.

numgram

Numeric value, indicating the number of characters that will occupy each ngram token. Default value is 2.
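The effect of the token size can be illustrated with a small base R sketch. The helper ngram_tokens below is hypothetical (it is not a refinr function) and assumes a simplified normalization: lowercase the string and keep only ASCII letters and digits.

```r
# Hypothetical helper: sorted, de-duplicated ngram tokens of a single string.
ngram_tokens <- function(s, n) {
  s <- gsub("[^a-z0-9]", "", tolower(s))
  # Strings shorter than n have no complete ngram; return them as-is.
  if (nchar(s) < n) return(s)
  starts <- seq_len(nchar(s) - n + 1)
  sort(unique(substring(s, starts, starts + n - 1)), method = "radix")
}

# With 1-character tokens, both spellings reduce to the same set of letters.
one_a <- ngram_tokens("Acme Piza", 1)
one_b <- ngram_tokens("Acme Pizza", 1)
identical(one_a, one_b)

# With 2-character tokens, the doubled "z" produces an extra token.
two_a <- ngram_tokens("Acme Piza", 2)
two_b <- ngram_tokens("Acme Pizza", 2)
setdiff(two_b, two_a)
```

Under these assumptions, smaller tokens make fingerprints more permissive (more values collapse together), while larger tokens preserve more of the character order and keep more values apart.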

ignore_strings

Character vector of strings to ignore during the merging of values within vect. Default value is NULL.

bus_suffix

Logical, indicating whether the merging of records should be insensitive to common business suffixes or not. Default value is TRUE.

edit_threshold

Numeric value, indicating the threshold at which a merge is performed, based on the sum of the edit values derived from param weight. Default value is 1. If this parameter is set to 0 or NA, then no approximate string matching will be done, and all merging will be based on strings that have identical ngram fingerprints.

weight

Numeric vector, indicating the weights to assign to the four edit operations (see details below), for the purpose of approximate string matching. Default values are c(d = 0.33, i = 0.33, s = 1, t = 0.5). This parameter gets passed along to the stringdist function. Must be either a numeric vector of length four, or NA.

...

Additional args to be passed along to the stringdist function. The acceptable args are identical to those of stringdistmatrix().

Details

The values of arg weight are edit penalties that get passed to the stringdist edit distance function. The param takes four named values, one for each type of edit, with the following default penalties:

  • d: deletion, default value is 0.33

  • i: insertion, default value is 0.33

  • s: substitution, default value is 1

  • t: transposition, default value is 0.5
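To show how the four penalties combine into a single distance, the following base R sketch implements a weighted optimal string alignment (OSA) distance. It assumes OSA is the metric in use (it is the default method in stringdist) and is a textbook dynamic program, not the stringdist C code. The helper name weighted_osa is hypothetical, and the two fingerprints are hand-computed bigram fingerprints of "acme pizza" and "acme piza" under a simplified scheme (lowercase, non-alphanumerics removed).

```r
# Hypothetical helper: cheapest way to turn `a` into `b` under the four penalties.
weighted_osa <- function(a, b,
                         weight = c(d = 0.33, i = 0.33, s = 1, t = 0.5)) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  na <- length(a)
  nb <- length(b)
  # D[i + 1, j + 1]: cost of turning the first i chars of a into the first j of b.
  D <- matrix(0, na + 1, nb + 1)
  D[, 1] <- (0:na) * weight[["d"]]
  D[1, ] <- (0:nb) * weight[["i"]]
  for (i in seq_len(na)) {
    for (j in seq_len(nb)) {
      sub <- if (a[i] == b[j]) 0 else weight[["s"]]
      D[i + 1, j + 1] <- min(
        D[i, j + 1] + weight[["d"]],  # delete a[i]
        D[i + 1, j] + weight[["i"]],  # insert b[j]
        D[i, j] + sub                 # substitute (free on a match)
      )
      # Transposition of two adjacent characters.
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        D[i + 1, j + 1] <- min(D[i + 1, j + 1], D[i - 1, j - 1] + weight[["t"]])
      }
    }
  }
  D[na + 1, nb + 1]
}

fp_a <- "accmepizmepizazz"  # bigram fingerprint of "acme pizza" (simplified scheme)
fp_b <- "accmepizmepiza"    # bigram fingerprint of "acme piza"

d_fp    <- weighted_osa(fp_a, fp_b)      # two deletions
d_swap  <- weighted_osa("acme", "amce")  # one transposition
d_subst <- weighted_osa("cat", "cut")    # delete + insert beats one substitution
c(d_fp, d_swap, d_subst)
```

One consequence of the default weights is visible in the last case: because a deletion plus an insertion costs 0.66, which is less than the substitution penalty of 1, a single-character replacement is effectively scored at 0.66. The resulting sums are what edit_threshold is compared against when deciding whether two clusters are close enough to merge.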

Value

Character vector with similar values merged.

Examples

library(refinr)

x <- c("Acme Pizza, Inc.", "ACME PIZA COMPANY", "Acme Pizzazza LLC")

n_gram_merge(vect = x)

# The performance of the approximate string matching can be adjusted using
# parameters 'weight' or 'edit_threshold'
n_gram_merge(vect = x,
             weight = c(d = 0.4, i = 1, s = 1, t = 1))

# Use parameter 'ignore_strings' to ignore specific strings during merging
# of values.
x <- c("Bakersfield Highschool", "BAKERSFIELD high",
       "high school, bakersfield")
n_gram_merge(vect = x, ignore_strings = c("high", "school", "highschool"))


refinr documentation built on Nov. 13, 2023, 1:09 a.m.