View source: R/FuzzyTokenSet.R
FuzzyTokenSet (R Documentation)
Compares a pair of token sets x and y by computing the
optimal cost of transforming x into y using single-token
operations (insertions, deletions and substitutions). The cost of each
single-token operation is determined at the character level using an
internal string comparator.
FuzzyTokenSet(
inner_comparator = Levenshtein(normalize = TRUE),
agg_function = base::mean,
deletion = 1,
insertion = 1,
substitution = 1
)
inner_comparator
    Inner string distance comparator used to compare tokens at the
    character level.
agg_function
    Function used to aggregate the costs of the optimal operations.
    Defaults to base::mean.
deletion
    Non-negative weight associated with deletion of a token. Defaults to 1.
insertion
    Non-negative weight associated with insertion of a token. Defaults to 1.
substitution
    Non-negative weight associated with substitution of a token. Defaults to 1.
A token set is an unordered enumeration of tokens, which may include
duplicates. Given two token sets x and y, this comparator
computes the optimal cost of transforming x into y using the
following single-token operations:
- deleting a token a from x at cost w_d × inner(a, "")
- inserting a token b in y at cost w_i × inner("", b)
- substituting a token a in x for a token b in y at cost w_s × inner(a, b)

where inner is an internal string comparator and w_d, w_i, w_s are
non-negative weights, referred to as deletion, insertion and
substitution in the parameter list. By default, the mean cost of the
optimal set of operations is returned. Other methods of aggregating
the costs are supported by specifying a non-default agg_function.
If the internal string comparator is a distance function, the optimal set of
operations minimizes the total cost; otherwise, it maximizes it. The
optimization problem is solved exactly using a linear sum assignment solver.
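The cost optimization can be sketched in base R. This is a minimal
illustration, not the package's implementation: it assumes a
normalized Levenshtein inner comparator (edit distance via adist,
divided by the length of the longer string, which may differ from the
package's normalization) and equal-sized token sets, so only
substitutions are needed and the assignment can be brute-forced over
permutations instead of using a linear sum assignment solver.

```r
# Normalized Levenshtein distance between two tokens (assumed
# normalization: divide by the length of the longer string).
norm_lev <- function(a, b) {
  if (nchar(a) == 0 && nchar(b) == 0) return(0)
  drop(adist(a, b)) / max(nchar(a), nchar(b))
}

# All permutations of 1..n (fine for small token sets).
perms <- function(n) {
  if (n == 1) return(list(1L))
  out <- list()
  for (p in perms(n - 1)) {
    for (i in seq_len(n)) {
      out[[length(out) + 1L]] <- append(p, n, after = i - 1L)
    }
  }
  out
}

# Optimal mean cost over all token alignments; for a distance
# comparator the optimal alignment minimizes the aggregated cost.
fuzzy_token_set <- function(x, y, agg = mean) {
  stopifnot(length(x) == length(y))
  costs <- sapply(perms(length(x)), function(p) {
    agg(mapply(norm_lev, x, y[p]))
  })
  min(costs)
}

fuzzy_token_set(c("UNIV", "CALIF"), c("CALIFORNIA", "UNIVERSITY"))
```

Here the optimal alignment pairs "UNIV" with "UNIVERSITY" (cost 6/10)
and "CALIF" with "CALIFORNIA" (cost 5/10), giving a mean of 0.55,
rather than the identity alignment, which pairs dissimilar tokens.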
This comparator is qualitatively similar to the MongeElkan
comparator; however, it is arguably more principled, since it is formulated
as a cost optimization problem. It also offers finer control over the costs
of missing tokens (by varying the deletion and insertion weights).
This is useful when comparing full names, where dropping a name (e.g. a
middle name) shouldn't be severely penalized.
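The effect of the deletion weight can be worked through by hand. This
hedged sketch assumes that matched tokens are substitutions with inner
cost 0, that deleting a token a costs w_d × inner(a, "") = w_d under a
normalized inner comparator, and that the mean is taken over all
operations; the package's exact accounting may differ.

```r
# Transforming c("JOSE", "ELIAS", "TEJADA", "BASQUES") into
# c("JOSE", "BASQUES"): two zero-cost substitutions (the matched
# tokens) plus two deletions at cost w_d each.
mean_cost <- function(w_d) {
  sub_costs <- c(0, 0)          # JOSE -> JOSE, BASQUES -> BASQUES
  del_costs <- c(w_d, w_d)      # ELIAS and TEJADA are dropped
  mean(c(sub_costs, del_costs))
}

mean_cost(1)    # default deletion weight
mean_cost(0.5)  # halving the weight halves the distance
```

Under these assumptions, halving the deletion weight halves the
distance (0.5 versus 0.25), so names that merely drop tokens are
penalized less.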
## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
FuzzyTokenSet()(x, y)
# Reduce the cost associated with missing words
FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y)
## Compare full name with abbreviated name, reducing the penalty
## for dropping parts of the name
fullname <- "JOSE ELIAS TEJADA BASQUES"
name <- "JOSE BASQUES"
# Tokenize strings on white space
fullname <- strsplit(fullname, '\\s+')
name <- strsplit(name, '\\s+')
comparator <- FuzzyTokenSet(deletion = 0.5)
comparator(fullname, name) < comparator(name, fullname) # TRUE