View source: R/FuzzyTokenSet.R
FuzzyTokenSet — R Documentation
Compares a pair of token sets x and y by computing the optimal cost of transforming x into y using single-token operations (insertions, deletions and substitutions). The cost of single-token operations is determined at the character level using an internal string comparator.
FuzzyTokenSet(
  inner_comparator = Levenshtein(normalize = TRUE),
  agg_function = base::mean,
  deletion = 1,
  insertion = 1,
  substitution = 1
)
inner_comparator: inner string comparator used to determine the cost of single-token operations at the character level. Defaults to Levenshtein(normalize = TRUE).

agg_function: function used to aggregate the costs of the optimal operations. Defaults to base::mean.

deletion: non-negative weight associated with deletion of a token. Defaults to 1.

insertion: non-negative weight associated with insertion of a token. Defaults to 1.

substitution: non-negative weight associated with substitution of a token. Defaults to 1.
A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets x and y, this comparator computes the optimal cost of transforming x into y using the following single-token operations:
deleting a token a from x at cost w_d * inner(a, "")
inserting a token b in y at cost w_i * inner("", b)
substituting a token a in x for a token b in y at cost w_s * inner(a, b)
where inner is an internal string comparator and w_d, w_i, w_s are non-negative weights, referred to as deletion, insertion and substitution in the parameter list. By default, the mean cost of the optimal set of operations is returned. Other methods of aggregating the costs are supported by specifying a non-default agg_function.
If the internal string comparator is a distance function, then the optimal set of operations minimizes the cost. Otherwise, the optimal set of operations maximizes the cost. The optimization problem is solved exactly using a linear sum assignment solver.
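To make the cost construction concrete, the following base-R sketch reproduces the optimization described above under the default settings (normalized Levenshtein inner comparator, mean aggregation). It is not the package's implementation: both token sets are padded with empty-string tokens so that deletions and insertions become assignments against padding, and the assignment is found by brute-force enumeration rather than a proper linear sum assignment solver, so it is only practical for small token sets.

```r
# Normalized Levenshtein distance using base R's adist()
norm_lev <- function(a, b) {
  if (nchar(a) == 0 && nchar(b) == 0) return(0)
  drop(adist(a, b)) / max(nchar(a), nchar(b))
}

fuzzy_token_set <- function(x, y, deletion = 1, insertion = 1, substitution = 1) {
  m <- length(x); n <- length(y)
  # Pad so every x token can be deleted and every y token inserted;
  # matching two padding tokens costs nothing.
  xs <- c(x, rep("", n)); ys <- c(y, rep("", m))
  k <- m + n
  cost <- matrix(0, k, k)
  for (i in seq_len(k)) for (j in seq_len(k)) {
    w <- if (i <= m && j <= n) substitution   # w_s * inner(a, b)
         else if (i <= m) deletion            # w_d * inner(a, "")
         else if (j <= n) insertion           # w_i * inner("", b)
         else 0                               # padding-to-padding
    cost[i, j] <- w * norm_lev(xs[i], ys[j])
  }
  # Brute-force all k! assignments and keep the cheapest
  perms <- function(v) {
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
    out
  }
  best <- Inf; best_p <- NULL
  for (p in perms(seq_len(k))) {
    s <- sum(cost[cbind(seq_len(k), p)])
    if (s < best) { best <- s; best_p <- p }
  }
  # Aggregate (mean) over actual operations, excluding padding-to-padding pairs
  ops <- !(seq_len(k) > m & best_p > n)
  mean(cost[cbind(seq_len(k), best_p)][ops])
}

fuzzy_token_set(c("JOSE", "ELIAS", "TEJADA", "BASQUES"),
                c("JOSE", "BASQUES"), deletion = 0.5)
# 0.25: two exact matches (cost 0) plus two deletions at cost 0.5 each,
# averaged over the 4 operations
```

Note the asymmetry this induces: transforming the long name into the short one uses cheap deletions (weight 0.5), while the reverse direction uses full-price insertions, so the two directions give different costs.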
This comparator is qualitatively similar to the MongeElkan comparator; however, it is arguably more principled, since it is formulated as a cost optimization problem. It also offers more control over the costs of missing tokens (by varying the deletion and insertion weights).
This is useful for comparing full names, when dropping a name (e.g.
middle name) shouldn't be severely penalized.
## Compare names with heterogeneous representations
x <- "The University of California - San Diego"
y <- "Univ. Calif. San Diego"
# Tokenize strings on white space
x <- strsplit(x, '\\s+')
y <- strsplit(y, '\\s+')
FuzzyTokenSet()(x, y)
# Reduce the cost associated with missing words
FuzzyTokenSet(deletion = 0.5, insertion = 0.5)(x, y)

## Compare full name with abbreviated name, reducing the penalty
## for dropping parts of the name
fullname <- "JOSE ELIAS TEJADA BASQUES"
name <- "JOSE BASQUES"
# Tokenize strings on white space
fullname <- strsplit(fullname, '\\s+')
name <- strsplit(name, '\\s+')
comparator <- FuzzyTokenSet(deletion = 0.5)
comparator(fullname, name) < comparator(name, fullname)  # TRUE