MongeElkan | R Documentation |
Compares a pair of token sets x and y by computing similarity scores between all pairs of tokens using an internal string comparator, then taking the mean of the maximum scores for each token in x.
MongeElkan( inner_comparator = Levenshtein(similarity = TRUE, normalize = TRUE), agg_function = base::mean, symmetrize = FALSE )
inner_comparator |
internal string comparator of class
|
agg_function |
aggregation function to use when aggregating internal
similarities/distances between tokens. Defaults to |
symmetrize |
logical indicating whether to use a symmetrized version of the Monge-Elkan comparator. Defaults to FALSE. |
A token set is an unordered enumeration of tokens, which may include duplicates. Given two token sets x and y, the Monge-Elkan comparator is defined as:
ME(x, y) = 1/|x| sum_i{ max_j sim(x_i, y_j) }
where x_i is the i-th token in x, |x| is the number of tokens in x and sim is an internal string similarity comparator.
A generalization of the original Monge-Elkan comparator is implemented here, which allows for distance comparators in place of similarity comparators, and/or more general aggregation functions in place of the arithmetic mean. The generalized Monge-Elkan comparator is defined as:
ME(x, y) = agg(opt_j inner(x_i, y_j))
where inner is an internal distance or similarity comparator, opt is max if inner is a similarity comparator or min if it is a distance comparator, and agg is an aggregation function which takes a vector of scores for each token in x and returns a scalar.
By default, the Monge-Elkan comparator is asymmetric in its arguments x
and y. If symmetrize = TRUE
, a symmetric version of the comparator
is obtained as follows
ME_sym(x, y) = opt { ME(x, y), ME(y, x) }
where opt is defined above.
A MongeElkan
instance is returned, which is an S4 class inheriting from
StringComparator
.
Monge, A. E., & Elkan, C. (1996), "The Field Matching Problem: Algorithms and Applications", In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 267-270.
Jimenez, S., Becerra, C., Gelbukh, A., & Gonzalez, F. (2009), "Generalized Monge-Elkan Method for Approximate Text String Comparison", In Computational Linguistics and Intelligent Text Processing, pp. 559-570.
## Compare names with heterogenous representations x <- "The University of California - San Diego" y <- "Univ. Calif. San Diego" # Tokenize strings on white space x <- strsplit(x, '\\s+') y <- strsplit(y, '\\s+') MongeElkan()(x, y) ## The symmetrized variant is arguably more appropriate for this example MongeElkan(symmetrize = TRUE)(x, y) ## Using a different internal comparator changes the result MongeElkan(inner_comparator = BinaryComp(), symmetrize=TRUE)(x, y)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.