InVocabulary | R Documentation |
Compares a pair of strings x and y using a reference vocabulary. Different scores are returned depending on whether both/one/neither of x and y are in the reference vocabulary.
InVocabulary( vocab, both_in_distinct = 0.7, both_in_same = 1, one_in = 1, none_in = 1, ignore_case = FALSE )
vocab |
a vector containing in-vocabulary (known) strings. Any strings not in this vector are out-of-vocabulary (unknown). |
both_in_distinct |
score to return if the pair of values being compared are both in vocab and are distinct. |
both_in_same |
score to return if the pair of values being compared are both in vocab and are the same. |
one_in |
score to return if only one of the pair of values being compared is in vocab. |
none_in |
score to return if neither of the pair of values being compared is in vocab. |
ignore_case |
a logical. If TRUE, case is ignored when comparing the strings. |
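The arguments describe a four-way case rule, which can be sketched in base R as follows. This is only an illustration and not the package's implementation: in_vocab_score is a hypothetical helper, and applying ignore_case to the vocabulary lookup as well as to the string comparison is an assumption.

```r
# Hypothetical helper: a base-R sketch of the scoring rule described by the
# arguments. It is not the package's own code.
in_vocab_score <- function(x, y, vocab,
                           both_in_distinct = 0.7, both_in_same = 1,
                           one_in = 1, none_in = 1, ignore_case = FALSE) {
  if (ignore_case) {
    # Assumption: case is ignored for the vocabulary lookup too
    x <- tolower(x)
    y <- tolower(y)
    vocab <- tolower(vocab)
  }
  # Count how many members of each pair are known (0, 1 or 2);
  # a length-1 x is recycled against a longer y
  n_in <- (x %in% vocab) + (y %in% vocab)
  ifelse(n_in == 2,
         ifelse(x == y, both_in_same, both_in_distinct),
         ifelse(n_in == 1, one_in, none_in))
}

vocab <- c("Roberto", "Umberto")
# Pairs: both known and distinct, one known, neither known
in_vocab_score("Roberto", c("Umberto", "Rberto"), vocab)
in_vocab_score("Rberto", "Unberto", vocab, none_in = 0.5)
```

With the default settings only the both-known-and-distinct case yields a factor below 1, so all other pairs pass through unchanged when the factor is multiplied into another score.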
This comparator is not intended to produce useful scores on its own. Rather, it is intended to produce multiplicative factors which can be applied to other similarity/distance scores. It is particularly useful for comparing names when a reference list (vocabulary) of known names is available. For example, it can be configured to down-weight the similarity scores of distinct (known) names like "Roberto" and "Umberto" which are semantically very different, but deceptively similar in terms of edit distance. The normalized Levenshtein similarity for these two names is 75%, but their similarity can be reduced to 53% if multiplied by the score from this comparator using the default settings.
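The "Roberto"/"Umberto" figures can be checked by hand with base R. The sketch below assumes the normalized Levenshtein distance is computed as 2d / (|x| + |y| + d), where d is the unit-cost edit distance; that normalization is an assumption here, chosen because it reproduces the 75% quoted above. The variable names are illustrative only.

```r
x <- "Roberto"
y <- "Umberto"

# Unit-cost Levenshtein distance from base R (utils::adist): two substitutions
d <- drop(utils::adist(x, y))

# Assumed normalization: 2d / (|x| + |y| + d) = 4 / 16 = 0.25
norm_dist <- 2 * d / (nchar(x) + nchar(y) + d)
lev_sim <- 1 - norm_dist

# Both names are known and distinct, so the default factor is 0.7
vocab_factor <- 0.7
adjusted <- lev_sim * vocab_factor

c(levenshtein = lev_sim, adjusted = adjusted)
```

The adjusted value is approximately 0.525, which is the 53% reported above after rounding.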
An InVocabulary instance is returned, which is an S4 class inheriting from StringComparator.
## Compare names with possible typos using a reference of known names
known_names <- c("Roberto", "Umberto", "Alberto", "Emberto", "Norberto", "Humberto")
m1 <- InVocabulary(known_names)
m2 <- Levenshtein(similarity = TRUE, normalize = TRUE)
x <- "Emberto"
y <- c("Enberto", "Umberto")
# "Emberto" and "Umberto" are likely to refer to distinct people (since
# they are known distinct names) so their Levenshtein similarity is
# downweighted to 0.61. "Emberto" and "Enberto" may refer to the same
# person (likely typo), so their Levenshtein similarity of 0.87 is not
# downweighted.
similarities <- m1(x, y) * m2(x, y)