stringdist-metrics: String metrics in 'stringdist'

stringdist-metricsR Documentation

String metrics in stringdist

Description

This page gives an overview of the string dissimilarity measures offered by stringdist.

String Metrics

String metrics are ways of quantifying the dissimilarity between two finite sequences, usually text strings. Over the years, many such measures have been developed. Some are based on a mathematical understanding of the set of all strings that can be composed from a finite alphabet, others are based on more heuristic principles, such as how a text string sounds when pronounced by a native English speaker.

The terms 'string metrics' and 'string distance' are used more or less interchangibly in literature. From a mathematical point of view, string metrics often do not obey the demands that are usually required from a distance function. For example, it is not true for all string metrics that a distance of 0 means that two strings are the same (e.g. in the q-gram distance). Nevertheless, string metrics are very useful in practice and have many applications.

The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo's, the Jaro-Winkler distance may be of use. If you are comparing names that were written down after hearing them, a phonetic distance may be a better choice.

Currently, the following distance metrics are supported by stringdist.

Method name Description
osa Optimal string aligment, (restricted Damerau-Levenshtein distance).
lv Levenshtein distance (as in R's native adist).
dl Full Damerau-Levenshtein distance.
hamming Hamming distance (a and b must have same nr of characters).
lcs Longest common substring distance.
qgram q-gram distance.
cosine cosine distance between q-gram profiles
jaccard Jaccard distance between q-gram profiles
jw Jaro, or Jaro-Winkler distance.
soundex Distance based on soundex encoding (see below)

A short description of string metrics supported by stringdist

See Van der Loo (2014) for an extensive description and references. The review papers of Navarro (2001) and Boytsov (2011) provide excellent technical overviews of respectively online and offline string matching algorithms.

The Hamming distance (method='hamming') counts the number of character substitutions that turns b into a. If a and b have different number of characters the distance is Inf.

The Levenshtein distance (method='lv') counts the number of deletions, insertions and substitutions necessary to turn b into a. This method is equivalent to R's native adist function.

The Optimal String Alignment distance (method='osa') is like the Levenshtein distance but also allows transposition of adjacent characters. Here, each substring may be edited only once. (For example, a character cannot be transposed twice to move it forward in the string).

The full Damerau-Levenshtein distance (method='dl') is like the optimal string alignment distance except that it allows for multiple edits on substrings.

The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.

A q-gram (method='qgram') is a subsequence of q consecutive characters of a string. If x (y) is the vector of counts of q-gram occurrences in a (b), the q-gram distance is given by the sum over the absolute differences |x_i-y_i|. The computation is aborted when q is is larger than the length of any of the strings. In that case Inf is returned.

The cosine distance (method='cosine') is computed as 1-x\cdot y/(\|x\|\|y\|), where x and y were defined above.

Let X be the set of unique q-grams in a and Y the set of unique q-grams in b. The Jaccard distance (method='jaccard') is given by 1-|X\cap Y|/|X\cup Y|.

The Jaro distance (method='jw', p=0), is a number between 0 (exact match) and 1 (completely dissimilar) measuring dissimilarity between strings. It is defined to be 0 when both strings have length 0, and 1 when there are no character matches between a and b. Otherwise, the Jaro distance is defined as 1-(1/3)(w_1m/|a| + w_2m/|b| + w_3(m-t)/m). Here,|a| indicates the number of characters in a, m is the number of character matches and t the number of transpositions of matching characters. The w_i are weights associated with the characters in a, characters in b and with transpositions. A character c of a matches a character from b when c occurs in b, and the index of c in a differs less than \max(|a|,|b|)/2 -1 (where we use integer division) from the index of c in b. Two matching characters are transposed when they are matched but they occur in different order in string a and b.

The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d - l\cdot p\cdot d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a 'prefix' factor, which in the work of Winkler is often chosen 0.1.

For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ascii characters are encountered. Also see printable_ascii.

The running_cosine distance is an implementatation of the cosine distance especially meant for fuzzy text search as in afind. In fuzzy search a window of n characters slides accros a (long) string while for each position of the window the distance between the part of the string in the window and a search pattern is computed. The (position of) the window with the shortest distance to the search pattern is returned. Sliding the window with a single position only affects the q-grams at the beginning and end of the window, and the 'running cosine' distance uses this and a few other tricks to save calculations.

References

  • MPJ van der Loo (2014) The stringdist package for approximate string matching. The R Journal 6(1) 111-122.

  • L. Boytsov (2011). Indexing methods for approximate dictionary searching: comparative analyses. ACM Journal of experimental algorithmics 16 1-88.

  • G. Navarro (2001). A guided tour to approximate string matching. ACM Computing Surveys 33 31-88.

See Also

  • Functions applying string metrics to text: stringdist, stringdistmatrix, amatch

  • Functions applying string metrics to integer sequences: seq_dist, seq_distmatrix, seq_amatch

  • Encoding issues: stringdist-encoding


stringdist documentation built on May 29, 2024, 11:13 a.m.