| Levenshtein | R Documentation | 
The Levenshtein (edit) distance between two strings/sequences x and
y is the minimum cost of operations (insertions, deletions or
substitutions) required to transform x into y.
Levenshtein(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)
| deletion | positive cost associated with deletion of a character or sequence element. Defaults to unit cost. | 
| insertion | positive cost associated insertion of a character or sequence element. Defaults to unit cost. | 
| substitution | positive cost associated with substitution of a character or sequence element. Defaults to unit cost. | 
| normalize | a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE. | 
| similarity | a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE. | 
| ignore_case | a logical. If TRUE, case is ignored when comparing strings. | 
| use_bytes | a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character. | 
For simplicity we assume x and y are strings in this section,
however the comparator is also implemented for more general sequences.
A Levenshtein similarity is returned if similarity = TRUE, which
is defined as
\mathrm{sim}(x, y) = \frac{w_d |x| + w_i |y| - \mathrm{dist}(x, y)}{2},
where |x|, |y| are the number of characters in x and
y respectively, \mathrm{dist} is the Levenshtein distance,
w_d is the cost of a deletion and w_i is the cost of an
insertion.
Normalization of the Levenshtein distance/similarity to the unit interval
is also supported by setting normalize = TRUE. The normalization approach
follows Yujian and Bo (2007), and ensures that the distance remains a metric
when the costs of insertion w_i and deletion w_d are equal.
The normalized distance \mathrm{dist}_n is defined as
\mathrm{dist}_n(x, y) = \frac{2 \mathrm{dist}(x, y)}{w_d |x| + w_i |y| + \mathrm{dist}(x, y)},
and the normalized similarity \mathrm{sim}_n is defined as
\mathrm{sim}_n(x, y) = 1 - \mathrm{dist}_n(x, y) = \frac{\mathrm{sim}(x, y)}{w_d |x| + w_i |y| - \mathrm{sim}(x, y)}.
A Levenshtein instance is returned, which is an S4 class inheriting from
StringComparator.
If the costs of deletion and insertion are equal, this comparator is
symmetric in x and y. In addition, the normalized and
unnormalized distances satisfy the properties of a metric.
Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.
Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.
Other edit-based comparators include Hamming, LCS,
OSA and DamerauLevenshtein.
## Compare names with potential typos
x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve")
pairwise(Levenshtein(), x, return_matrix = TRUE)
## When the substitution cost is high, Levenshtein distance reduces to LCS distance
Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.