Levenshtein: Levenshtein String/Sequence Comparator

View source: R/Levenshtein.R

LevenshteinR Documentation

Levenshtein String/Sequence Comparator

Description

The Levenshtein (edit) distance between two strings/sequences x and y is the minimum cost of operations (insertions, deletions or substitutions) required to transform x into y.

Usage

Levenshtein(
  deletion = 1,
  insertion = 1,
  substitution = 1,
  normalize = FALSE,
  similarity = FALSE,
  ignore_case = FALSE,
  use_bytes = FALSE
)

Arguments

deletion

positive cost associated with deletion of a character or sequence element. Defaults to unit cost.

insertion

positive cost associated insertion of a character or sequence element. Defaults to unit cost.

substitution

positive cost associated with substitution of a character or sequence element. Defaults to unit cost.

normalize

a logical. If TRUE, distances are normalized to the unit interval. Defaults to FALSE.

similarity

a logical. If TRUE, similarity scores are returned instead of distances. Defaults to FALSE.

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.

Details

For simplicity we assume x and y are strings in this section, however the comparator is also implemented for more general sequences.

A Levenshtein similarity is returned if similarity = TRUE, which is defined as

sim(x, y) = (w_d |x| + w_i |y| - dist(x, y))/2

where |x|, |y| are the number of characters in x and y respectively, dist is the Levenshtein distance, w_d is the cost of a deletion and w_i is the cost of an insertion.

Normalization of the Levenshtein distance/similarity to the unit interval is also supported by setting normalize = TRUE. The normalization approach follows Yujian and Bo (2007), and ensures that the distance remains a metric when the costs of insertion w_i and deletion w_d are equal. The normalized distance dist_n is defined as

dist_n(x, y) = 2 * dist(x, y) / (w_d |x| + w_i |y| + dist(x, y)),

and the normalized similarity sim_n is defined as

sim_n(x, y) = 1 - dist_n(x, y) = sim(x, y) / (w_d |x| + w_i |y| - sim(x, y)).

Value

A Levenshtein instance is returned, which is an S4 class inheriting from StringComparator.

Note

If the costs of deletion and insertion are equal, this comparator is symmetric in x and y. In addition, the normalized and unnormalized distances satisfy the properties of a metric.

References

Navarro, G. (2001), "A guided tour to approximate string matching", ACM Computing Surveys (CSUR), 33(1), 31-88.

Yujian, L. & Bo, L. (2007), "A Normalized Levenshtein Distance Metric", IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1091–1095.

See Also

Other edit-based comparators include Hamming, LCS, OSA and DamerauLevenshtein.

Examples

## Compare names with potential typos
x <- c("Brian Cheng", "Bryan Cheng", "Kondo Onyejekwe", "Condo Onyejekve")
pairwise(Levenshtein(), x, return_matrix = TRUE)

## When the substitution cost is high, Levenshtein distance reduces to LCS distance
Levenshtein(substitution = 100)("Iran", "Iraq") == LCS()("Iran", "Iraq")


comparator documentation built on March 18, 2022, 6:15 p.m.