Jaro: Jaro String/Sequence Comparator

Description

Compares a pair of strings/sequences x and y based on the number of greedily aligned characters/sequence elements and the number of transpositions. It was developed for comparing names at the U.S. Census Bureau.

Usage

Jaro(similarity = TRUE, ignore_case = FALSE, use_bytes = FALSE)

Arguments

similarity

a logical. If TRUE (default), similarity scores are returned; otherwise, distances are returned (see the definition under Details).

ignore_case

a logical. If TRUE, case is ignored when comparing strings.

use_bytes

a logical. If TRUE, strings are compared byte-by-byte rather than character-by-character.
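To make the byte/character and case distinctions concrete, here is a small base-R illustration. It does not call the package, and it is only an assumption that ignore_case and use_bytes behave analogously to the case folding and byte counting shown here.

```r
# One Unicode character (e with an acute accent), written as an escape so the
# snippet does not depend on the encoding of the source file.
x <- "\u00e9"

n_chars <- nchar(x, type = "chars")  # counted character-by-character: 1
n_bytes <- nchar(x, type = "bytes")  # counted byte-by-byte (UTF-8): 2

# Case folding makes strings that differ only in case compare as equal.
same_ignoring_case <- tolower("MARTHA") == tolower("martha")  # TRUE

print(c(chars = n_chars, bytes = n_bytes))
print(same_ignoring_case)
```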

Details

For simplicity, we assume x and y are strings in this section; however, the comparator is also implemented for more general sequences.

When similarity = TRUE (default), the Jaro similarity is computed as

sim(x, y) = (1/3)(m/|x| + m/|y| + (m - floor(t/2))/m)

where m is the number of "matching" characters (defined below), t is the number of "transpositions", and |x| and |y| are the lengths of the strings x and y. The similarity takes on values in the range [0, 1], where 1 corresponds to a perfect match.
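As a quick arithmetic check of the formula, the sketch below evaluates it for given counts. The helper name jaro_from_counts is hypothetical (it is not part of the package), and returning 0 when m = 0 is an assumption, since the formula itself is undefined in that case.

```r
# Hypothetical helper: evaluates the similarity formula from the counts
# m (matches), t (transpositions) and the two string lengths.
jaro_from_counts <- function(m, t, len_x, len_y) {
  if (m == 0) return(0)  # assumption: no matching characters gives similarity 0
  (m / len_x + m / len_y + (m - floor(t / 2)) / m) / 3
}

# Illustrative counts: two strings of length 6 with m = 6 and t = 3.
# (1/3) * (6/6 + 6/6 + (6 - 1)/6) = 17/18, approximately 0.944
jaro_from_counts(m = 6, t = 3, len_x = 6, len_y = 6)
```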

The number of "matching" characters m is computed using a greedy alignment algorithm. The algorithm iterates over the characters in x, attempting to align the i-th character x_i with the first matching character in y. When looking for matching characters in y, the algorithm only considers previously unmatched characters within the window [max(0, i - w), min(|y|, i + w)], where w = floor(max(|x|, |y|)/2) - 1. The alignment process yields a subsequence of matching characters from x and a corresponding subsequence from y. The number of "transpositions" t is defined as the number of positions in the subsequence from x at which the character differs from the character at the same position in the subsequence from y.
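The alignment procedure described above can be sketched in plain R for a single pair of strings. This is an illustrative re-implementation, not the package's own code: jaro_sketch is a hypothetical name, positions are 1-based, the window half-width w is clamped at 0 so that very short strings can still match, and inputs with no matches (including empty strings) are assumed to have similarity 0.

```r
# Illustrative sketch of the Jaro similarity for two single strings.
# `jaro_sketch` is a hypothetical helper, not part of the comparator package.
jaro_sketch <- function(x, y) {
  # Assumption: an empty string has similarity 0 with anything.
  if (nchar(x) == 0L || nchar(y) == 0L) return(0)
  xs <- strsplit(x, "", fixed = TRUE)[[1]]
  ys <- strsplit(y, "", fixed = TRUE)[[1]]
  nx <- length(xs)
  ny <- length(ys)

  # Window half-width w = floor(max(|x|, |y|) / 2) - 1, clamped at 0
  # (the clamp is an assumption of this sketch).
  w <- max(0L, max(nx, ny) %/% 2L - 1L)

  x_used <- logical(nx)  # characters of x that found a partner in y
  y_used <- logical(ny)  # characters of y that are already aligned
  for (i in seq_len(nx)) {
    lo <- max(1L, i - w)
    hi <- min(ny, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      # Greedy step: take the first unmatched, equal character in the window.
      if (!y_used[j] && xs[i] == ys[j]) {
        x_used[i] <- TRUE
        y_used[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_used)
  if (m == 0L) return(0)  # assumption: no matches gives similarity 0

  # t counts positions where the two matched subsequences disagree.
  t <- sum(xs[x_used] != ys[y_used])
  (m / nx + m / ny + (m - t %/% 2L) / m) / 3
}

jaro_sketch("Martha", "Mathra")   # m = 6, t = 3, so 17/18
jaro_sketch("Eileen", "Phyllis")  # m = 1, t = 0
```

For "Martha" and "Mathra", every character finds a partner within the window (w = 2), and the matched subsequences disagree at three positions, so floor(t/2) = 1.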

When similarity = FALSE, the Jaro distance is computed as

dist(x, y) = 1 - sim(x, y).
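The relationship between the two modes can be checked directly with the constructor shown under Usage. The snippet below is a sketch that assumes the comparator package is installed; it is guarded so that it does nothing otherwise.

```r
# Only runs if the comparator package is available.
if (requireNamespace("comparator", quietly = TRUE)) {
  library(comparator)

  sim <- Jaro()("Martha", "Mathra")                    # similarity (default)
  dst <- Jaro(similarity = FALSE)("Martha", "Mathra")  # distance

  print(c(similarity = sim, distance = dst))
  # The two values should sum to 1, up to floating-point error.
  print(isTRUE(all.equal(dst, 1 - sim)))
}
```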

Value

A Jaro instance is returned, which is an S4 class inheriting from StringComparator.

Note

The Jaro distance is not a metric, as it does not satisfy the identity axiom dist(x, y) = 0 <=> x = y.

References

Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84(406), 414-420.

See Also

The JaroWinkler comparator modifies the Jaro comparator by boosting the similarity score for strings/sequences that have matching prefixes.

Examples

## Compare names
Jaro()("Martha", "Mathra")
Jaro()("Eileen", "Phyllis")


comparator documentation built on March 18, 2022, 6:15 p.m.