stringsim: Compute similarity scores between strings

Description

stringsim computes pairwise string similarities between elements of character vectors a and b, where the vector with fewer elements is recycled. stringsimmatrix computes the string similarity matrix with rows corresponding to a and columns corresponding to b.

Usage

stringsim(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  q = 1,
  ...
)

stringsimmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  q = 1,
  ...
)

Arguments

a

R object (target); will be converted by as.character.

b

R object (source); will be converted by as.character.

method

Method for distance calculation. The default is "osa"; see stringdist-metrics.

useBytes

Perform byte-wise comparison; see stringdist-encoding.

q

Size of the q-gram; must be nonnegative. Applies only to method='qgram', 'jaccard', or 'cosine'.

...

Additional arguments are passed on to stringdist and stringdistmatrix, respectively.

Details

The similarity is calculated by first computing the distance using stringdist, dividing the distance by the maximum possible distance, and subtracting the result from 1. This results in a score between 0 and 1, with 1 corresponding to complete similarity and 0 to complete dissimilarity. Note that complete similarity only means equality for distances satisfying the identity property. This is not the case for q-gram based distances, for example: if q=1, anagrams are completely similar. For distances where weights can be specified, the maximum distance is currently computed by assuming that all weights are equal to 1.
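The normalization described above can be sketched by hand for one method. This is a minimal illustration, not the package's internal code: it assumes that for the Levenshtein distance ("lv") the maximum possible distance between two strings is the length of the longer one. The variable names (d, max_d, manual) are only for illustration.

```r
library(stringdist)

a <- c("kitten", "flaw")
b <- c("sitting", "lawn")

# Step 1: compute the raw distance
d <- stringdist(a, b, method = "lv")

# Step 2: divide by the maximum possible distance
# (assumption: for "lv" this is the length of the longer string)
max_d <- pmax(nchar(a), nchar(b))

# Step 3: subtract from 1 to obtain a similarity
manual <- 1 - d / max_d
manual

# Under the assumption above, this should agree with stringsim
stringsim(a, b, method = "lv")

# q-gram distance with q = 1 ignores character order,
# so anagrams are completely similar even though they are not equal
anagram_sim <- stringsim("listen", "silent", method = "qgram", q = 1)
anagram_sim
```

The last call shows why a similarity of 1 does not imply equal strings for every method.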

Value

stringsim returns a vector of similarities, which are values between 0 and 1, where 1 corresponds to perfect similarity (distance 0) and 0 to complete dissimilarity. NA is returned when stringdist returns NA. Distances equal to Inf are truncated to a similarity of 0. stringsimmatrix works the same way but, analogous to stringdistmatrix, returns a similarity matrix instead of a vector.
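The return values described above can be checked with a few short calls. This is a small sketch with made-up input strings; it relies on the Hamming distance being Inf for strings of unequal length, which is how stringdist documents that metric.

```r
library(stringdist)

# Missing input: stringdist returns NA, and so does stringsim
na_sim <- stringsim(NA_character_, "abc")
na_sim

# Hamming distance between strings of unequal length is Inf,
# so the similarity is truncated to 0
inf_sim <- stringsim("abc", "abcd", method = "hamming")
inf_sim

# The shorter argument is recycled: one similarity per element of a
rec_sim <- stringsim(c("foo", "bar", "baz"), "foo")
length(rec_sim)

# stringsimmatrix: rows follow a, columns follow b
m <- stringsimmatrix(c("foo", "bar", "baz"), c("foo", "boo"))
dim(m)
```

Here m has one row per element of a and one column per element of b, so the entry for "foo" against "foo" is in the first row and first column.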

Examples

# Calculate the similarity using the default method of optimal string alignment
stringsim("ca", "abc")

# Calculate the similarity using the Jaro-Winkler method
# The p argument is passed on to stringdist
stringsim('MARTHA','MATHRA',method='jw', p=0.1)

stringdist documentation built on Sept. 9, 2021, 5:08 p.m.