# stringsim: Compute similarity scores between strings

In stringdist: Approximate String Matching, Fuzzy Text Search, and String Distance Functions


## Compute similarity scores between strings

### Description

`stringsim` computes pairwise string similarities between elements of `character` vectors `a` and `b`, where the vector with fewer elements is recycled. `stringsimmatrix` computes the string similarity matrix, with rows corresponding to the elements of `a` and columns to the elements of `b`.
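
The difference between the two functions is easiest to see from the shape of their results. The following is a minimal sketch, assuming the `stringdist` package is installed; the input strings are arbitrary and chosen only for illustration.

```r
library(stringdist)

# Element-wise: the single string "foo" is recycled against all three inputs
s <- stringsim(c("foo", "bar", "baz"), "foo")
length(s)  # one similarity per element of the longer vector

# All pairs: rows follow `a` (2 elements), columns follow `b` (3 elements)
m <- stringsimmatrix(c("foo", "bar"), c("foo", "baz", "bar"))
dim(m)
```

Use `stringsim` when the two vectors are already aligned pair by pair, and `stringsimmatrix` when every element of `a` has to be compared with every element of `b`.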

### Usage

``````
stringsim(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  q = 1,
  ...
)

stringsimmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  useBytes = FALSE,
  q = 1,
  ...
)
``````

### Arguments

| Argument | Description |
|---|---|
| `a` | R object (target); will be converted by `as.character`. |
| `b` | R object (source); will be converted by `as.character`. |
| `method` | Method for distance calculation. The default is `"osa"`, see `stringdist-metrics`. |
| `useBytes` | Perform byte-wise comparison, see `stringdist-encoding`. |
| `q` | Size of the `q`-gram; must be nonnegative. Only applies to `method='qgram'`, `'jaccard'` or `'cosine'`. |
| `...` | Additional arguments are passed on to `stringdist` and `stringdistmatrix` respectively. |

### Details

The similarity is calculated in three steps: the distance is computed using `stringdist`, the distance is divided by the maximum possible distance, and the result is subtracted from 1. This yields a score between 0 and 1, with 1 corresponding to complete similarity and 0 to complete dissimilarity.

Note that complete similarity implies equality only for distances that satisfy the identity property. This is not the case for q-gram based distances, for example: with q=1, anagrams are completely similar. For distances where weights can be specified, the maximum distance is currently computed by assuming that all weights are equal to 1.
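
The normalization can be reproduced by hand. The sketch below does this for the Levenshtein distance, assuming that the maximum possible distance for `"lv"` is the length of the longer string (the text above does not list the per-method maxima, so treat this as an assumption). It also checks the anagram remark for `q = 1`. The example strings are illustrative only.

```r
library(stringdist)

a <- "kitten"
b <- "sitting"

# Step 1: raw Levenshtein distance between the two strings
d <- stringdist(a, b, method = "lv")

# Step 2: assumed maximum possible distance for "lv",
# taken here to be the length of the longer string
max_d <- max(nchar(a), nchar(b))

# Step 3: similarity by hand, next to what stringsim() reports
manual <- 1 - d / max_d
manual
stringsim(a, b, method = "lv")

# With q = 1, q-gram distances only compare character counts,
# so two anagrams have distance 0 and therefore similarity 1
stringsim("listen", "silent", method = "qgram", q = 1)
```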

### Value

`stringsim` returns a vector of similarities, which are values between 0 and 1, where 1 corresponds to perfect similarity (distance 0) and 0 to complete dissimilarity. `NA` is returned when `stringdist` returns `NA`. Distances equal to `Inf` are truncated to a similarity of 0. `stringsimmatrix` works the same way but, analogous to `stringdistmatrix`, returns a similarity matrix instead of a vector.
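
The two special cases can be checked directly. This is a small sketch, assuming the `stringdist` package is installed; it relies on the Hamming distance being infinite for strings of unequal length.

```r
library(stringdist)

# A missing input gives an NA distance, hence an NA similarity
na_sim <- stringsim(NA_character_, "abc")
na_sim

# Hamming distance is Inf for strings of unequal length ...
inf_dist <- stringdist("abc", "abcd", method = "hamming")
inf_dist

# ... which, per the description above, is truncated to a similarity of 0
inf_sim <- stringsim("abc", "abcd", method = "hamming")
inf_sim
```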

### Examples

``````
library(stringdist)

# Calculate the similarity using the default method of optimal string alignment
stringsim("ca", "abc")

# Calculate the similarity using the Jaro-Winkler method
# The p argument is passed on to stringdist
stringsim("MARTHA", "MATHRA", method = "jw", p = 0.1)
``````

stringdist documentation built on May 29, 2024, 11:13 a.m.