## Description

Functions for computation of the similarity between two strings.

## Usage

```
jarowinkler(str1, str2, W_1=1/3, W_2=1/3, W_3=1/3, r=0.5)
levenshteinSim(str1, str2)
levenshteinDist(str1, str2)
```

## Arguments

`str1,str2` Two character vectors to compare.

`W_1,W_2,W_3` Adjustable weights.

`r` Maximum transposition radius, given as a fraction of the length of the shorter string.

## Details

String metrics compute a similarity value in the range [0,1] for two strings, with 1 denoting the highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record Linkage, string similarities can improve the discernibility between matches and non-matches.

`jarowinkler` is an implementation of the algorithm by Jaro and Winkler (see references). For the meaning of `W_1`, `W_2`, `W_3` and `r` see the referenced article. For most applications, the default values are reasonable.
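To give a rough idea of what the weights do, the following is a minimal base R sketch of the textbook Jaro-Winkler comparator for two single strings. It is not the package's implementation: the helper name `jaro_winkler_sketch`, the prefix scaling factor `p`, and the search window (half the longer string's length minus one, rather than the `r` argument) are assumptions made for illustration. The weights are assumed to scale the three terms of the Jaro formula.

```r
# Hypothetical helper, for illustration only (not part of RecordLinkage).
jaro_winkler_sketch <- function(s1, s2, W_1 = 1/3, W_2 = 1/3, W_3 = 1/3, p = 0.1) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  if (length(a) == 0 || length(b) == 0) return(0)

  # Textbook search window (assumption; the package uses `r` instead).
  window <- max(floor(max(length(a), length(b)) / 2) - 1, 0)

  matched_a <- rep(FALSE, length(a))
  matched_b <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    lo <- max(1, i - window)
    hi <- min(length(b), i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      # Assign each character of `a` to the first unused equal character of `b`.
      if (!matched_b[j] && a[i] == b[j]) {
        matched_a[i] <- TRUE
        matched_b[j] <- TRUE
        break
      }
    }
  }

  m <- sum(matched_a)
  if (m == 0) return(0)

  # Half the number of matched characters that appear in a different order.
  transpositions <- sum(a[matched_a] != b[matched_b]) / 2

  # Weighted Jaro similarity.
  jaro <- W_1 * m / length(a) + W_2 * m / length(b) +
    W_3 * (m - transpositions) / m

  # Winkler adjustment: reward a common prefix of up to four characters.
  k <- min(4, length(a), length(b))
  prefix <- sum(cumprod(a[seq_len(k)] == b[seq_len(k)]))
  jaro + prefix * p * (1 - jaro)
}

jaro_winkler_sketch("Andreas", "Anreas")  # approx. 0.9619
jaro_winkler_sketch("Borg", "Bork")       # approx. 0.8833
```

With equal weights, each of the three terms contributes one third of the Jaro score; shifting weight towards `W_3` would make the result more sensitive to transpositions in this sketch.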

`levenshteinDist` returns the Levenshtein distance. Because this is an unnormalized edit count rather than a value in [0,1], it cannot be used directly as a string comparator. `levenshteinSim` is a similarity function based on the Levenshtein distance, calculated as 1 - d(str1,str2) / max(A,B), where d is the Levenshtein distance function and A and B are the lengths of the strings.
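The formula can be reproduced in base R as a quick sanity check. The sketch below uses `utils::adist` for the edit distance instead of the package's own routine; the helper name `lev_sim_sketch` is hypothetical, and the handling of two empty strings (which yields `NaN` here) is not meant to mirror the package.

```r
# Hypothetical helper, for illustration only (not part of RecordLinkage).
lev_sim_sketch <- function(str1, str2) {
  # Recycle the shorter vector to the length of the longer one.
  n  <- max(length(str1), length(str2))
  s1 <- rep_len(str1, n)
  s2 <- rep_len(str2, n)

  # Element-wise Levenshtein distance d(str1, str2).
  d <- mapply(function(x, y) as.numeric(utils::adist(x, y)),
              s1, s2, USE.NAMES = FALSE)

  # 1 - d / max(A, B), with A and B the string lengths.
  1 - d / pmax(nchar(s1), nchar(s2))
}

lev_sim_sketch("Andreas", c("Anreas", "Andeas"))  # approx. 0.8571 0.8571
```

Both comparisons differ by one deleted character and the longer string has seven characters, so each similarity is 1 - 1/7.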

Arguments `str1` and `str2` are expected to be of type `"character"`. Non-alphabetical characters can be processed. Valid format combinations for the arguments are:

• Two arrays with the same dimensions.

• Two vectors. The shorter one is recycled as necessary.

## Value

For `jarowinkler` and `levenshteinSim`, a numeric vector with similarity values in the interval [0,1]. For `levenshteinDist`, the edit distance as an integer vector.

## Note

String comparison is case-sensitive, which means that for example `"R"` and `"r"` have a similarity of 0. If this behaviour is undesired, strings should be normalized before processing.
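One simple form of normalization is converting both inputs to lower case with base R's `tolower` before comparing them. The sketch below uses exact comparison to show the effect; the helper name `normalize_str` is hypothetical, and real data may need further cleaning that is not shown here.

```r
# Hypothetical helper: map all strings to lower case.
normalize_str <- function(x) tolower(x)

raw <- c("R", "ANDREAS")
ref <- c("r", "Andreas")

# Exact, case-sensitive comparison of the raw strings
raw == ref                                # FALSE FALSE

# The same comparison after normalizing both sides
normalize_str(raw) == normalize_str(ref)  # TRUE TRUE
```

The normalized vectors can then be passed to `jarowinkler` or `levenshteinSim` in place of the raw ones.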

## Author(s)

Andreas Borg, Murat Sariyar

## References

Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association (1990), pp. 354–369.

## Examples

```
# compare two strings:
jarowinkler("Andreas","Anreas")
# compare one string with several others:
levenshteinSim("Andreas",c("Anreas","Andeas"))
# compare two vectors of strings:
jarowinkler(c("Andreas","Borg"),c("Andreas","Bork"))
```

### Example output

```
[1] 0.9619048
[1] 0.8571429 0.8571429
[1] 1.0000000 0.8833333
```

RecordLinkage documentation built on Jan. 10, 2022, 1:07 a.m.