Functions for computation of the similarity between two strings.
str1, str2: Two character vectors to compare.
r: Maximum transposition radius. A fraction of the length of the shorter string.
String metrics compute a similarity value in the range [0,1] for two strings, with 1 denoting the highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record Linkage, string similarities can improve the discernibility between matches and non-matches.
jarowinkler is an implementation of the algorithm by Jaro and Winkler (see references). For the meaning of r, see the referenced article. For most applications, the default values are reasonable.
levenshteinDist returns the Levenshtein distance, which cannot be used directly as a string comparator because it is not scaled to the range [0,1]. levenshteinSim is a similarity function based on the Levenshtein distance, calculated as 1 - d(str1,str2) / max(A,B), where d is the Levenshtein distance function and A and B are the lengths of the strings.
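The formula above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it assumes that `utils::adist` with its default unit costs yields the same Levenshtein distance, and `lev_sim_pair` is a hypothetical helper name introduced here.

```r
# Hypothetical helper: Levenshtein-based similarity for a single pair of strings.
lev_sim_pair <- function(a, b) {
  d <- as.vector(utils::adist(a, b))   # edit distance d(a, b) with unit costs
  1 - d / max(nchar(a), nchar(b))      # NaN if both strings are empty
}

# "Anreas" is "Andreas" with one character deleted, so d = 1 and max(A, B) = 7.
lev_sim_pair("Andreas", "Anreas")      # 1 - 1/7
```

Identical strings give a distance of 0 and therefore a similarity of 1.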
str1 and str2 are expected to be of type character. Non-alphabetical characters can be processed. Valid format combinations for the arguments are:
Two arrays with the same dimensions.
Two vectors. The shorter one is recycled as necessary.
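The recycling rule for two vectors can be made explicit with a small base R sketch. It is an assumption that this mirrors how the package pairs its inputs; the helper name `lev_dist_recycled` is hypothetical, and `utils::adist` stands in for the package's distance function.

```r
# Hypothetical helper: element-wise edit distance with explicit recycling.
lev_dist_recycled <- function(str1, str2) {
  n <- max(length(str1), length(str2))
  a <- rep_len(str1, n)                # recycle the shorter vector
  b <- rep_len(str2, n)
  mapply(function(x, y) as.vector(utils::adist(x, y)), a, b, USE.NAMES = FALSE)
}

# One string compared against three: the single string is reused for each pair.
lev_dist_recycled("Andreas", c("Andreas", "Anreas", "Andeas"))
```

The result has one entry per pair, so its length equals that of the longer input.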
A numeric vector with similarity values in the interval [0,1]. For levenshteinDist, the edit distance as an integer vector.
String comparison is case-sensitive, which means that, for example, "R" and "r" have a similarity of 0. If this behaviour is undesired, strings should be normalized before processing.
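One way to normalize is to convert both inputs to the same case before comparing. The sketch below uses base R only; `sim_pair` is a hypothetical stand-in for a case-sensitive comparator (a Levenshtein-based similarity built on `utils::adist`), not the package's own function.

```r
# Hypothetical case-sensitive comparator for a single pair of strings.
sim_pair <- function(a, b) {
  d <- as.vector(utils::adist(a, b))   # edit distance with unit costs
  1 - d / max(nchar(a), nchar(b))
}

raw        <- sim_pair("R", "r")                     # one substitution over length 1
normalized <- sim_pair(tolower("R"), tolower("r"))   # identical after lowercasing
c(raw = raw, normalized = normalized)
```

Lowercasing is only one possible normalization; trimming whitespace or transliterating special characters may be needed as well, depending on the data.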
Andreas Borg, Murat Sariyar
Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association (1990), pp. 354–369.