Functions for computation of the similarity between two strings.
jarowinkler(str1, str2, W_1=1/3, W_2=1/3, W_3=1/3, r=0.5)
levenshteinSim(str1, str2)
levenshteinDist(str1, str2)
str1, str2: Two character vectors to compare.
W_1, W_2, W_3: Adjustable weights.
r: Maximum transposition radius, given as a fraction of the length of the shorter string.
String metrics compute a similarity value in the range [0,1] for two strings, with 1 denoting the highest (usually equality) and 0 denoting the lowest degree of similarity. In the context of Record Linkage, string similarities can improve the discernibility between matches and non-matches.
jarowinkler is an implementation of the algorithm by Jaro and Winkler (see references). For the meaning of W_1, W_2, W_3 and r, see the referenced article. For most applications, the default values are reasonable.
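To give a rough feel for how the weights and r enter the computation, here is a minimal base R sketch of a Jaro-Winkler style comparator for a single pair of strings. It is not the package's implementation: the function name jw_sketch, the reading of r as a search window of floor(r * shorter length) characters, and the prefix bonus of 0.1 per common leading character (up to 4) are assumptions taken from the common textbook formulation.

```r
# Hypothetical sketch, not RecordLinkage's code: scalar Jaro-Winkler similarity.
jw_sketch <- function(s1, s2, W_1 = 1/3, W_2 = 1/3, W_3 = 1/3, r = 0.5) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  la <- length(a); lb <- length(b)
  if (la == 0 || lb == 0) return(0)    # empty input: no similarity in this sketch
  window <- floor(r * min(la, lb))     # assumed interpretation of r
  matched_a <- logical(la); matched_b <- logical(lb)
  for (i in seq_len(la)) {
    lo <- max(1, i - window); hi <- min(lb, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched_b[j] && a[i] == b[j]) {
        matched_a[i] <- TRUE; matched_b[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched_a)                  # number of common characters
  if (m == 0) return(0)
  # Transpositions: half the matched characters that appear in a different order.
  transp <- sum(a[matched_a] != b[matched_b]) / 2
  jaro <- W_1 * m / la + W_2 * m / lb + W_3 * (m - transp) / m
  # Winkler adjustment: reward a common prefix of up to 4 characters (assumed 0.1 each).
  k <- 0
  for (i in seq_len(min(4, la, lb))) {
    if (a[i] != b[i]) break
    k <- k + 1
  }
  jaro + k * 0.1 * (1 - jaro)
}

jw_sketch("Andreas", "Anreas")
jw_sketch("Borg", "Bork")
```

Under these assumptions the two calls give about 0.9619 and 0.8833, but agreement with jarowinkler on other inputs is not guaranteed.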
levenshteinDist returns the Levenshtein distance, which cannot be directly used as a valid string comparator.
levenshteinSim is a similarity function based on the Levenshtein distance, calculated by 1 - d(str1,str2) / max(A,B), where d is the Levenshtein distance function and A and B are the lengths of the strings.
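The formula can be tried out without the package. The following sketch uses base R's utils::adist as a stand-in for the distance function d; the name lev_sim_sketch is hypothetical, and the package's own implementation may differ in detail (for example in how empty strings are handled).

```r
# Sketch only: utils::adist stands in for the package's Levenshtein distance.
lev_sim_sketch <- function(str1, str2) {
  # Pairwise edit distance; mapply recycles the shorter argument.
  d <- mapply(function(a, b) as.numeric(utils::adist(a, b)),
              str1, str2, USE.NAMES = FALSE)
  # 1 - d / max(A, B), with A and B the string lengths.
  1 - d / pmax(nchar(str1), nchar(str2))
}

lev_sim_sketch("Andreas", c("Anreas", "Andeas"))
```

Both pairs differ by one deleted character in a string of length 7, so each similarity is 1 - 1/7, approximately 0.857.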
Arguments str1 and str2 are expected to be of type "character". Non-alphabetical characters can be processed. Valid format combinations for the arguments are:
Two arrays with the same dimensions.
Two vectors. The shorter one is recycled as necessary.
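As an illustration of these two input shapes, the sketch below applies an arbitrary scalar comparator elementwise, recycling the shorter vector and restoring the dimensions when both inputs are arrays of the same shape. The helper elementwise and the toy comparator exact are hypothetical names, and that the package functions preserve array dimensions in the same way is an assumption.

```r
# Hypothetical helper: apply a scalar comparator pairwise with recycling.
elementwise <- function(cmp, str1, str2) {
  out <- mapply(cmp, str1, str2, USE.NAMES = FALSE)
  # If both inputs are arrays with identical dimensions, keep that shape.
  if (!is.null(dim(str1)) && identical(dim(str1), dim(str2))) {
    dim(out) <- dim(str1)
  }
  out
}

# Toy comparator for illustration only: 1 for equal strings, 0 otherwise.
exact <- function(a, b) as.numeric(a == b)

# One string against two: the single string is recycled.
elementwise(exact, "Andreas", c("Andreas", "Anreas"))

# Two 2x2 character matrices: the result is again a 2x2 matrix.
m1 <- matrix(c("a", "b", "c", "d"), nrow = 2)
m2 <- matrix(c("a", "x", "c", "y"), nrow = 2)
elementwise(exact, m1, m2)
```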
jarowinkler and levenshteinSim return a numeric vector with similarity values in the interval [0,1]. levenshteinDist returns the edit distance as an integer vector.
String comparison is case-sensitive, which means that, for example, "R" and "r" have a similarity of 0. If this behaviour is undesired, strings should be normalized before processing.
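One simple way to normalize is to convert both inputs to a common case before comparing. The sketch below shows the effect with a stand-in similarity built on base R's utils::adist; the helper name sim_sketch is hypothetical, and the same tolower call could be applied to the arguments of the package functions.

```r
# Stand-in similarity for one pair of strings (not the package function).
sim_sketch <- function(a, b) {
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

sim_sketch("R", "r")                    # case-sensitive: the characters differ
sim_sketch(tolower("R"), tolower("r"))  # normalized: the strings are equal
```

The first call yields 0 and the second 1, since lower-casing removes the only difference between the two strings.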
Andreas Borg, Murat Sariyar
Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association (1990), pp. 354–369.
# compare two strings:
jarowinkler("Andreas","Anreas")
# compare one string with several others:
levenshteinSim("Andreas",c("Anreas","Andeas"))
# compare two vectors of strings:
jarowinkler(c("Andreas","Borg"),c("Andreas","Bork"))
[1] 0.9619048
[1] 0.8571429 0.8571429
[1] 1.0000000 0.8833333