SelectSimilarityFunction | R Documentation |
To call DeterministicLinkage or ProbabilisticLinkage, a similarity function must be selected for each variable. Each element of the setup contains the two variable names and the method. For some methods, further information can be supplied.
SelectSimilarityFunction(variable1, variable2, method = "jw", ind_c0 = FALSE, ind_c1 = FALSE, m = 0.9, u = 0.1, p = 0.05, epsilon = 0.0004, lower = 0.0, upper = 0.0, threshold = 0.85, jaroWeightFactor = 1.0, lenNgram = 2)
variable1 |
name of linking variable 1 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector. |
variable2 |
name of linking variable 2 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector. |
method |
linking method. Possible values are:
|
ind_c0 |
Only used for jw2. Increases the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are long. It is not an appropriate test when comparing fixed-length fields such as phone and social security numbers. A nonzero value indicates the option is deactivated. |
ind_c1 |
Only used for jw2. All lower case characters are converted to upper case prior to the comparison. Disabling this feature means that the lower case string "code" will not be recognized as the same as the upper case string "CODE". Also, the adjustment for similar characters section only applies to uppercase characters. A nonzero value indicates the option is deactivated. |
m |
Initial m value for the EM algorithm. Only used when linking using |
u |
Initial u value for the EM algorithm. Only used when linking using |
p |
Initial p value for the EM algorithm. Only used when linking using |
epsilon |
epsilon is the stopping criterion for the EM algorithm. The EM algorithm can be terminated when the relative change in the log-likelihood is less than epsilon. Only used when linking using |
lower |
Matches lower than 'lower' are classified as non-match. Everything between 'lower' and 'upper' is classified as possible match. Only used when linking using |
upper |
Matches higher than 'upper' are classified as match. Everything between 'lower' and 'upper' is classified as possible match. Only used when linking using |
threshold |
If using string similarities: Outputs only matches above the similarity threshold value. If using string distances: Outputs only matches below the set threshold distance. |
jaroWeightFactor |
The Jaro weight adjustment modifies the matching weight according to the degree of similarity between the variable values. jaroWeightFactor is the weight factor that determines the Jaro-adjusted matching weight. Only used when linking using |
lenNgram |
Length of ngrams. Only used for the method ngram. Length of ngrams must be between 1 and 4. |
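To illustrate what lenNgram controls, the following is a minimal base R sketch that splits a string into overlapping character n-grams. The helper `char_ngrams` is hypothetical and not part of PPRL; whether the package pads strings or how it scores the resulting n-gram sets is not stated here, so the sketch makes no assumption about either.

```r
# Hypothetical helper, not part of PPRL: split a single string into
# overlapping character n-grams of length n, without any padding.
char_ngrams <- function(x, n = 2) {
  stopifnot(n >= 1, n <= 4)              # range documented for lenNgram
  len <- nchar(x)
  if (len < n) return(character(0))      # string too short for any n-gram
  starts <- seq_len(len - n + 1)         # start position of each n-gram
  substring(x, starts, starts + n - 1)
}

char_ngrams("smith", n = 2)  # "sm" "mi" "it" "th"
char_ngrams("smith", n = 3)  # "smi" "mit" "ith"
```

Larger values of n produce fewer, more specific n-grams, so a single typing error affects a larger share of them.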
Calling the function will not return anything.
Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
Schnell, R., Bachteler, T., Reiher, J. (2004): A toolbox for record linkage. Austrian Journal of Statistics 33(1-2): 125-133.
Winkler, W. E. (1988): Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods Vol. 667, American Statistical Association: 671.
DeterministicLinkage, ProbabilisticLinkage, SelectBlockingFunction, StandardizeString
# load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, head = FALSE, sep = "\t", colClasses = "character")

# define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")

# Select first name and last name as linking variables,
# to be linked using Jaro-Winkler (first name)
# and exact matching (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw",
  ind_c0 = FALSE, ind_c1 = FALSE, m = 0.9, u = 0.1, lower = 0.0, upper = 0.0)
l2 <- SelectSimilarityFunction("V8", "V8", method = "exact")

# Link the data as specified in bl and l1/l2
# (in this small example data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData, testData$V1, testData,
  similarity = c(l1, l2), blocking = bl)
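The example passes lower = 0.0 and upper = 0.0. As a rough illustration of how these two cut-offs partition results according to the argument descriptions, here is a base R sketch. The helper `classify_weight` and the sample weights are hypothetical and not part of PPRL; in particular, how the package treats values exactly equal to a cut-off is an assumption.

```r
# Hypothetical helper, not part of PPRL: three-way classification
# following the descriptions of 'lower' and 'upper'.
# Assumption: values equal to a cut-off count as "possible match".
classify_weight <- function(weight, lower, upper) {
  stopifnot(lower <= upper)
  ifelse(weight > upper, "match",
         ifelse(weight < lower, "non-match", "possible match"))
}

# Made-up weights, for illustration only
weights <- c(-3.2, 0.4, 1.7, 6.5)
classify_weight(weights, lower = 0, upper = 5)
# "non-match" "possible match" "possible match" "match"
```

Under this reading, setting lower and upper to the same value leaves only values exactly on the cut-off in the "possible match" band, so nearly every pair is classified as either a match or a non-match.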