SelectSimilarityFunction: Select Similarity Function for Linkage

View source: R/MTBOptions.R

SelectSimilarityFunction    R Documentation

Select Similarity Function for Linkage

Description

Before calling DeterministicLinkage or ProbabilisticLinkage, a similarity function must be selected for each linking variable. Each element of the setup contains the two variable names and the method. For some methods, further information can be supplied.

Usage

SelectSimilarityFunction(variable1, variable2,
  method = "jw",
  ind_c0 = FALSE, ind_c1 = FALSE,
  m = 0.9, u = 0.1, p = 0.05, epsilon = 0.0004,
  lower = 0.0, upper = 0.0,
  threshold = 0.85, jaroWeightFactor = 1.0, lenNgram = 2)

Arguments

variable1

name of linking variable 1 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector.

variable2

name of linking variable 2 in the data.frame. The column must be of type character, numeric or integer, containing the data to be merged. The data vector must have the same length as the ID vector.

method

linking method. Possible values are:

  • 'exact' = Exact matching

  • 'exactCL' = Exact matching using capital letters

  • 'LCS' = Longest Common Subsequence

  • 'lv' = Levenshtein distance

  • 'dl' = Damerau Levenshtein distance

  • 'jaro' = Jaro similarity

  • 'jw' = Jaro-Winkler similarity

  • 'jw2' = Modified Jaro-Winkler similarity

  • 'ngram' = n-gram similarity

  • 'Gcp' = German census phonetic (Baystat)

  • 'Reth' = Reth-Schek (IBM) phonetic

  • 'Soundex' = Soundex phonetic

  • 'Metaphone' = Metaphone phonetic

  • 'DoubleMetaphone' = Double Metaphone phonetic
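As a rough illustration of how a distance-type method such as 'lv' differs from a similarity-type method, the sketch below uses base R's utils::adist instead of PPRL's own implementation. The normalization of the distance to a value between 0 and 1 is an assumption made for illustration, not PPRL's documented formula.

```r
# Illustration only: base R, not PPRL's internal implementation.
a <- "Meier"
b <- "Meyer"

# Levenshtein edit distance: number of single-character edits
# needed to turn a into b (here one substitution, i -> y)
lv_dist <- drop(utils::adist(a, b))

# One common way to turn a distance into a similarity in [0, 1]
# (assumed normalization, dividing by the longer string length)
lv_sim <- 1 - lv_dist / max(nchar(a), nchar(b))

print(lv_dist)
print(lv_sim)
```

A distance grows as the strings diverge, while a similarity shrinks, which is why the threshold argument is read in opposite directions for the two kinds of method.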

ind_c0

Only used for jw2.

Increases the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are long. It is not an appropriate test when comparing fixed-length fields such as phone and social security numbers. A nonzero value indicates the option is deactivated.

ind_c1

Only used for jw2.

All lower case characters are converted to upper case prior to the comparison. Disabling this feature means that the lower case string "code" will not be recognized as the same as the upper case string "CODE". Also, the adjustment for similar characters section only applies to uppercase characters. A nonzero value indicates the option is deactivated.
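The effect of case conversion can be seen with plain base R string comparison. This is only an illustration of the idea and does not call the jw2 comparator itself.

```r
# Illustration only: base R comparison, not the jw2 comparator.
raw_equal    <- "code" == "CODE"                    # compared as typed
folded_equal <- toupper("code") == toupper("CODE")  # compared after upper-casing

print(raw_equal)
print(folded_equal)
```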

m

Initial m value for the EM algorithm. Only used when linking using ProbabilisticLinkage. 0.0 < m < 1.0.

u

Initial u value for the EM algorithm. Only used when linking using ProbabilisticLinkage. 0.0 < u < 1.0.
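In the Fellegi-Sunter model, m is the probability that a field agrees given that the pair is a true match, and u is the probability that it agrees given that the pair is a non-match. The sketch below shows the agreement and disagreement weights implied by the default starting values. The use of base-2 logarithms is a common convention and is assumed here; it is not taken from PPRL's source.

```r
# Illustrative sketch of Fellegi-Sunter weights; not PPRL's internal code.
m <- 0.9  # P(field agrees | pair is a true match)
u <- 0.1  # P(field agrees | pair is a non-match)

# Weight added when the field agrees
agree_weight <- log2(m / u)

# Weight added when the field disagrees (negative when m > u)
disagree_weight <- log2((1 - m) / (1 - u))

print(agree_weight)
print(disagree_weight)
```

The further apart m and u are, the more a single field contributes to the total matching weight of a record pair.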

p

Initial p value for the EM algorithm. Only used when linking using ProbabilisticLinkage. 0.0 < p < 1.0.

epsilon

Stopping criterion for the EM algorithm. The EM algorithm terminates when the relative change in the log-likelihood is less than epsilon. Only used when linking using ProbabilisticLinkage.
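A relative-change test of this kind can be sketched as follows. The helper em_converged is hypothetical and not part of PPRL, and the exact form of the test used internally is an assumption.

```r
# Hypothetical helper, not part of PPRL: relative-change stopping test.
em_converged <- function(loglik_old, loglik_new, epsilon = 0.0004) {
  # Relative change of the log-likelihood between two EM iterations
  abs((loglik_new - loglik_old) / loglik_old) < epsilon
}

small_step <- em_converged(-1250.0, -1249.9)  # relative change 0.00008
large_step <- em_converged(-1250.0, -1240.0)  # relative change 0.008

print(small_step)
print(large_step)
```

With the default epsilon of 0.0004, the first call reports convergence and the second does not.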

lower

Matches lower than 'lower' are classified as non-match. Everything between 'lower' and 'upper' is classified as possible match. Only used when linking using ProbabilisticLinkage.

upper

Matches higher than 'upper' are classified as match. Everything between 'lower' and 'upper' is classified as possible match. Only used when linking using ProbabilisticLinkage.
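The three-way classification described for 'lower' and 'upper' can be sketched as below. The helper classify_weight is hypothetical, and treating a weight exactly equal to a bound as a possible match is an assumption; the text above does not say how ties are handled.

```r
# Hypothetical helper, not part of PPRL: three-way classification.
classify_weight <- function(w, lower, upper) {
  ifelse(w > upper, "match",
         ifelse(w < lower, "non-match", "possible match"))
}

# Example weights chosen for illustration only
classes <- classify_weight(c(-4.2, 1.5, 7.8), lower = 0, upper = 5)
print(classes)
```

When 'lower' and 'upper' are equal, as with the defaults of 0.0, the possible-match band collapses to a single value.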

threshold

If using string similarities: Outputs only matches above the similarity threshold value. If using string distances: Outputs only matches below the set threshold distance.
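The two readings of the threshold can be compared on a few name pairs. The sketch uses base R's utils::adist as a stand-in for a distance method and an assumed normalization as a stand-in for a similarity method; neither is PPRL's own scoring.

```r
# Illustration only: base R stand-ins, not PPRL's internal scoring.
a <- c("Meier", "Meier", "Schmidt")
b <- c("Meyer", "Mayr",  "Schmidt")

# Pairwise Levenshtein distances (diagonal of the full distance matrix)
lv_dist <- diag(utils::adist(a, b))

# Assumed normalization of the distance to a similarity in [0, 1]
lv_sim <- 1 - lv_dist / pmax(nchar(a), nchar(b))

# Distance-type method: keep pairs below the threshold distance
keep_by_distance <- lv_dist < 2

# Similarity-type method: keep pairs above the threshold similarity
keep_by_similarity <- lv_sim > 0.85

print(data.frame(a, b, lv_dist, lv_sim, keep_by_distance, keep_by_similarity))
```

The default threshold of 0.85 suits similarity methods; a distance method needs a threshold on the scale of edit counts instead.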

jaroWeightFactor

The weight factor that determines the Jaro-adjusted matching weight. With the Jaro weight adjustment, the matching weight is adjusted according to the degree of similarity between the variable values. Only used when linking using ProbabilisticLinkage.

lenNgram

Length of the n-grams; must be between 1 and 4. Only used for the method 'ngram'.
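To show what the n-gram length changes, the sketch below splits a string into n-grams and compares two names with a Dice coefficient over their bigram sets. The helpers ngrams and dice are hypothetical; whether PPRL pads the strings or uses exactly this coefficient is not stated here, so both choices are assumptions.

```r
# Hypothetical helpers, not part of PPRL.

# Split a string into overlapping n-grams (no padding)
ngrams <- function(s, n = 2) {
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

# Dice coefficient on the sets of unique n-grams
dice <- function(x, y, n = 2) {
  gx <- unique(ngrams(x, n))
  gy <- unique(ngrams(y, n))
  2 * length(intersect(gx, gy)) / (length(gx) + length(gy))
}

bigrams  <- ngrams("Meier", 2)
trigrams <- ngrams("Meier", 3)
sim      <- dice("Meier", "Meyer", 2)

print(bigrams)
print(trigrams)
print(sim)
```

Longer n-grams make the comparison stricter, because a single differing character invalidates more of the n-grams that contain it.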

Value

Calling the function will not return anything.

References

Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.

Schnell, R., Bachteler, T., Reiher, J. (2004): A toolbox for record linkage. Austrian Journal of Statistics 33(1-2): 125-133.

Winkler, W. E. (1988): Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods Vol. 667, American Statistical Association: 671.

See Also

DeterministicLinkage, ProbabilisticLinkage, SelectBlockingFunction, StandardizeString

Examples

# load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, header = FALSE, sep = "\t",
  colClasses = "character")

# define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")

# Select first name and last name as linking variables,
# to be linked using Jaro-Winkler similarity (first name)
# and exact matching (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw",
  ind_c0 = FALSE, ind_c1 = FALSE, m = 0.9, u = 0.1,
  lower = 0.0, upper = 0.0)
l2 <- SelectSimilarityFunction("V8", "V8", method = "exact")

# Link the data as specified in bl and l1/l2
# (in this small example data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData,
  testData$V1, testData, similarity = c(l1, l2), blocking = bl)


PPRL documentation built on Nov. 10, 2022, 5:41 p.m.