FuzzMatcher: Fuzzy character string matching ( ratios )

Description Usage Arguments Format Details Methods References Examples

Description

Fuzzy character string matching ( ratios )

Usage

1
# init <- FuzzMatcher$new(decoding = NULL)

Arguments

decoding

either NULL or a character string. If not NULL then the decoding parameter takes one of the standard python encodings (such as 'utf-8'). See the details and references link for more information.

string1

a character string.

string2

a character string.

force_ascii

allow only ASCII characters (force convert to ascii)

full_process

either TRUE or FALSE. If TRUE then it process the string by : 1. removing all but letters and numbers, 2. trim whitespace, 3. force to lower case

Format

An object of class R6ClassGenerator of length 24.

Details

the decoding parameter is useful in case of non-ascii character strings. If this parameter is not NULL then the force_ascii parameter (if applicable) is internally set to FALSE. Decoding applies only to python 2 configurations, as in python 3 character strings are decoded to unicode by default.

the Partial_token_set_ratio method works in the following way : 1. Find all alphanumeric tokens in each string, 2. treat them as a set, 3. construct two strings of the form, <sorted_intersection><sorted_remainder>, 4. take ratios of those two strings, 5. controls for unordered partial matches (HERE partial match is TRUE)

the Partial_token_sort_ratio method returns the ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing.

the Ratio method returns a ration in form of an integer value based on a SequenceMatcher-like class, which is built on top of the Levenshtein package (https://github.com/miohtama/python-Levenshtein)

the QRATIO method performs a quick ratio comparison between two strings. Runs full_process from utils on both strings. Short circuits if either of the strings is empty after processing.

the WRATIO method returns a measure of the sequences' similarity between 0 and 100, using different algorithms. Steps in the order they occur : 1. Run full_process from utils on both strings, 2. Short circuit if this makes either string empty, 3. Take the ratio of the two processed strings (fuzz.ratio), 4. Run checks to compare the length of the strings (If one of the strings is more than 1.5 times as long as the other use partial_ratio comparisons - scale partial results by 0.9 - this makes sure only full results can return 100 - If one of the strings is over 8 times as long as the other instead scale by 0.6), 5. Run the other ratio functions (if using partial ratio functions call partial_ratio, partial_token_sort_ratio and partial_token_set_ratio scale all of these by the ratio based on length otherwise call token_sort_ratio and token_set_ratio all token based comparisons are scaled by 0.95 - on top of any partial scalars) 6. Take the highest value from these results round it and return it as an integer.

the UWRATIO method returns a measure of the sequences' similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode

the UQRATIO method returns a Unicode quick ratio. It calls QRATIO with force_ascii set to FALSE.

the Token_sort_ratio method returns a measure of the sequences' similarity between 0 and 100 but sorting the token before comparing

the Partial_ratio returns the ratio of the most similar substring as a number between 0 and 100.

the Token_set_ratio method works in the following way : 1. Find all alphanumeric tokens in each string, 2. treat them as a set, 3. construct two strings of the form, <sorted_intersection><sorted_remainder>, 4. take ratios of those two strings, 5. controls for unordered partial matches (HERE partial match is FALSE)

Methods

FuzzMatcher$new(decoding = NULL)
--------------
Partial_token_set_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)
--------------
Partial_token_sort_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)
--------------
Ratio(string1 = NULL, string2 = NULL)
--------------
QRATIO(string1 = NULL, string2 = NULL, force_ascii = TRUE)
--------------
WRATIO(string1 = NULL, string2 = NULL, force_ascii = TRUE)
--------------
UWRATIO(string1 = NULL, string2 = NULL)
--------------
UQRATIO(string1 = NULL, string2 = NULL)
--------------
Token_sort_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)
--------------
Partial_ratio(string1 = NULL, string2 = NULL)
--------------
Token_set_ratio(string1 = NULL, string2 = NULL, force_ascii = TRUE, full_process = TRUE)

References

https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py, https://docs.python.org/3/library/codecs.html#standard-encodings

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
if (check_availability()) {


  library(fuzzywuzzyR)

  s1 = "Atlanta Falcons"

  s2 = "New York Jets"

  init = FuzzMatcher$new()

  init$Partial_token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE)

  init$Partial_token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE)

  init$Ratio(string1 = s1, string2 = s2)

  init$QRATIO(string1 = s1, string2 = s2, force_ascii = TRUE)

  init$WRATIO(string1 = s1, string2 = s2, force_ascii = TRUE)

  init$UWRATIO(string1 = s1, string2 = s2)

  init$UQRATIO(string1 = s1, string2 = s2)

  init$Token_sort_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE)

  init$Partial_ratio(string1 = s1, string2 = s2)

  init$Token_set_ratio(string1 = s1, string2 = s2, force_ascii = TRUE, full_process = TRUE)


}

fuzzywuzzyR documentation built on May 2, 2019, 8:53 a.m.