fuzzy_match_all: Fuzzy match (term vector).
In datavores/vgsample: A Representative Sample of Video Games

Checks all terms in a term vector for fuzzy matches.

fuzzy_match_all(term_vector, max_dist = 0.1, min_test_length = NA,
  skip_pure_digit = FALSE, assume_unique = FALSE, match_max = 10,
  remove_matches = FALSE, dist_method = "jw", jw_penalty = 0)

`term_vector`	Character vector of terms to be evaluated.
`max_dist`	Numeric from 0 to 1. Distance threshold beyond which two terms are treated as not matching. See agrepl.
`min_test_length`	Integer. Sets the minimum length a term must have to be evaluated at all. Note: Excluded terms can still be matched against; they just won't be used as source terms.
`skip_pure_digit`	Boolean. If TRUE, a term that consists only of digits will not be evaluated at all. Note: Same behavior as for min_test_length.
`assume_unique`	Boolean. If TRUE, the function assumes that 100% matches (exact duplicates) have already been filtered out. In this case, the function attempts to minimize matches by ignoring any terms that have more than match_max matches. The logic is that, in a unique set, terms with high match rates are more likely to be "promiscuous" terms (i.e., terms with very common characters/patterns, such as "the") than duplicate terms. Such terms can still be matched against but will not be treated as source terms. If remove_matches is set to TRUE, this also prevents excess removal of terms from the population due to high match rates from early terms.
`match_max`	Integer. If assume_unique == TRUE, then this sets the threshold for excluding "high" match terms from evaluation.
`remove_matches`	Boolean. If TRUE, terms will be removed from the population being evaluated against once they are flagged as a match. This shrinks the term population whenever associations are discovered, reducing the number of comparisons for subsequent source terms. It also avoids a variety of tricky issues that arise when terms can be matched multiple times or when terms can act as both source and match.
`dist_method`	The method used to measure similarity between two strings. See "?stringdist" for details and links to method descriptions.
`jw_penalty`	Numeric. With the default dist_method = "jw" and jw_penalty = 0, the similarity metric is the Jaro distance. A nonzero penalty converts it to the Jaro-Winkler distance.
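The source-term filters described for min_test_length and skip_pure_digit can be illustrated with a small base R sketch. The helper name `is_source_eligible` is hypothetical and not part of vgsample, and treating min_test_length as an inclusive bound is an assumption.

```r
# Hypothetical helper (not part of vgsample): flags which terms would be
# eligible to act as source terms. Ineligible terms are only skipped as
# sources; they would still remain in the pool that other terms match against.
is_source_eligible <- function(term_vector, min_test_length = NA,
                               skip_pure_digit = FALSE) {
  eligible <- rep(TRUE, length(term_vector))
  if (!is.na(min_test_length)) {
    # Assumption: a term of exactly min_test_length characters is kept.
    eligible <- eligible & nchar(term_vector) >= min_test_length
  }
  if (skip_pure_digit) {
    # Drop terms made up of digits only, such as "1942".
    eligible <- eligible & !grepl("^[0-9]+$", term_vector)
  }
  eligible
}

terms <- c("zelda", "zelda 2", "2", "1942", "go")
eligible <- is_source_eligible(terms, min_test_length = 3,
                               skip_pure_digit = TRUE)
print(data.frame(term = terms, source_eligible = eligible))
```

Here only "zelda" and "zelda 2" would serve as source terms, while "2", "1942", and "go" could still appear as matches for them.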

Takes a character vector (population of terms) as input. Uses fuzzy_match to evaluate each term against the rest of the terms. Behavior can be tweaked in a variety of ways that will both impact function runtime and adjust the criteria for what terms should be evaluated and what constitutes a match among terms. Returns a list of the evaluated terms and their associated matches.
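As a rough illustration of the overall idea, the sketch below compares every term against the rest and collects matches in a named list. It is not the package's implementation: `norm_dist` and `sketch_match_all` are hypothetical names, the metric is a normalized Levenshtein distance from base R's `adist` rather than the Jaro distance, the `<=` comparison against max_dist is an assumption, and none of the pruning or filtering options are modeled.

```r
# Stand-in metric (assumption): Levenshtein distance divided by the length
# of the longer string, giving a value between 0 and 1.
norm_dist <- function(a, b) {
  as.vector(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Hypothetical sketch: each term acts as the source term once and is
# compared against all other terms in the vector.
sketch_match_all <- function(term_vector, max_dist = 0.1) {
  result <- list()
  for (i in seq_along(term_vector)) {
    source_term <- term_vector[i]
    others <- term_vector[-i]
    hits <- others[norm_dist(source_term, others) <= max_dist]
    # Only source terms with at least one match are recorded.
    if (length(hits) > 0) result[[source_term]] <- hits
  }
  result
}

terms <- c("final fantasy", "final fantasy.", "tetris", "tetrs", "pong")
matches <- sketch_match_all(terms, max_dist = 0.2)
str(matches)
```

Because this naive version never shrinks the pool, each matching pair is reported twice, once from each side.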

Note: The function is structured so that no backwards evaluation occurs. If a term has been compared against the term population (i.e., served as the source term), it is removed from further comparisons. This reduces the number of comparisons that are required as the code progresses and minimizes redundant work.
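The saving from forward-only evaluation can be made concrete by counting comparisons. The function below is a hypothetical counting sketch and runs no actual string matching; it only assumes that each source term leaves the pool once it has been evaluated.

```r
# Hypothetical sketch: counts pairwise comparisons for n_terms terms, with
# and without removing each source term from the pool after its turn.
count_comparisons <- function(n_terms, forward_only = TRUE) {
  remaining <- seq_len(n_terms)
  total <- 0
  for (i in seq_len(n_terms)) {
    if (forward_only) {
      # The source term leaves the pool for good, so later terms skip it.
      remaining <- setdiff(remaining, i)
      total <- total + length(remaining)
    } else {
      # Every term is compared against every other term.
      total <- total + (n_terms - 1)
    }
  }
  total
}

n <- 5
forward <- count_comparisons(n, forward_only = TRUE)
full <- count_comparisons(n, forward_only = FALSE)
cat("forward-only:", forward, " full:", full, "\n")
```

For n terms this is n(n-1)/2 comparisons instead of n(n-1), so the forward-only pass does half the work before any further pruning from remove_matches.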

Returns a list of the evaluated terms and their associated matches.

datavores/vgsample documentation built on May 14, 2019, 8:59 p.m.