fuzzy_match_all: Fuzzy match (term vector).


Description

Checks all terms in a term vector for fuzzy matches.

Usage

fuzzy_match_all(term_vector, max_dist = 0.1, min_test_length = NA,
  skip_pure_digit = FALSE, assume_unique = FALSE, match_max = 10,
  remove_matches = FALSE, dist_method = "jw", jw_penalty = 0)

Arguments

term_vector

Character vector of terms to be evaluated.

max_dist

Numeric from 0 to 1. Distance threshold beyond which two terms are not treated as a match. See agrepl.

min_test_length

Integer. Sets minimum length for term to be evaluated at all. Note: Excluded terms can still be matched against, they just won't be used as source terms.

skip_pure_digit

Boolean. If TRUE, a term that consists only of digits will not be evaluated at all. Note: Same behavior as for min_test_length.
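The source-term rules described for min_test_length and skip_pure_digit can be sketched in base R. This is only an illustration of the documented behavior, not the package's code: is_source_eligible is a hypothetical helper, and treating min_test_length as an inclusive lower bound is an assumption.

```r
# Hypothetical helper: decide which terms may act as source terms.
# Excluded terms stay in the population and can still be matched against.
is_source_eligible <- function(term_vector, min_test_length = NA,
                               skip_pure_digit = FALSE) {
  eligible <- rep(TRUE, length(term_vector))
  if (!is.na(min_test_length)) {
    # Assumption: terms shorter than min_test_length are skipped as sources.
    eligible <- eligible & nchar(term_vector) >= min_test_length
  }
  if (skip_pure_digit) {
    # Terms made up only of digits are skipped as sources.
    eligible <- eligible & !grepl("^[0-9]+$", term_vector)
  }
  eligible
}

terms <- c("apple", "appel", "12345", "ab")
eligible <- is_source_eligible(terms, min_test_length = 3,
                               skip_pure_digit = TRUE)
```

Here "12345" and "ab" are flagged as ineligible sources, but nothing is dropped from terms, so both remain available as match targets.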

assume_unique

Boolean. If TRUE, the function assumes that 100% matches (exact duplicates) have already been filtered out. In this case, the function attempts to minimize matches by ignoring any terms that have more than match_max matches. The logic is that, in a unique set, a term with a high match rate is more likely to be a "promiscuous" term (i.e., one with very common characters/patterns, such as "the") than a duplicate term. Such terms can still be matched against but will not be treated as source terms. If remove_matches is set to TRUE, this also prevents excess removal of terms from the population due to high match rates from early terms.

match_max

Integer. If assume_unique == TRUE, then this sets the threshold for excluding "high" match terms from evaluation.
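A minimal sketch of how assume_unique and match_max might interact, based only on the description above. keep_source_matches is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: should the matches found for a source term be kept?
keep_source_matches <- function(matches, assume_unique = FALSE,
                                match_max = 10) {
  # Without assume_unique, every source term keeps its matches.
  if (!assume_unique) {
    return(TRUE)
  }
  # With assume_unique, a term with more than match_max matches is
  # treated as "promiscuous" and ignored as a source term.
  length(matches) <= match_max
}

few  <- keep_source_matches(c("colour", "colr"), assume_unique = TRUE,
                            match_max = 2)
many <- keep_source_matches(c("then", "they", "them"), assume_unique = TRUE,
                            match_max = 2)
```

In this sketch few is TRUE and many is FALSE, since three matches exceed the match_max of 2.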

remove_matches

Boolean. If TRUE, terms will be removed from the population being evaluated against if they are ever flagged as a match. This shrinks the term population any time associations are discovered, reducing the number of comparisons for subsequent source terms. It also avoids a variety of tricky issues that arise when terms can be matched multiple times or when terms can act as both source and match.

dist_method

The method used to measure similarity between two strings. See "?stringdist" for details and links to method descriptions.

jw_penalty

Numeric. With the default dist_method of "jw" and a penalty of 0, the metric is the Jaro distance. A penalty greater than 0 switches the metric to the Jaro-Winkler distance.
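The difference between the two metrics can be seen by calling stringdist directly. This sketch assumes the stringdist package is installed and that jw_penalty is passed through to its p argument, which the documentation above does not state explicitly.

```r
library(stringdist)  # assumes the stringdist package is installed

# With p = 0, method "jw" gives the plain Jaro distance.
d_jaro <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0)

# A positive p (stringdist allows values up to 0.25) gives Jaro-Winkler,
# which lowers the distance for strings that share a common prefix.
d_jw <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)
```

Because both strings start with "MAR", d_jw comes out smaller than d_jaro, so a pair like this is more likely to fall under a given max_dist once a penalty is set.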

Details

Takes a character vector (population of terms) as input. Uses fuzzy_match to evaluate each term against the rest of the terms. Behavior can be tweaked in a variety of ways that will both impact function runtime and adjust the criteria for what terms should be evaluated and what constitutes a match among terms. Returns a list of the evaluated terms and their associated matches.

Note: The function is structured so that no backwards evaluation occurs. If a term has been compared against the term population (i.e., served as the source term), it is removed from further comparisons. This reduces the number of comparisons that are required as the code progresses and minimizes redundant work.
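The forward-only evaluation described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: sketch_match_all and scaled_dist are hypothetical names, and the distance is a stand-in (edit distance from adist scaled by the longer string's length) rather than the Jaro distance the function uses by default.

```r
# Stand-in distance: edit distance scaled to the range 0 to 1.
# NOT the Jaro distance that fuzzy_match_all uses by default.
scaled_dist <- function(a, b) {
  as.vector(adist(a, b)) / pmax(nchar(a), nchar(b))
}

sketch_match_all <- function(term_vector, max_dist = 0.1,
                             remove_matches = FALSE) {
  population <- term_vector
  results <- list()
  while (length(population) > 1) {
    source_term <- population[1]
    # The source term leaves the population first, so no later term
    # is ever compared back against it.
    population <- population[-1]
    is_match <- scaled_dist(source_term, population) <= max_dist
    if (any(is_match)) {
      results[[source_term]] <- population[is_match]
      # Optionally shrink the population whenever matches are found.
      if (remove_matches) {
        population <- population[!is_match]
      }
    }
  }
  results
}

terms <- c("colour", "color", "banana", "bananna", "kiwi")
res <- sketch_match_all(terms, max_dist = 0.2, remove_matches = TRUE)
```

With these inputs, res pairs "colour" with "color" and "banana" with "bananna". Since remove_matches is TRUE, "color" and "bananna" are dropped once matched and never serve as source terms themselves.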

Value

Returns a list of the evaluated terms and their associated matches.


datavores/vgsample documentation built on May 14, 2019, 8:59 p.m.