lev_weighted_token_ratio: Weighted token similarity measure

View source: R/weighted.R

lev_weighted_token_ratio (R Documentation)

Weighted token similarity measure

Description

Computes a token-based similarity between two strings while letting you assign weights to specific tokens. This is useful, for example, when a frequently occurring token (such as "ltd" in company names) carries little information and should contribute less to the score. See Examples.

Usage

lev_weighted_token_ratio(a, b, weights = list(), ...)

Arguments

a, b

The input strings

weights

Named list of token weights, where each name is a token and each value is its weight. For example, weights = list(foo = 0.9, bar = 0.1). Any token omitted from weights is given a weight of 1.

...

Additional arguments to be passed to stringdist::stringdistmatrix() or stringdist::stringsimmatrix().

Value

A numeric (double) similarity score.

Details

The algorithm used here is as follows:

  • Tokenise the input strings

  • Compute the edit distance between each pair of tokens

  • Compute the maximum possible edit distance for each pair of tokens

  • Apply any weights from the weights argument

  • Return 1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distances))
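The steps above can be sketched in base R. This is a rough illustration, not the package's implementation: sketch_weighted_token_ratio() is a hypothetical helper, and several details are assumptions made for the sketch (tokens are lower-cased and split on whitespace, tokens are paired by position, a missing token counts as an empty string, the maximum edit distance of a pair is the length of its longer token, and a pair's weight is the mean of its two token weights). It uses utils::adist() for the Levenshtein distance rather than stringdist.

```r
# Hypothetical helper for illustration only; not part of levitate.
sketch_weighted_token_ratio <- function(a, b, weights = list()) {
  # Assumption: tokens are lower-cased, whitespace-separated runs.
  tokenise <- function(x) {
    tokens <- strsplit(tolower(trimws(x)), "[[:space:]]+")[[1]]
    tokens[nzchar(tokens)]
  }
  a_tokens <- tokenise(a)
  b_tokens <- tokenise(b)

  # Assumption: tokens are paired by position; pad the shorter side with "".
  n <- max(length(a_tokens), length(b_tokens))
  if (n == 0) return(1)  # two empty inputs are treated as identical
  a_tokens <- c(a_tokens, rep("", n - length(a_tokens)))
  b_tokens <- c(b_tokens, rep("", n - length(b_tokens)))

  # Levenshtein distance for each pair of tokens.
  distances <- mapply(
    function(x, y) as.numeric(utils::adist(x, y)),
    a_tokens, b_tokens, USE.NAMES = FALSE
  )

  # Assumption: the largest possible distance is the longer token's length.
  max_distances <- pmax(nchar(a_tokens), nchar(b_tokens))

  # Tokens missing from `weights` default to a weight of 1.
  weight_of <- function(token) {
    if (nzchar(token) && token %in% names(weights)) weights[[token]] else 1
  }
  # Assumption: a pair's weight is the mean of its two token weights.
  pair_weights <- mapply(
    function(x, y) mean(c(weight_of(x), weight_of(y))),
    a_tokens, b_tokens, USE.NAMES = FALSE
  )

  1 - sum(pair_weights * distances) / sum(pair_weights * max_distances)
}

# "jim"/"tim" differ by 1 of 3 characters; "ltd"/"ltd" by 0 of 3.
print(sketch_weighted_token_ratio("jim ltd", "tim ltd"))
# 1 - 1/6, about 0.833

# Down-weighting "ltd" shrinks the denominator, so the "jim"/"tim"
# difference counts for more.
print(sketch_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1)))
# 1 - 1/3.3, about 0.697
```

Under these assumptions, lowering the weight of "ltd" reduces the score, because the token that matches exactly no longer props up the similarity as much.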

See Also

Other weighted token functions: lev_weighted_token_set_ratio(), lev_weighted_token_sort_ratio()

Examples

lev_weighted_token_ratio("jim ltd", "tim ltd")

lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))

levitate documentation built on Oct. 1, 2023, 1:08 a.m.