In armandvalsesia/Foodmapping: Food Items Mapping using Fuzzy Matching

library(pander)
library(Foodmapping)

Compute fuzzy matching similarity scores using Partial Token Sort Ratio

The v_fz_tk_sort_r compute similarities between two strings (e.g.two food item names), using the Partial_token_sort_ratio method. This method returns the ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing. Applied to the comparison of food names, a ratio close to 100 indicates that one food name is fully found in the second food name.

v_fz_tk_sort_r( "Tomatoes" , "raw. tomatoes" )

High ratio reflect high similarity between food names:

v_fz_tk_sort_r( "Tomatoe and basil soup" , "Tomatoe soup with basil" )

Conversely, smaller ratios indicate no / little similarities between food names

v_fz_tk_sort_r( "Tomatoes" , "Carot soup" )

Compute fuzzy similarities from a large list of food names

The v_fz_tk_sort_r function can be applied for element-wise comparisons between two vectors.

data(food_sample)

food_names1 <- food_sample$FOODNAME_ENG
food_names2 <- food_sample$FOODNAME_ENG_COMP

scores <- v_fz_tk_sort_r( food_names1 , food_names2)

results <- data.frame(
  name1 = food_names1,
  name2 = food_names2,
  score = scores
)
pander(head(results), split.table = Inf )

The v_fz_tk_sort_r is optimized for a large number of comparison and compute the similarity metrics in parallel.

# make a large example of  10'000 comparisons - takes < 5 seconds
large_example <- expand.grid( A = food_names1[ 1:100 ] , B  = food_names2[ 1:100 ] ) # create all pairwise comparisons
dim( large_example )
system.time(
  scores <- v_fz_tk_sort_r( as.character( large_example$A ), as.character( large_example$B ) )
)