cluster_strings: Cluster Strings by Edit-Distance


Description

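Clusters a vector of character strings by edit distance: strings within max_dist of one another are treated as related, and clusters are derived from those relations using the chosen algo.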

Usage

cluster_strings(s_vec, clean = T, method = "osa", max_dist = 3,
  algo = "cc")

Arguments

s_vec

a vector of character strings

clean

whether to space-squish and de-duplicate s_vec

method

one of "osa" (optimal string alignment), "lv" (Levenshtein), or "dl" (Damerau-Levenshtein), as in the 'stringdist' package

max_dist

maximum edit distance (typically Damerau-Levenshtein) at which two strings are considered related

algo

one of "cc" (connected components) or "eb" (edge betweenness)
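
The arguments above suggest a pipeline of cleaning, pairwise distances, thresholding, and graph clustering. The following is a minimal base-R sketch of that idea for the "lv" / "cc" combination only. It is not the clustringr implementation: it assumes that "space-squish" means collapsing runs of whitespace and trimming the ends, it substitutes utils::adist for 'stringdist', and it finds connected components with a hand-written flood fill.

```r
# Illustrative sketch only; not the clustringr implementation.
raw <- c("alcool", "alcohol", "alcoholic ", "brandy", "brandie",
         "brandy", "cacha\u00e7a")  # "\u00e7" is c-cedilla
max_dist <- 3

# Assumed meaning of clean = TRUE: squish whitespace, then de-duplicate.
s_vec <- unique(trimws(gsub("[[:space:]]+", " ", raw)))

# Pairwise Levenshtein distances (base R stand-in for method = "lv").
d <- utils::adist(s_vec)

# Two strings are linked when their distance is at most max_dist.
adj <- d <= max_dist

# Label connected components by flood fill over the adjacency matrix.
cluster <- rep(NA_integer_, length(s_vec))
next_id <- 0L
for (i in seq_along(s_vec)) {
  if (!is.na(cluster[i])) next
  next_id <- next_id + 1L
  frontier <- i
  while (length(frontier) > 0) {
    cluster[frontier] <- next_id
    neighbours <- which(colSums(adj[frontier, , drop = FALSE]) > 0)
    frontier <- neighbours[is.na(cluster[neighbours])]
  }
}

# Strings grouped by component label.
split(s_vec, cluster)
```

With max_dist = 3, "alcool" and "alcoholic" are exactly three edits apart, so they fall in the same component. Lowering max_dist to 2 would still connect them indirectly through "alcohol", which is the chaining behaviour characteristic of connected components.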

Value

an object whose df_clusters element is a data frame (tibble) giving the cluster membership of each input string, with columns cluster, size and node (see Examples)

Examples

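# Cluster six drink names by Levenshtein distance using connected components
s_vec <- c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cachaça")
s_clust <- cluster_strings(s_vec, method = "lv", max_dist = 3, algo = "cc")
# Cluster assignments are stored in the df_clusters element
s_clust$df_clusters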

Example output

# A tibble: 6 x 3
  cluster  size node     
    <int> <int> <chr>    
1       1     3 alcohol  
2       1     3 alcoholic
3       1     3 alcool   
4       2     2 brandie  
5       2     2 brandy   
6       3     1 cachaça  
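
A cluster table like the one above is often used to map each variant spelling to a single label. The sketch below shows one way that might be done in base R. The table is re-typed by hand from the example output so the code runs without the package, and choosing the alphabetically first member as the label is a hypothetical convention, not something clustringr provides.

```r
# Cluster table copied by hand from the example output above.
df_clusters <- data.frame(
  cluster = c(1L, 1L, 1L, 2L, 2L, 3L),
  size    = c(3L, 3L, 3L, 2L, 2L, 1L),
  node    = c("alcohol", "alcoholic", "alcool",
              "brandie", "brandy", "cacha\u00e7a"),
  stringsAsFactors = FALSE
)

# Hypothetical convention: label each cluster by its alphabetically first member.
representative <- tapply(df_clusters$node, df_clusters$cluster,
                         function(x) sort(x)[1])

# Attach the label to every row, giving a variant -> label lookup.
df_clusters$label <- as.vector(representative[as.character(df_clusters$cluster)])
df_clusters[, c("node", "label")]
```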

clustringr documentation built on May 1, 2019, 9:23 p.m.