knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  gganimate = list(
    nframes = 50
  )
)
clustringr clusters a vector of strings into groups of small mutual "edit distance" (see stringdist), using graph algorithms. The method is unsupervised, i.e., you do not need to specify the number of clusters in advance. Graph visualization of the results is provided.
Currently, a development version is available on GitHub.
# install.packages('devtools')
devtools::install_github('dan-reznik/clustringr')
In the example below, a vector of 9 strings is clustered into 4 groups by Levenshtein distance and connected components. The call to cluster_strings() returns a list with 3 elements, the last of which is df_clusters, which associates each input string with a cluster and that cluster's size.
library(clustringr)
s_vec <- c("alcool", "alcohol", "alcoholic",
           "brandy", "brandie", "cachaça",
           "whisky", "whiskie", "whiskers")
s_clust <- cluster_strings(s_vec,          # input vector
                           clean = TRUE,   # dedup and squish
                           method = "lv",  # Levenshtein
                           # use method = "dl" (Damerau-Levenshtein) or "osa" (optimal string alignment)
                           max_dist = 3,   # max edit distance for neighbors
                           algo = "cc"     # connected components
                           # use algo = "eb" for edge betweenness
                           )
s_clust$df_clusters
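To make the idea of "Levenshtein distance plus connected components" concrete, here is a minimal base-R sketch of the concept. It is not clustringr's internal code: it uses utils::adist instead of stringdist, a hand-rolled label propagation instead of a graph library, and it assumes two strings count as neighbors when their distance is less than or equal to max_dist.

```r
# Conceptual sketch only; NOT how clustringr is implemented internally.
s_vec <- c("alcool", "alcohol", "alcoholic",
           "brandy", "brandie", "cachaça",
           "whisky", "whiskie", "whiskers")

# Pairwise Levenshtein distances (utils::adist ships with base R).
d <- utils::adist(s_vec)

# Assumption: an edge links two strings whose distance is <= max_dist.
max_dist <- 3
adj <- d <= max_dist

# Connected components by label propagation: every node repeatedly takes
# the smallest label among itself and its neighbors until nothing changes.
labels <- seq_along(s_vec)
repeat {
  new_labels <- vapply(seq_along(s_vec),
                       function(i) min(labels[adj[i, ]]),
                       integer(1))
  if (identical(new_labels, labels)) break
  labels <- new_labels
}

# Renumber the components 1..k and list the members of each one.
cluster_id <- match(labels, unique(labels))
split(s_vec, cluster_id)
```

Under the stated assumption, the strings fall into the same number of groups mentioned above, with "cachaça" left on its own.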
To view a graph of the clusters, simply pass the structure returned by cluster_strings to cluster_plot:
cluster_plot(s_clust
             ,min_cluster_size=1
             # ,label_size=2.5 # size of node labels
             # ,repel=T        # whether labels should be repelled
             )
The clustringr package comes with quijote_words, a ~22k row data frame of the unique words (in Spanish) in Miguel de Cervantes' "Don Quijote". Full text can be obtained here.
Let's sample these words into a smaller subset:
library(dplyr)
quijote_words_sampled <- clustringr::quijote_words %>%
  filter(between(freq, 8, 11), len > 6) %>%
  pull("word")
quijote_words_sampled %>% length
Now let's cluster these and view the results as a graph-plot, showing only those clusters with at least 3 elements:
quijote_words_sampled %>%
  cluster_strings(method = "lv", max_dist = 2) %>%
  cluster_plot(min_cluster_size = 3)
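The two examples use different max_dist values (3 and 2), so it can help to see how the threshold changes the number of groups. The sketch below uses base R only and is not clustringr's implementation: it relies on the fact that cutting a single-linkage dendrogram at height h gives the connected components of the graph that links every pair at distance h or less (assuming, again, that the threshold is inclusive).

```r
# Conceptual sketch only; base R, no clustringr calls.
s_vec <- c("alcool", "alcohol", "alcoholic",
           "brandy", "brandie", "cachaça",
           "whisky", "whiskie", "whiskers")

# Pairwise Levenshtein distances as a 'dist' object.
d <- stats::as.dist(utils::adist(s_vec))

# Single linkage merges two groups as soon as any pair is close enough,
# which mirrors connected components on a thresholded distance graph.
tree <- stats::hclust(d, method = "single")

# Count the groups obtained for a few candidate thresholds.
thresholds <- 1:4
n_clusters <- vapply(thresholds,
                     function(h) length(unique(stats::cutree(tree, h = h))),
                     integer(1))
names(n_clusters) <- paste0("max_dist=", thresholds)
n_clusters
```

A smaller threshold splits the words into more, tighter groups; a larger one merges them, so it is worth trying a few values before plotting a big vocabulary.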
Happy clustering!