knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  gganimate = list(
    nframes = 50
  )
)

clustringr

clustringr clusters a vector of strings into groups with small mutual "edit distance" (see stringdist), using graph algorithms. The method is unsupervised: you do not need to specify the number of clusters in advance. A graph visualization of the results is also provided.
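To make "edit distance" concrete, here is a minimal sketch using base R's `utils::adist`, which computes the Levenshtein distance (the number of single-character insertions, deletions and substitutions needed to turn one string into another). It is only a stand-in for illustration: clustringr itself refers to stringdist for its distances, and the three words are borrowed from the usage example further down.

```r
# Pairwise Levenshtein distances with base R (illustrative stand-in,
# not the code path clustringr uses internally).
words <- c("alcool", "alcohol", "alcoholic")

d <- utils::adist(words)            # symmetric matrix of edit distances
dimnames(d) <- list(words, words)   # label rows and columns for readability

print(d)
```

Strings whose distance falls under a chosen threshold are treated as neighbors, and clusters are then read off the resulting graph.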

Installation

A development version is currently available on GitHub.

# install.packages('devtools')
devtools::install_github('dan-reznik/clustringr')

Usage

In the example below, a vector of 9 strings is clustered into 4 groups using Levenshtein distance and connected components. The call to cluster_strings() returns a list with 3 elements. The last of these, df_clusters, associates every input string with a cluster and reports that cluster's size.

library(clustringr)
s_vec <- c("alcool",
           "alcohol",
           "alcoholic",
           "brandy",
           "brandie",
           "cachaça",
           "whisky",
           "whiskie",
           "whiskers")
s_clust <- cluster_strings(
  s_vec,           # input vector
  clean = TRUE,    # dedup and squish
  method = "lv",   # Levenshtein
                   # use method = "dl" (Damerau-Levenshtein) or "osa" (optimal string alignment)
  max_dist = 3,    # max edit distance for neighbors
  algo = "cc"      # connected components
                   # use algo = "eb" for edge betweenness
)
s_clust$df_clusters
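The idea behind the call above can be sketched in base R. This is not clustringr's implementation: `cluster_sketch` is a hypothetical helper, `utils::adist` stands in for the stringdist-based distance, the threshold is assumed to be inclusive (distance <= max_dist), and the output columns are invented for illustration. Strings within the threshold are linked, and each connected component of the resulting graph becomes one cluster.

```r
# Hypothetical base-R sketch of "edit distance + connected components".
# Assumptions: Levenshtein via utils::adist, inclusive threshold.
cluster_sketch <- function(strings, max_dist = 3) {
  strings <- unique(strings)
  d <- utils::adist(strings)       # pairwise Levenshtein distances
  adj <- d <= max_dist             # adjacency matrix: TRUE means "neighbors"
  n <- length(strings)
  label <- rep(NA_integer_, n)     # cluster id per string, NA = unvisited
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(label[start])) next # already assigned to a component
    current <- current + 1L
    label[start] <- current
    queue <- start
    while (length(queue) > 0) {    # breadth-first search from `start`
      node <- queue[1]
      queue <- queue[-1]
      nbrs <- which(adj[node, ] & is.na(label))
      label[nbrs] <- current
      queue <- c(queue, nbrs)
    }
  }
  data.frame(string = strings,
             cluster = label,
             size = ave(label, label, FUN = length),  # members per cluster
             stringsAsFactors = FALSE)
}

res <- cluster_sketch(c("alcool", "alcohol", "alcoholic",
                        "brandy", "brandie", "cachaça",
                        "whisky", "whiskie", "whiskers"),
                      max_dist = 3)
print(res)
```

Because components are transitive, two strings can share a cluster even when their direct distance exceeds the threshold, as long as a chain of close neighbors connects them.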

Cluster Visualization

To view a graph of the clusters, simply pass the structure returned by cluster_strings to cluster_plot:

cluster_plot(
  s_clust,
  min_cluster_size = 1
  # , label_size = 2.5  # size of node labels
  # , repel = TRUE      # whether labels should be repelled
)

Supplied Data Set: Don Quijote's unique words

The clustringr package comes with quijote_words, a ~22k row data frame of the unique words (in Spanish) in Miguel de Cervantes' "Don Quijote". Full text can be obtained here.

Let's narrow these words down to a smaller subset, keeping only those that occur 8 to 11 times and are longer than 6 characters:

library(dplyr)
quijote_words_sampled <- clustringr::quijote_words %>%
  filter(between(freq, 8, 11), len > 6) %>%
  pull(word)
length(quijote_words_sampled)

Now let's cluster these words and plot the result as a graph, showing only clusters with at least 3 elements:

quijote_words_sampled %>%
  cluster_strings(method="lv",max_dist=2) %>%
  cluster_plot(min_cluster_size=3)

Happy clustering!



dan-reznik/clustringr documentation built on May 20, 2019, 12:35 p.m.