tidystringdist

Compute string distance the tidy way. Built on top of the ‘stringdist’ package.

Install tidystringdist

The development version is available on GitHub:

devtools::install_github("ColinFay/tidystringdist")

The stable version is available from CRAN:

install.packages("tidystringdist")

tidystringdist basic workflow

tidy_comb

First, create a tibble with the combinations of words you want to compare, using the tidy_comb and tidy_comb_all functions. tidy_comb takes a base word and combines it with each element of a list or of a data.frame column; tidy_comb_all builds every possible pair from a list or a column.

If you already have a data.frame with two columns containing the strings to compare, you can skip this part.

library(tidystringdist)

tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#>      V1    V2
#> * <chr> <chr>
#> 1     A     B
#> 2     A     C
#> 3     B     C
tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#>           V1         V2
#> *      <chr>      <chr>
#> 1     setosa versicolor
#> 2     setosa  virginica
#> 3 versicolor  virginica
tidy_comb("Paris", state.name[1:3])
#> # A tibble: 3 x 2
#>        V1    V2
#> *   <chr> <chr>
#> 1 Alabama Paris
#> 2  Alaska Paris
#> 3 Arizona Paris
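
To make "every possible pair" concrete, the following base R sketch enumerates the same unordered pairs with combn. It only illustrates the idea and is not the package's actual implementation; the names words and pairs are made up for the example.

```r
# Illustration only: enumerate every unordered pair of a character vector
# with base R. This is not how tidy_comb_all is implemented internally.
words <- LETTERS[1:3]

# combn() returns one combination per column, so transpose to get one per row
pairs <- as.data.frame(t(combn(words, 2)), stringsAsFactors = FALSE)
names(pairs) <- c("V1", "V2")

pairs
```

For n distinct strings this yields n * (n - 1) / 2 rows, which is why the three inputs above give three pairs.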

tidy_stringdist

Once you have this data.frame, use tidy_stringdist to compute the string distances. The function takes a data.frame, the two columns containing the strings, and one or more stringdist methods.

Note that if you used tidy_comb or tidy_comb_all to create your data.frame, you don't need to set the column names.

library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw)
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
#> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
#> Results may be unreliable. See ?printable_ascii.
#> # A tibble: 3,741 x 12
#>                V1                 V2   osa    lv    dl hamming   lcs qgram
#>  *          <chr>              <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
#>  1 Luke Skywalker              C-3PO    14    14    14     Inf    19    19
#>  2 Luke Skywalker              R2-D2    14    14    14     Inf    19    19
#>  3 Luke Skywalker        Darth Vader    11    11    11     Inf    17    17
#>  4 Luke Skywalker        Leia Organa    11    11    11     Inf    17    15
#>  5 Luke Skywalker          Owen Lars    12    12    12     Inf    15    11
#>  6 Luke Skywalker Beru Whitesun lars    16    16    16     Inf    22    18
#>  7 Luke Skywalker              R5-D4    14    14    14     Inf    19    19
#>  8 Luke Skywalker  Biggs Darklighter    13    13    13     Inf    21    19
#>  9 Luke Skywalker     Obi-Wan Kenobi    14    14    14      14    24    22
#> 10 Luke Skywalker   Anakin Skywalker     5     5     5     Inf     8     8
#> # ... with 3,731 more rows, and 4 more variables: cosine <dbl>,
#> #   jaccard <dbl>, jw <dbl>, soundex <dbl>

By default, all methods are computed. To compute only specific ones, use the method argument:

tidy_stringdist(tidy_comb_sw, method = c("osa","jw"))
#> # A tibble: 3,741 x 4
#>                V1                 V2   osa        jw
#>  *          <chr>              <chr> <dbl>     <dbl>
#>  1 Luke Skywalker              C-3PO    14 1.0000000
#>  2 Luke Skywalker              R2-D2    14 1.0000000
#>  3 Luke Skywalker        Darth Vader    11 0.5752165
#>  4 Luke Skywalker        Leia Organa    11 0.5335498
#>  5 Luke Skywalker          Owen Lars    12 0.4624339
#>  6 Luke Skywalker Beru Whitesun lars    16 0.4656085
#>  7 Luke Skywalker              R5-D4    14 1.0000000
#>  8 Luke Skywalker  Biggs Darklighter    13 0.5728291
#>  9 Luke Skywalker     Obi-Wan Kenobi    14 0.6349206
#> 10 Luke Skywalker   Anakin Skywalker     5 0.2816558
#> # ... with 3,731 more rows
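
When the strings already sit in a data.frame whose columns are not named V1 and V2, the two columns have to be named in the call. The sketch below assumes the column arguments are called v1 and v2 and accept unquoted column names; check ?tidy_stringdist for your installed version. The cities data is invented for the example.

```r
library(tidystringdist)

# Hypothetical data: two columns of strings to compare, not named V1/V2
cities <- data.frame(
  typed    = c("Pariss", "Londn", "Berlin"),
  expected = c("Paris", "London", "Berlin"),
  stringsAsFactors = FALSE
)

# Assumption: the column arguments are named v1 and v2 (see ?tidy_stringdist)
res <- tidy_stringdist(cities, v1 = typed, v2 = expected, method = "osa")
res
```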

Tidyverse workflow

The goal is to provide a convenient interface to work with other tools from the tidyverse.

tidy_stringdist(tidy_comb_sw, method = "osa") %>%
  filter(osa > 20) %>%
  arrange(desc(osa))
#> # A tibble: 11 x 3
#>                       V1                    V2   osa
#>                    <chr>                 <chr> <dbl>
#>  1                 C-3PO Jabba Desilijic Tiure    21
#>  2                 C-3PO Wicket Systri Warrick    21
#>  3                 R2-D2 Wicket Systri Warrick    21
#>  4                 R5-D4 Wicket Systri Warrick    21
#>  5 Jabba Desilijic Tiure                 IG-88    21
#>  6 Jabba Desilijic Tiure                 Cordé    21
#>  7 Jabba Desilijic Tiure                R4-P17    21
#>  8 Jabba Desilijic Tiure                   BB8    21
#>  9                 IG-88 Wicket Systri Warrick    21
#> 10 Wicket Systri Warrick                R4-P17    21
#> 11 Wicket Systri Warrick                   BB8    21
starwars %>%
  filter(species == "Droid") %>%
  tidy_comb_all(name) %>%
  tidy_stringdist() %>% 
  summarise_if(is.numeric, mean)
#> # A tibble: 1 x 10
#>     osa    lv    dl hamming   lcs qgram    cosine   jaccard        jw
#>   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>     <dbl>     <dbl>     <dbl>
#> 1   4.4   4.4   4.4     Inf   7.4   7.4 0.8304896 0.8671032 0.6422222
#> # ... with 1 more variables: soundex <dbl>
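
One possible use of this workflow is fuzzy matching: combine one input with a set of candidates, compute a distance, and keep the closest candidate. The sketch below uses made-up data (the candidates vector and the misspelled "Pariss") and relies only on the functions shown above plus standard dplyr verbs.

```r
library(tidystringdist)
library(dplyr)

# Made-up candidate list; "Pariss" is a deliberately misspelled input
candidates <- c("Paris", "London", "Berlin", "Parma")

closest <- tidy_comb("Pariss", candidates) %>%
  tidy_stringdist(method = "osa") %>%
  arrange(osa) %>%   # smallest distance first
  slice(1)           # keep only the best match

closest
```

Edit-based distances such as osa grow with string length, so a normalised method such as jw may be a better choice when the candidates differ a lot in length.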

Contact

Questions and feedback welcome!



tidystringdist documentation built on May 2, 2019, 3:23 p.m.