Compute string distance the tidy way. Built on top of the ‘stringdist’ package.
You can install the development version from GitHub:
devtools::install_github("ColinFay/tidystringdist")
The stable version is available from CRAN:
install.packages("tidystringdist")
First, you need to create a tibble with the combinations of words you
want to compare. You can do this with the tidy_comb and tidy_comb_all
functions. The first takes a base word and combines it with each element
of a list or a column of a data.frame; the second combines all the
possible pairs from a list or a column.
If you already have a data.frame with two columns containing the strings to compare, you can skip this part.
library(tidystringdist)
tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 A B
#> 2 A C
#> 3 B C
tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 setosa versicolor
#> 2 setosa virginica
#> 3 versicolor virginica
tidy_comb("Paris", state.name[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 Alabama Paris
#> 2 Alaska Paris
#> 3 Arizona Paris
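To make the pairing logic concrete, the following base R sketch reproduces the shape of the tidy_comb_all output for a character vector. It is an illustration built on combn(), not the package's actual implementation, and the helper name all_pairs is hypothetical.

```r
# Hypothetical helper (not part of tidystringdist): build every unordered
# pair of distinct values from a character vector, one pair per row.
all_pairs <- function(x) {
  # Drop duplicates so repeated values (e.g. a column of a data.frame)
  # do not produce repeated pairs
  x <- unique(as.character(x))
  # combn() returns a 2-row matrix with one combination per column
  m <- combn(x, 2)
  data.frame(V1 = m[1, ], V2 = m[2, ], stringsAsFactors = FALSE)
}

all_pairs(LETTERS[1:3])
```

For n distinct values this yields n * (n - 1) / 2 rows, which is why three letters, or the three iris species, give three pairs.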
Once you’ve got this data.frame, you can use tidy_stringdist to
compute string distance. This function takes a data.frame, the two
columns containing the strings, and a stringdist method.
Note that if you’ve used the tidy_comb function to create your
data.frame, you won’t need to set the column names.
library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw)
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
#> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
#> Results may be unreliable. See ?printable_ascii.
#> # A tibble: 3,741 x 12
#> V1 V2 osa lv dl hamming lcs qgram
#> * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Luke Skywalker C-3PO 14 14 14 Inf 19 19
#> 2 Luke Skywalker R2-D2 14 14 14 Inf 19 19
#> 3 Luke Skywalker Darth Vader 11 11 11 Inf 17 17
#> 4 Luke Skywalker Leia Organa 11 11 11 Inf 17 15
#> 5 Luke Skywalker Owen Lars 12 12 12 Inf 15 11
#> 6 Luke Skywalker Beru Whitesun lars 16 16 16 Inf 22 18
#> 7 Luke Skywalker R5-D4 14 14 14 Inf 19 19
#> 8 Luke Skywalker Biggs Darklighter 13 13 13 Inf 21 19
#> 9 Luke Skywalker Obi-Wan Kenobi 14 14 14 14 24 22
#> 10 Luke Skywalker Anakin Skywalker 5 5 5 Inf 8 8
#> # ... with 3,731 more rows, and 4 more variables: cosine <dbl>,
#> # jaccard <dbl>, jw <dbl>, soundex <dbl>
The default call computes all the methods. You can select specific methods with
the method argument:
tidy_stringdist(tidy_comb_sw, method = c("osa","jw"))
#> # A tibble: 3,741 x 4
#> V1 V2 osa jw
#> * <chr> <chr> <dbl> <dbl>
#> 1 Luke Skywalker C-3PO 14 1.0000000
#> 2 Luke Skywalker R2-D2 14 1.0000000
#> 3 Luke Skywalker Darth Vader 11 0.5752165
#> 4 Luke Skywalker Leia Organa 11 0.5335498
#> 5 Luke Skywalker Owen Lars 12 0.4624339
#> 6 Luke Skywalker Beru Whitesun lars 16 0.4656085
#> 7 Luke Skywalker R5-D4 14 1.0000000
#> 8 Luke Skywalker Biggs Darklighter 13 0.5728291
#> 9 Luke Skywalker Obi-Wan Kenobi 14 0.6349206
#> 10 Luke Skywalker Anakin Skywalker 5 0.2816558
#> # ... with 3,731 more rows
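For readers who want to see what a single distance column represents, here is a minimal base R sketch that computes the Levenshtein distance (the lv method) row by row with adist() from the utils package. It is only an illustration under that assumption: tidy_stringdist itself relies on stringdist, and the helper name add_lv is hypothetical.

```r
# Hypothetical helper: add an "lv" column to a two-column data.frame of strings.
add_lv <- function(df, v1 = "V1", v2 = "V2") {
  # adist() returns a matrix; for a pair of single strings it is 1 x 1,
  # so as.numeric() reduces it to a plain number
  df$lv <- mapply(
    function(a, b) as.numeric(utils::adist(a, b)),
    df[[v1]], df[[v2]],
    USE.NAMES = FALSE
  )
  df
}

pairs <- data.frame(
  V1 = c("Luke Skywalker", "Luke Skywalker"),
  V2 = c("Anakin Skywalker", "C-3PO"),
  stringsAsFactors = FALSE
)
add_lv(pairs)
```

The two rows should come out as 5 and 14, in line with the lv values shown for the same pairs in the full table above.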
The goal is to provide a convenient interface to work with other tools from the tidyverse.
tidy_stringdist(tidy_comb_sw, method = "osa") %>%
  filter(osa > 20) %>%
  arrange(desc(osa))
#> # A tibble: 11 x 3
#> V1 V2 osa
#> <chr> <chr> <dbl>
#> 1 C-3PO Jabba Desilijic Tiure 21
#> 2 C-3PO Wicket Systri Warrick 21
#> 3 R2-D2 Wicket Systri Warrick 21
#> 4 R5-D4 Wicket Systri Warrick 21
#> 5 Jabba Desilijic Tiure IG-88 21
#> 6 Jabba Desilijic Tiure Cordé 21
#> 7 Jabba Desilijic Tiure R4-P17 21
#> 8 Jabba Desilijic Tiure BB8 21
#> 9 IG-88 Wicket Systri Warrick 21
#> 10 Wicket Systri Warrick R4-P17 21
#> 11 Wicket Systri Warrick BB8 21
starwars %>%
  filter(species == "Droid") %>%
  tidy_comb_all(name) %>%
  tidy_stringdist() %>%
  summarise_if(is.numeric, mean)
#> # A tibble: 1 x 10
#> osa lv dl hamming lcs qgram cosine jaccard jw
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4.4 4.4 4.4 Inf 7.4 7.4 0.8304896 0.8671032 0.6422222
#> # ... with 1 more variables: soundex <dbl>
Questions and feedback are welcome!