```r
library(knitr)
opts_chunk$set(cache = TRUE, message = FALSE)
```
Often you find yourself with a set of words that you want to match against a "dictionary": it could be a literal dictionary (as in this case) or a domain-specific category system. But you want to allow for small differences in spelling or punctuation.
The fuzzyjoin package comes with a set of common misspellings (from Wikipedia):
```r
library(dplyr)
library(fuzzyjoin)

data(misspellings)
misspellings
```
```r
# use the dictionary of words from the qdapDictionaries package,
# which is based on the Nettalk corpus
library(qdapDictionaries)

words <- tbl_df(DICTIONARY)
words
```
As an example, we'll pick 1000 of these misspellings (you could try it on all of them, though), and use `stringdist_inner_join` to join them against our dictionary.
```r
set.seed(2016)

sub_misspellings <- misspellings %>%
  sample_n(1000)
```
```r
joined <- sub_misspellings %>%
  stringdist_inner_join(words, by = c(misspelling = "word"), max_dist = 1)
```
By default, `stringdist_inner_join` uses optimal string alignment (a restricted form of the Damerau–Levenshtein distance), and here we're setting a maximum distance of 1 for the join. Notice that rows have been joined wherever `misspelling` is close to (but not equal to) `word`:
```r
joined
```
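To make the metric concrete, here's a minimal sketch using the stringdist package (which fuzzyjoin uses under the hood); the strings are invented for illustration. Optimal string alignment counts an adjacent transposition as a single edit, while plain Levenshtein distance counts it as two:

```r
library(stringdist)

# "awya" vs. "away": the last two characters are transposed
osa_dist <- stringdist("awya", "away", method = "osa")  # one transposition
lv_dist  <- stringdist("awya", "away", method = "lv")   # two substitutions

osa_dist  # 1
lv_dist   # 2
```

This is why common typos like swapped letters still fall within `max_dist = 1` under the default metric.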
Note that there are some redundancies: misspellings that are within the distance threshold of multiple dictionary words. These end up with one row per "guess" in the output. How many misspellings did we classify?
```r
joined %>%
  count(misspelling, correct)
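The redundancy is easy to reproduce: a single misspelling can lie within distance 1 of several dictionary words at once. A toy sketch with stringdist (the words here are invented for illustration):

```r
library(stringdist)

# one misspelling, three dictionary words all one edit away
dists <- stringdist("hte", c("the", "he", "ate"), method = "osa")
dists  # 1 1 1: a transposition, a deletion, and a substitution
```

A join with `max_dist = 1` would therefore produce three rows for this one misspelling.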
So we found a match in the dictionary for about half of the misspellings. In how many of the ones we classified did we get at least one of our guesses right?
```r
which_correct <- joined %>%
  group_by(misspelling, correct) %>%
  summarize(guesses = n(), one_correct = any(correct == word))

which_correct

# percentage of guesses getting at least one right
mean(which_correct$one_correct)

# number uniquely correct (out of the original 1000)
sum(which_correct$guesses == 1 & which_correct$one_correct)
```
Not bad.
Note that `stringdist_inner_join` is not the only function we can use. If we're interested in including the misspellings we couldn't classify, we could have used `stringdist_left_join`:
```r
left_joined <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 1)

left_joined

left_joined %>%
  filter(is.na(word))
```
(To get just the ones without matches immediately, we could have used `stringdist_anti_join`.) If we increase our distance threshold, we'll increase the fraction with a correct guess, but we'll also get more false-positive guesses:
```r
left_joined2 <- sub_misspellings %>%
  stringdist_left_join(words, by = c(misspelling = "word"), max_dist = 2)

left_joined2

left_joined2 %>%
  filter(is.na(word))
```
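As a rough sketch of what the anti-join computes: keep only the rows whose minimum distance to every dictionary word exceeds the threshold. The data below is made up for illustration, and the real `stringdist_anti_join` handles this inside the join machinery:

```r
library(stringdist)

misspelled <- c("teh", "zzzqqq")
dict <- c("the", "tea")

# distance from every misspelling to every dictionary word
d <- stringdistmatrix(misspelled, dict, method = "osa")

# rows with no dictionary word within distance 1 have no match
no_match <- misspelled[apply(d, 1, min) > 1]
no_match  # "zzzqqq"
```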
Most of the missing words here simply aren't in our dictionary.
You can try other distance thresholds, other dictionaries, and other distance metrics (see `stringdist-metrics` in the stringdist package for more). This approach is especially useful on a domain-specific dataset, such as free-form survey input that is likely to be close to one of a handful of canonical responses.
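For that survey use case, the stringdist package's `amatch` (an approximate version of base R's `match`) gives a quick sketch; the responses and canonical categories below are invented:

```r
library(stringdist)

canonical <- c("chicago", "new york", "boston")
responses <- c("chicagoo", "new yrok", "springfield")

# index of the closest canonical answer within 2 edits, or NA
idx <- amatch(responses, canonical, method = "osa", maxDist = 2)
canonical[idx]  # "chicago" "new york" NA
```

Responses with no canonical answer within the threshold come back as `NA`, so they can be routed to manual review.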