Hunspell Spell Checking and Morphological Analysis
The hunspell function is a high-level wrapper for finding spelling
errors within a text document. It takes a character vector with text in one of
the supported formats (text, man, latex, html, or xml), parses out the words,
and returns a list with the incorrect words for each line. It effectively combines
hunspell_parse with hunspell_check in a single step.
Other functions in the package operate on individual words; see details.
hunspell(text, format = c("text", "man", "latex", "html", "xml"),
  dict = dictionary("en_US"), ignore = en_stats)

hunspell_parse(text, format = c("text", "man", "latex", "html", "xml"),
  dict = dictionary("en_US"))

hunspell_check(words, dict = dictionary("en_US"))

hunspell_suggest(words, dict = dictionary("en_US"))

hunspell_analyze(words, dict = dictionary("en_US"))

hunspell_stem(words, dict = dictionary("en_US"))

hunspell_info(dict = dictionary("en_US"))

dictionary(lang = "en_US", affix = NULL, cache = TRUE)
text: character vector with arbitrary input text
format: input format; supported parsers are "text", "man", "latex", "html" and "xml"
dict: a dictionary object or string which can be passed to dictionary
ignore: character vector with additional approved words added to the dictionary
words: character vector with individual words to spell check
lang: dictionary file or language, see details
affix: file path to the corresponding affix file; if NULL, it is assumed to be in the same directory as the dictionary file
cache: speed up loading of dictionaries by caching
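
The ignore argument is not exercised in the examples below; the following is a
minimal sketch of whitelisting a custom term ("tokenizr" is a made-up word used
purely for illustration):

library(hunspell)

# "tokenizr" would normally be flagged as a misspelling; passing it
# via `ignore` adds it to the approved words for this call
bad <- hunspell("the tokenizr splits text into words", ignore = "tokenizr")
print(bad[[1]])  # character(0): nothing left to flag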
Hunspell uses a special dictionary format that defines which stems and affixes are
valid in a given language. The
hunspell_analyze function shows how a
word breaks down into a valid stem plus affix. The hunspell_stem
function is similar but only returns valid stems for a given word. Stemming can be
used to summarize text (e.g. in a wordcloud). The hunspell_check function
takes a vector of individual words and tests each one for correctness. Finally,
hunspell_suggest is used to suggest correct alternatives for each
(incorrect) input word.
Because spell checking is usually done on a document, the package includes some
parsers to extract words from various common formats. With hunspell_parse
we can parse plain-text, latex and man format. R also has a few built-in parsers
such as RdTextFilter and SweaveTeXFilter; see also ?aspell.
The package searches for dictionaries in the working directory as well as in the
standard system locations. Additional search paths can be specified by setting the
DICPATH environment variable. A US English dictionary (en_US) is
included with the package; other dictionaries need to be installed by the system.
Most operating systems already include compatible dictionaries with names such as
hunspell-en-gb or myspell-en-gb.
To manually install dictionaries, copy the corresponding .aff and
.dic files to ~/Library/Spelling or a custom directory specified in DICPATH.
Alternatively you can pass the entire path to the
.dic file as the dict
parameter. Popular sources of dictionaries include SCOWL and the
OpenOffice/LibreOffice dictionary collections.
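
As a sketch of the two loading methods described above (the de_DE paths are
hypothetical and only work once that dictionary file is actually installed):

library(hunspell)

# Hypothetical: a German dictionary placed in ~/my_dicts
# Sys.setenv(DICPATH = "~/my_dicts")
# de <- dictionary("de_DE")

# Or pass the full path to the .dic file directly
# (the .aff file is expected alongside it):
# de <- dictionary("~/my_dicts/de_DE.dic")

# The bundled US English dictionary always works:
print(dictionary("en_US"))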
Note that hunspell uses iconv to convert input text to
the encoding used by the dictionary. This will fail if the text contains characters
which are unsupported by that particular encoding. For this reason UTF-8 dictionaries
are preferable over legacy 8-bit dictionaries.
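
To see which encoding a dictionary uses before feeding it text, hunspell_info
can be queried; the $encoding field name below is an assumption about the shape
of the list that hunspell_info returns:

library(hunspell)

# Inspect the bundled en_US dictionary
info <- hunspell_info(dictionary("en_US"))
print(info$encoding)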
# Check individual words
words <- c("beer", "wiskey", "wine")
correct <- hunspell_check(words)
print(correct)

# Find suggestions for incorrect words
hunspell_suggest(words[!correct])

# Extract incorrect words from a piece of text
bad <- hunspell("spell checkers are not neccessairy for langauge ninja's")
print(bad[[1]])
hunspell_suggest(bad[[1]])

# Stemming
words <- c("love", "loving", "lovingly", "loved", "lover", "lovely", "love")
hunspell_stem(words)
hunspell_analyze(words)

# Check an entire latex document
setwd(tempdir())
download.file("https://arxiv.org/e-print/1406.4806v1", "1406.4806v1.tar.gz",
  mode = "wb")
untar("1406.4806v1.tar.gz")
text <- readLines("content.tex", warn = FALSE)
bad_words <- hunspell(text, format = "latex")
sort(unique(unlist(bad_words)))

# Summarize text by stems (e.g. for wordcloud)
allwords <- hunspell_parse(text, format = "latex")
stems <- unlist(hunspell_stem(unlist(allwords)))
words <- head(sort(table(stems), decreasing = TRUE), 200)