knitr::opts_chunk$set(echo = TRUE)
Let's start by pulling in some text. In this case, we have a snippet of the Wikipedia entry for "biology".
con <- file("wikipedia_biology.txt")
txt <- readLines(con)
close(con)
print(txt)
We can see how the higher-level functions in the package parse a text by running the custom parsing function vec2words. This function uses logic that preserves some important biological information, like the "+" and "-" at the end of immunological terms such as CD4+ and IL-2-. In this way, the tools are biology-specific, which you can see by looking at the unit test for vec2words in the GitHub repo: https://github.com/RGLab/corpusFreq/blob/main/tests/testthat/test-vec2words.R
words <- unique(vec2words(txt))
print(words)
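To make the idea of preserving a trailing "+" or "-" concrete, here is a minimal base R sketch. It is not the package's vec2words implementation: the helper name sketch_tokenize, the regular expression, and the lowercasing step are all assumptions made for illustration.

```r
# Hypothetical helper, NOT the package's vec2words(): split text into
# lowercase tokens while keeping a trailing "+" or "-" on a token.
sketch_tokenize <- function(x) {
  # A token is a run of letters/digits, optionally joined by internal
  # hyphens, optionally followed by a single trailing "+" or "-".
  pattern <- "[[:alnum:]]+(-[[:alnum:]]+)*[+-]?"
  tokens <- unlist(regmatches(x, gregexpr(pattern, x)))
  tolower(tokens)
}

toy_tokens <- sketch_tokenize("CD4+ and IL-2- cells")
print(toy_tokens)
```

A plain split on punctuation would reduce CD4+ to "cd4" and break IL-2- apart, which is the loss of information the package's parser is described as avoiding.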
The makeFreqTbl function uses vec2words to first parse the text into words, then generates a frequency table of those words.
ft <- makeFreqTbl(txt)
print(ft)
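As a rough picture of what a word frequency table involves, the following base R sketch counts occurrences in a toy word vector. The helper sketch_freq_table and the toy data are hypothetical, and the object returned by makeFreqTbl may be structured differently.

```r
# Hypothetical helper, not makeFreqTbl(): count how often each word occurs.
sketch_freq_table <- function(words) {
  counts <- table(words)
  # Put the most frequent words first
  sort(counts, decreasing = TRUE)
}

toy_words <- c("cell", "gene", "cell", "protein", "cell", "gene")
ft_sketch <- sketch_freq_table(toy_words)
print(ft_sketch)
```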
Now, we can look at how interactiveSpellCheck works under the hood by correcting the badSentence below.
badSentence <- "This is a baad biologickal terme. I should try a different set of varietions. Selectiv pressure may be exerted."

# under the hood, interactiveSpellCheck first parses the input
words <- vec2words(badSentence)
print(words)

# Then, a suggestion list is created using both `hunspell` and `corpusFreq`
suggLs <- corpusFreq:::makeSuggList(words, ft)
print(suggLs)

# The interactive piece of the interactiveSpellCheck then allows the user to select
# to replace one of the misspelled words with one of the suggestions, either from hunspell::suggest()
# or corpusFreq. The console session will look like the following:

# NOTES:
# leaving the replacement field blank means do not replace.
# Entering 'f' uses frequency table suggestion
# Entering 'd' uses dictionary suggestion
#
# frequency table and dictionary suggestions only shown if present.

# word not found: baad
# Frequency Table Suggestion: and
# Dictionary Suggestion: baas
# enter replacement for baad: bad
#
# word not found: biologickal
# Frequency Table Suggestion: biologickal
# Dictionary Suggestion: biological
# enter replacement for biologickal: d
#
# word not found: terme
# Frequency Table Suggestion: term
# Dictionary Suggestion: none
# enter replacement for terme: f
#
# word not found: varietions
# Frequency Table Suggestion: variations
# Dictionary Suggestion: none
# enter replacement for varietions: f
#
# word not found: selectiv
# Frequency Table Suggestion: selective
# Dictionary Suggestion: none
# enter replacement for selectiv: f
#
# [1] "This is a bad biological term. I should try a different set of variations. selective pressure may be exerted."
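One plausible way to derive a suggestion from a frequency table is to pick the known word with the smallest edit distance to the misspelling, breaking ties in favor of more frequent words. The sketch below illustrates that idea only. It is an assumption, not a description of how corpusFreq:::makeSuggList works, and sketch_suggest and toy_freq are hypothetical names.

```r
# Hypothetical helper, not corpusFreq:::makeSuggList(): suggest the known
# word closest to `word` by edit distance, preferring frequent words on ties.
sketch_suggest <- function(word, freq) {
  # freq: named integer vector; names are known words, values are counts
  d <- drop(utils::adist(word, names(freq)))
  best <- order(d, -freq)[1]
  names(freq)[best]
}

toy_freq <- c(term = 4L, variations = 2L, selective = 1L, biology = 9L)
print(sketch_suggest("terme", toy_freq))
print(sketch_suggest("varietions", toy_freq))
```

A corpus-derived suggestion of this kind can offer domain words that a general dictionary misses, which matches the session above where the dictionary returns "none" for terme but the frequency table offers term.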
Note above that the input object is returned with the corrections applied, so it is easy to use in a workflow. An output file is also created with all the substitutions, which you can then bring into an Rmd for reproducibility.
Example of the output file created:
# work done on 2018-04-04 10:56:23
spellCheckRes <- function(x){
  x <- gsub(pattern = 'baad', replacement = 'bad', x, ignore.case = TRUE)
  x <- gsub(pattern = 'biologickal', replacement = 'biological', x, ignore.case = TRUE)
  x <- gsub(pattern = 'terme', replacement = 'term', x, ignore.case = TRUE)
  x <- gsub(pattern = 'varietions', replacement = 'variations', x, ignore.case = TRUE)
  x <- gsub(pattern = 'selectiv', replacement = 'selective', x, ignore.case = TRUE)
}
Once we have included the function generated by the output file, either by copy-pasting or sourcing the file, we can use the function to reproducibly generate our goodSentence, which is the corrected version of the badSentence.
goodSentence <- spellCheckRes(badSentence)
print(goodSentence)
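One detail of the generated substitutions is worth knowing: because each gsub call matches with ignore.case = TRUE but inserts the replacement exactly as written, a capitalized misspelling such as "Selectiv" comes back in lowercase. The sketch below reproduces that with a single substitution and then shows one possible fix-up. The recapitalization step is a hypothetical addition, not something the package does, and it only handles the first character of the string.

```r
# Same call shape as one line of the generated spellCheckRes()
x <- "Selectiv pressure may be exerted."
fixed <- gsub(pattern = 'selectiv', replacement = 'selective', x, ignore.case = TRUE)
print(fixed)

# Hypothetical follow-up step, not part of the package: restore the capital
# letter at the very start of the string (perl = TRUE enables \\U).
recapitalized <- sub("^([[:lower:]])", "\\U\\1", fixed, perl = TRUE)
print(recapitalized)
```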