knitr::opts_chunk$set(echo = TRUE)

Creating a corpus frequency table from scratch

Let's start by reading in some text. In this case, we use a snippet of the Wikipedia entry for "biology".

con <- file("wikipedia_biology.txt")
txt <- readLines(con)
close(con)
print(txt)

We can see how the higher-level functions in the package parse a text by running the custom parsing function vec2words. This function preserves some important biological information, such as the "+" and "-" at the end of immunological terms like CD4+ and IL-2-. In this sense the tools are biology-specific, as the unit test for vec2words in the GitHub repo shows: https://github.com/RGLab/corpusFreq/blob/main/tests/testthat/test-vec2words.R

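# Load the package that provides vec2words() and makeFreqTbl()
library(corpusFreq)

# Parse the text into words and keep each distinct word once
words <- unique(vec2words(txt))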
print(words)
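
As a rough illustration of what sign-preserving parsing can look like, the sketch below tokenizes a string with a single base R regular expression. The helper `tokenizeKeepSigns` and its pattern are hypothetical and written only for this example. They are not the implementation of vec2words, which may treat case, punctuation, and edge cases differently.

```r
# Hypothetical helper for illustration only; NOT corpusFreq's vec2words().
# Assumption: a "word" is a run of alphanumerics, optionally joined by
# internal hyphens, optionally followed by one trailing "+" or "-".
tokenizeKeepSigns <- function(x) {
  pattern <- "[[:alnum:]]+(?:-[[:alnum:]]+)*[+-]?"
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

tokens <- tokenizeKeepSigns("CD4+ T cells secrete IL-2- signals.")
print(tokens)
```

A naive split on non-alphanumeric characters would instead reduce CD4+ to CD4 and break IL-2- into separate pieces, which is the kind of information loss the biology-specific parser is meant to avoid.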

The makeFreqTbl function first uses vec2words to parse the text into words, then generates a frequency table of those words.

ft <- makeFreqTbl(txt)
print(ft)
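
Conceptually, a word frequency table is just a count of how often each parsed word occurs. The sketch below builds one with base R from a small made-up word vector. The names `toyWords` and `toyFt` and the two-column layout are assumptions for illustration, and the object returned by makeFreqTbl may be structured differently.

```r
# Toy input standing in for the output of a word parser (made-up data).
toyWords <- c("cell", "biology", "cell", "cd4+", "biology", "cell")

# Count occurrences and order from most to least frequent.
counts <- sort(table(toyWords), decreasing = TRUE)

# Assumed layout: one row per word with its count.
toyFt <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
print(toyFt)
```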

Now, we can look at how interactiveSpellCheck works under the hood by correcting the badSentence below.

badSentence <- "This is a baad biologickal terme. I should try a different set of varietions. Selectiv pressure may be exerted."

# under the hood, interactiveSpellCheck first parses the input
words <- vec2words(badSentence)
print(words)

# Then, a suggestion list is created using both `hunspell` and `corpusFreq`
suggLs <- corpusFreq:::makeSuggList(words, ft)
print(suggLs)
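
# The block below is a sketch of one way a frequency-table suggestion could be
# derived; it is an assumption for illustration, not the logic of makeSuggList.
# It picks the corpus word with the smallest edit distance to the misspelling and
# breaks ties by corpus frequency. `suggestFromFreq`, `toyFreq`, and `maxDist`
# are hypothetical names, and the counts are made up.

```r
# Hypothetical sketch; NOT the corpusFreq implementation.
suggestFromFreq <- function(word, freqTbl, maxDist = 2) {
  # Edit distance from the misspelled word to every corpus word
  d <- as.vector(utils::adist(word, freqTbl$word))
  ok <- which(d <= maxDist)
  if (length(ok) == 0) {
    return(NA_character_)  # nothing close enough in the corpus
  }
  # Closest first; among equally close words prefer the more frequent one
  best <- ok[order(d[ok], -freqTbl$freq[ok])][1]
  freqTbl$word[best]
}

# Made-up frequency table for the example
toyFreq <- data.frame(
  word = c("term", "variations", "selective", "and"),
  freq = c(12L, 5L, 3L, 40L),
  stringsAsFactors = FALSE
)

print(suggestFromFreq("terme", toyFreq))
print(suggestFromFreq("varietions", toyFreq))
```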

# The interactive part of interactiveSpellCheck then lets the user replace each
# misspelled word with one of the suggestions, either from hunspell::hunspell_suggest()
# or from the corpusFreq frequency table. The console session looks like the following:

# NOTES:
# leaving the replacement field blank means do not replace.
# Entering 'f' uses frequency table suggestion
# Entering 'd' uses dictionary suggestion 
# 
# frequency table and dictionary suggestions only shown if present.
# word not found: baad
# Frequency Table Suggestion: and
# Dictionary Suggestion: baas
# enter replacement for baad: bad
# 
# word not found: biologickal
# Frequency Table Suggestion: biologickal
# Dictionary Suggestion: biological
# enter replacement for biologickal: d
# 
# word not found: terme
# Frequency Table Suggestion: term
# Dictionary Suggestion: none
# enter replacement for terme: f
# 
# word not found: varietions
# Frequency Table Suggestion: variations
# Dictionary Suggestion: none
# enter replacement for varietions: f
# 
# word not found: selectiv
# Frequency Table Suggestion: selective
# Dictionary Suggestion: none
# enter replacement for selectiv: f
# 
# [1] "This is a bad biological term. I should try a different set of variations. selective pressure may be exerted."
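
The rules listed in the NOTES of the session above (blank keeps the word, 'f' takes the frequency table suggestion, 'd' takes the dictionary suggestion, anything else is used as typed) can be sketched as a small pure function. `resolveReplacement` is a hypothetical helper written for this example. How interactiveSpellCheck handles 'f' or 'd' when no such suggestion exists is not shown in the session, so the fallback here, which treats the input as literal text, is an assumption.

```r
# Hypothetical helper; NOT part of corpusFreq.
resolveReplacement <- function(word, answer,
                               freqSugg = NA_character_,
                               dictSugg = NA_character_) {
  if (answer == "") {
    return(word)  # blank means: do not replace
  }
  if (answer == "f" && !is.na(freqSugg)) {
    return(freqSugg)  # use the frequency table suggestion
  }
  if (answer == "d" && !is.na(dictSugg)) {
    return(dictSugg)  # use the dictionary suggestion
  }
  answer  # assumption: anything else is taken as the literal replacement
}

print(resolveReplacement("baad", "bad", freqSugg = "and", dictSugg = "baas"))
print(resolveReplacement("biologickal", "d", dictSugg = "biological"))
print(resolveReplacement("terme", "f", freqSugg = "term"))
```

In an interactive session, the `answer` argument would come from something like `readline()`. Keeping the decision logic separate from the prompt makes it easy to test.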

Note that the input is returned with the corrections applied, so the function fits easily into a workflow. An output file recording all the substitutions is also created, which you can then bring into an Rmd for reproducibility.

Example of the output file created:

# work done on 2018-04-04 10:56:23 
spellCheckRes <- function(x){
    x <- gsub(pattern = 'baad', replacement = 'bad', x, ignore.case = TRUE)
    x <- gsub(pattern = 'biologickal', replacement = 'biological', x, ignore.case = TRUE)
    x <- gsub(pattern = 'terme', replacement = 'term', x, ignore.case = TRUE)
    x <- gsub(pattern = 'varietions', replacement = 'variations', x, ignore.case = TRUE)
    x <- gsub(pattern = 'selectiv', replacement = 'selective', x, ignore.case = TRUE)
}

Once the function from the output file is available, either by copy-pasting it or by sourcing the file, we can use it to reproducibly generate goodSentence, the corrected version of badSentence.

goodSentence <- spellCheckRes(badSentence)
print(goodSentence)
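
One caveat with plain substring patterns: a pattern such as 'selectiv' also matches inside an already correct "selective", which would then become "selectivee". The sketch below is one possible variation that anchors each pattern with word boundaries. `applySubs` and `subs` are hypothetical names for this example and are not produced by the package.

```r
# Hypothetical variation; NOT the file generated by interactiveSpellCheck.
# Applies each misspelling -> correction pair as a whole-word replacement.
applySubs <- function(x, subs) {
  for (bad in names(subs)) {
    x <- gsub(paste0("\\b", bad, "\\b"), subs[[bad]], x,
              ignore.case = TRUE, perl = TRUE)
  }
  x  # return the corrected text explicitly
}

# Named vector: names are misspellings, values are corrections
subs <- c(
  baad        = "bad",
  biologickal = "biological",
  terme       = "term",
  varietions  = "variations",
  selectiv    = "selective"
)

fixed <- applySubs("Selectiv pressure is selective.", subs)
print(fixed)
```

As in the generated function, `ignore.case = TRUE` means the replacement is inserted exactly as written, so a capitalized "Selectiv" comes back as lowercase "selective".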


RGLab/corpusFreq documentation built on May 1, 2021, 4:58 p.m.