fix_typos | R Documentation |
This function is meant to allow users to fix typos in strings that are not normally found in dictionaries.
fix_typos(typo_v, threshold, occ_ratio)
typo_v |
vector of strings that will have its typos cleaned |
threshold |
the maximum string distance used to identify typos. This argument is specified as the maximum percentage of a typo's characters that may be insertions, additions, deletions, and transpositions. |
occ_ratio |
the minimum ratio of occurrences of a correctly spelled word to occurrences of its typo. This argument helps to weed out words that are similar but valid. For example, commonly occurring valid names such as Adam and Amy will not be recognized as typos, even though they are similar, because they both appear often. Typos are recognized by their similarity to a frequent word in addition to their infrequent occurrence. |
There are great tools like the hunspell package that allow users to fix typos for words found in dictionaries, but these functions struggle with strings like proper nouns and other specific terminology not usually found in common dictionaries. This function uses the text being cleaned as its own dictionary. It finds probable correctly spelled words based on their high occurrence and finds typos based on their low occurrence. This is based on the theory that typos will appear as infrequently used words because no one uses them on purpose, and that they will be a short string distance from commonly occurring, correctly spelled words.
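The idea described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: sketch_fix_typos is a hypothetical helper, the default values are arbitrary, threshold is assumed to be a fraction of the candidate typo's length, and utils::adist is swapped in as the distance measure (it counts insertions, deletions, and substitutions, and does not treat a transposition as a single edit).

```r
# Hypothetical helper for illustration only; NOT the package's fix_typos().
# Assumes `threshold` is a fraction (0-1) of the candidate typo's length and
# `occ_ratio` is how many times more often the correct word must appear.
sketch_fix_typos <- function(words, threshold = 0.25, occ_ratio = 3) {
  counts <- table(words)          # the text itself acts as the dictionary
  vocab  <- names(counts)
  freq   <- as.integer(counts)

  # Pairwise Levenshtein distances between the distinct words.
  d <- utils::adist(vocab)

  fixed <- words
  for (i in seq_along(vocab)) {
    # Similar: a different word within the allowed relative distance.
    close    <- d[i, ] > 0 & d[i, ] / nchar(vocab[i]) <= threshold
    # Frequent: appears at least occ_ratio times as often as the candidate typo.
    frequent <- freq >= occ_ratio * freq[i]
    cand     <- which(close & frequent)
    if (length(cand) > 0) {
      # Replace the rare word with its most frequent close neighbour.
      best <- cand[which.max(freq[cand])]
      fixed[which(words == vocab[i])] <- vocab[best]
    }
  }
  fixed
}

x <- c(rep("Katherine", 6), "Katherin", rep("Adam", 5), rep("Amy", 5))
result <- sketch_fix_typos(x)
table(result)
```

In this toy input, "Katherin" occurs once and is one edit away from the frequent "Katherine", so it is replaced. "Adam" and "Amy" are left alone because they are both too far apart relative to their length and both occur often.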
reformatted vector with typos replaced with correctly spelled words
if(interactive()){ #EXAMPLE1 }