View source: R/A01.cleanCorpus.R
cleanDocument
Cleans the HC Corpus document.
cleanDocument(rawDocument)
rawDocument: the metadata and content for the file to be analyzed
This function reads a corpus document and performs a series of conditioning, reshaping, normalization, and clean up tasks. The conditioning tasks include:
Control: Replace UTF-8 control characters with spaces
Quotations: Convert non-standard quotation marks to single quotes
Encoding: Convert the document from UTF-8 to ASCII encoding
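The conditioning steps above can be sketched in base R. This is a minimal illustration, not the package's actual code: the helper name condition_text is hypothetical, and the sketch assumes that "non-standard quotations" means curly single and double quotes.

```r
# Hypothetical helper illustrating the three conditioning steps.
condition_text <- function(x) {
  # Control: replace control characters (tabs, newlines, etc.) with spaces
  x <- gsub("[[:cntrl:]]", " ", x)
  # Quotations: map curly quotes to a plain single quote
  # (assumption: these are the "non-standard" quotation marks)
  x <- gsub("[\u2018\u2019\u201C\u201D]", "'", x)
  # Encoding: convert UTF-8 to ASCII, dropping anything not representable
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

condition_text("It\u2019s\ta test")
```

Running the quotation step before the encoding step matters here: once iconv has dropped the non-ASCII characters, the curly quotes would be gone instead of converted.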
The reshaping task uses the quanteda package (https://cran.r-project.org/web/packages/quanteda/quanteda.pdf) to reshape the corpus documents into sentences.
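As a rough sketch of sentence reshaping, the snippet below uses quanteda's tokens() with what = "sentence" when the package is installed. The helper name split_sentences and the base R fallback are assumptions added for illustration; the package's own reshaping code may differ.

```r
# Hypothetical helper: split a character vector into sentences.
split_sentences <- function(text) {
  if (requireNamespace("quanteda", quietly = TRUE)) {
    # Sentence-level tokenization with quanteda
    toks <- quanteda::tokens(text, what = "sentence")
    unlist(as.list(toks), use.names = FALSE)
  } else {
    # Fallback (not quanteda): split on whitespace that follows ., ! or ?
    unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  }
}

split_sentences("It rained. We stayed home.")
```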
The normalization tasks include the following:
control: Remove ASCII control characters
nonprint: Remove ASCII non-printable characters
emails: Remove email addresses
urls: Remove URLs
hashtags: Remove Twitter hashtags
digits: Remove digits
punct: Remove punctuation except the apostrophe
longWords: Remove words 40 characters or longer
profanity: Remove profanity
correct: Correct contractions and common misspellings
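Most of the normalization tasks above are pattern replacements, which can be sketched with gsub(). The helper name normalize_sentence and every regular expression below are illustrative assumptions, not the package's patterns. The sketch assumes ASCII input, and it omits the profanity and correction steps because those need word lists that are not shown here.

```r
# Hypothetical helper illustrating the regex-based normalization steps.
normalize_sentence <- function(x) {
  x <- gsub("[[:cntrl:]]", " ", x, perl = TRUE)                # control characters
  x <- gsub("[^[:print:]]", " ", x, perl = TRUE)               # non-printable characters
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)                  # email addresses
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)     # URLs
  x <- gsub("#\\w+", " ", x, perl = TRUE)                      # hashtags
  x <- gsub("[[:digit:]]+", " ", x, perl = TRUE)               # digits
  x <- gsub("[^[:alnum:][:space:]']", " ", x, perl = TRUE)     # punctuation except '
  x <- gsub("[[:alpha:]']{40,}", " ", x, perl = TRUE)          # words of 40+ characters
  x
}

normalize_sentence("Email me@example.com or visit http://example.com #fun in 2024, don't wait!")
```

Emails, URLs, and hashtags are removed before punctuation so that the "@", "/", and "#" characters are still present when those patterns run. Each match is replaced with a space, so the result still needs whitespace clean up.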
Finally, the clean up tasks include:
whiteSpace: Remove extra whitespace from documents
punct: Remove stray apostrophes and punctuation
emptySent: Remove empty sentences
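The clean up tasks can be sketched in base R as follows. The helper name clean_up is hypothetical, and the sketch treats an apostrophe as "stray" when it does not sit between two letters, which is an assumption. It also removes stray apostrophes before collapsing whitespace, so the replacement spaces do not leave gaps behind.

```r
# Hypothetical helper illustrating the final clean up pass.
clean_up <- function(sentences) {
  # Stray apostrophes: any ' lacking a letter on either side
  sentences <- gsub("(?<![[:alpha:]])'|'(?![[:alpha:]])", " ", sentences, perl = TRUE)
  # Extra whitespace: collapse runs and trim both ends
  sentences <- trimws(gsub("\\s+", " ", sentences))
  # Empty sentences: keep only non-empty strings
  sentences[nzchar(sentences)]
}

clean_up(c("  don't   stop ' now ", "   ", "'hello'"))
```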
cleanDocument: the cleaned text document, in unlisted vector format.
John James, j2sdatalab@gmail.com
Other text processing functions: analyzeCorpus, cleanCorpus, extractLines, getCorpus, getStats, summarizeAnalysis