| check_text | R Documentation |
check_text - Uncleaned text may result in errors, warnings, and
incorrect results in subsequent analysis. check_text checks text for
potential problems and suggests possible fixes. Potential text anomalies
that are detected include: factors, missing ending punctuation, empty cells,
double punctuation, non-space after comma, no alphabetic characters,
non-ASCII, missing value, and potentially misspelled words.
available_checks - Provide a data.frame view of all the available
checks in the check_text function.
check_text(x, file = NULL, checks = NULL, n = 10, ...)
available_checks()
x |
The text variable. |
file |
A connection, or a character string naming the file to print to.
If |
checks |
A vector of checks to include from available_checks() |
n |
The number of affected elements to print out (the rest are truncated). |
... |
ignored. |
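The arguments above can be combined as in the following minimal sketch. It assumes the textclean package is installed (the call is guarded so the snippet does nothing otherwise), and the sample vector and the report_path name are hypothetical, chosen only for illustration. It relies on file behaving as described above, that is, as the place the printed report is written to.

```r
# Hypothetical sample text; any character vector would do.
x <- c("i like", "I am ! that|", "3;", "I like goud eggs!")

# Guarded so the snippet is a no-op when textclean is not installed.
if (requireNamespace("textclean", quietly = TRUE)) {
    two_checks <- c("non_split_sentence", "no_endmark")

    # Run only two checks and print at most 2 affected elements per check.
    print(textclean::check_text(x, checks = two_checks, n = 2))

    # Same checks, but with the report directed to a temporary file
    # (report_path is an illustrative name, not part of the package).
    report_path <- tempfile(fileext = ".txt")
    print(textclean::check_text(x, checks = two_checks, file = report_path))
}
```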
Returns a list reporting the following potential text faults:
contraction |
- Text elements that contain contractions |
date |
- Text elements that contain dates |
digit |
- Text elements that contain digits/numbers |
email |
- Text elements that contain email addresses |
emoticon |
- Text elements that contain emoticons |
empty |
- Text elements that contain empty text cells (all white space) |
escaped |
- Text elements that contain backslash-escaped characters |
hash |
- Text elements that contain Twitter style hash tags (e.g., #rstats) |
html |
- Text elements that contain HTML markup |
incomplete |
- Text elements that contain incomplete sentences (e.g., uses ending punctuation like ...) |
kern |
- Text elements that contain kerning (e.g., 'The B O M B!') |
list_column |
- Text variable that is a list column |
missing_value |
- Text elements that contain missing values |
misspelled |
- Text elements that contain potentially misspelled words |
no_alpha |
- Text elements that contain no alphabetic (a-z) letters |
no_endmark |
- Text elements that are missing ending punctuation |
no_space_after_comma |
- Text elements that contain commas with no space afterwards |
non_ascii |
- Text elements that contain non-ASCII text |
non_character |
- Text variable that is not a character column (likely a factor) |
non_split_sentence |
- Text elements that contain unsplit sentences (more than one sentence per element) |
tag |
- Text elements that contain Twitter style handle tags (e.g., @trinker) |
time |
- Text elements that contain timestamps |
url |
- Text elements that contain URLs |
The output is a list containing meta checks and elemental checks. It prints as formatted output showing the potential problem elements, the accompanying text, and suggestions for fixing the text.
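To make the kind of fault listed above more concrete, here is a rough base R sketch of what a few of the elemental checks look for. It is not textclean's implementation: the function name sketch_checks and every regular expression in it are assumptions made for illustration, and the real checks may use different rules (for example, a different set of end marks).

```r
# Illustrative approximations of a few checks, written in base R only.
# sketch_checks and its regexes are assumptions for this sketch,
# NOT the rules textclean actually applies.
sketch_checks <- function(x) {
    stopifnot(is.character(x))
    ok <- !is.na(x)  # only non-missing elements can be pattern-matched
    list(
        # elements that are NA
        missing_value = which(!ok),
        # elements that are empty or all white space
        empty = which(ok & grepl("^[[:space:]]*$", x)),
        # elements without any a-z letter (either case)
        no_alpha = which(ok & !grepl("[A-Za-z]", x)),
        # elements not ending in ., ?, or ! (trailing white space ignored)
        no_endmark = which(ok & !grepl("[.?!][[:space:]]*$", x)),
        # elements with a comma followed directly by a non-space character
        no_space_after_comma = which(ok & grepl(",[^ ]", x))
    )
}

x <- c("i like", "I am ! that|", "", NA, "\"they\",were there",
    " ", "3;", "I like goud eggs!")
res <- sketch_checks(x)
str(res)
```

Each entry of res holds the positions of the affected elements, which loosely mirrors the idea of a per-check report of problem elements.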
## Not run:
v <- list(c('foo', 'bar'), NA, c('hello', 'world'))
check_text(v)
w <- factor(unlist(v))
check_text(w)
x <- c("i like", "<p>i want. </p>thet them ther .", "I am ! that|", "", NA,
    "\"they\",were there", ".", " ", "?", "3;", "I like goud eggs!",
    "i 4like...", "\\tgreat", 'She said "yes"')
check_text(x)
print(check_text(x), include.text=FALSE)
check_text(x, checks = c('non_split_sentence', 'no_endmark'))
elementals <- available_checks()[is_meta != TRUE,][['fun']]
check_text(
    x,
    checks = elementals[
        !elementals %in% c('non_split_sentence', 'no_endmark')
    ]
)
y <- c("A valid sentence.", "yet another!")
check_text(y)
z <- rep("dfsdsd'nt", 120)
check_text(z)
## End(Not run)