check_text: Check Text For Potential Problems

View source: R/check_text.R

check_text    R Documentation

Description

check_text - Uncleaned text may result in errors, warnings, and incorrect results in subsequent analysis. check_text checks text for potential problems and suggests possible fixes. Potential text anomalies that are detected include: factors, missing ending punctuation, empty cells, double punctuation, non-space after comma, no alphabetic characters, non-ASCII, missing value, and potentially misspelled words.

available_checks - Provide a data.frame view of all the available checks in the check_text function.

Usage

check_text(x, file = NULL, checks = NULL, n = 10, ...)

available_checks()

Arguments

x

The text variable.

file

A connection, or a character string naming the file to print to. If NULL, prints to the console. Note that this is assigned as an attribute and passed to print.

checks

A vector of checks to include from which_are. If checks = NULL, all checks from which_are will be used. Note that all meta checks will be conducted (see which_are for details on meta checks).

n

The number of affected elements to print out (the rest are truncated).

...

Ignored.
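
As a rough illustration of how these arguments combine, the sketch below caps the printed elements with n and sends the report to a file. The sample vector and the temporary file name are made up for the example, and the calls are guarded so that they only run when textclean is installed.

```r
## Hypothetical sample text; any character vector would do
x <- c("i like", "I am ! that|", "", NA, "I like goud eggs!")

## Assumed destination for the report (a throwaway temp file)
report_file <- tempfile(fileext = ".txt")

if (requireNamespace("textclean", quietly = TRUE)) {
    ## Print at most 2 affected elements per check to the console
    print(textclean::check_text(x, n = 2))

    ## Same report, but 'file' is handed on to print(), so the
    ## output should be written to report_file rather than the console
    print(textclean::check_text(x, file = report_file, n = 2))
}
```

Because file is stored as an attribute and only used at print time, nothing should be written until the result is actually printed.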

Value

Returns a list reporting the following potential text faults:

contraction

- Text elements that contain contractions

date

- Text elements that contain dates

digit

- Text elements that contain digits/numbers

email

- Text elements that contain email addresses

emoticon

- Text elements that contain emoticons

empty

- Text elements that contain empty text cells (all white space)

escaped

- Text elements that contain escaped backslash characters (e.g., a literal \t)

hash

- Text elements that contain Twitter style hash tags (e.g., #rstats)

html

- Text elements that contain HTML markup

incomplete

- Text elements that contain incomplete sentences (e.g., uses ending punctuation like ...)

kern

- Text elements that contain kerning (e.g., 'The B O M B!')

list_column

- Text variable that is a list column

missing_value

- Text elements that contain missing values

misspelled

- Text elements that contain potentially misspelled words

no_alpha

- Text elements that contain no alphabetic (a-z) letters

no_endmark

- Text elements that are missing ending punctuation

no_space_after_comma

- Text elements that contain commas with no space afterwards

non_ascii

- Text elements that contain non-ASCII text

non_character

- Text variable that is not a character column (likely factor)

non_split_sentence

- Text elements that contain unsplit sentences (more than one sentence per element)

tag

- Text elements that contain Twitter style handle tags (e.g., @trinker)

time

- Text elements that contain timestamps

url

- Text elements that contain URLs
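
To make a few of the faults above concrete, here is a minimal base R sketch of what the empty, missing_value, no_alpha, and no_space_after_comma checks look for. The regular expressions are assumptions chosen for illustration; they are not textclean's own implementation and may disagree with it on edge cases.

```r
## Hypothetical sample text
x <- c("i like", "&quot;they&quot;,were there", "   ", "3;", NA)

## missing_value: elements that are NA
missing_value <- which(is.na(x))

## empty: non-missing elements made up only of white space
empty <- which(!is.na(x) & grepl("^[[:space:]]*$", x))

## no_alpha: non-missing elements with no alphabetic letters
no_alpha <- which(!is.na(x) & !grepl("[A-Za-z]", x))

## no_space_after_comma: a comma directly followed by a non-space
no_space_after_comma <- which(grepl(",[^[:space:]]", x))

## Collect the flagged positions in a named list
faults <- list(
    missing_value = missing_value,
    empty = empty,
    no_alpha = no_alpha,
    no_space_after_comma = no_space_after_comma
)
print(faults)
```

Each entry holds the positions of the affected elements, which is enough to subset x and review the offending text by hand.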

Note

The output is a list containing meta checks and elemental checks, but it prints as pretty-formatted output showing the potential problem elements, the accompanying text, and possible suggestions to fix the text.
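
The difference between the stored list and the printed report can be sketched as follows. This assumes the list is named by check, which the Value section suggests but does not state outright. The sample vector is hypothetical, and the calls only run when textclean is installed.

```r
## Hypothetical sample text
x <- c("i like", "", NA, "I like goud eggs!")

if (requireNamespace("textclean", quietly = TRUE)) {
    res <- textclean::check_text(x)

    ## Look at the underlying list rather than the formatted report
    str(unclass(res), max.level = 1)

    ## Assumption: the names correspond to the checks that were recorded
    print(names(res))

    ## The formatted report, without the accompanying text
    print(res, include.text = FALSE)
}
```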

Examples

## Not run: 
v <- list(c('foo', 'bar'), NA, c('hello', 'world'))
check_text(v)

w <- factor(unlist(v))
check_text(w)

x <- c("i like", "<p>i want. </p>thet them ther .", "I am ! that|", "", NA, 
    "&quot;they&quot;,were there", ".", "   ", "?", "3;", "I like goud eggs!", 
    "i 4like...", "\\tgreat",  'She said "yes"')
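check_text(x)
## Print the report without the accompanying text
print(check_text(x), include.text=FALSE)
## Run only the named checks
check_text(x, checks = c('non_split_sentence', 'no_endmark'))
## Run every elemental (non-meta) check except the two named above
elementals <- available_checks()[is_meta != TRUE,][['fun']]
check_text(
    x, 
    checks = elementals[
        !elementals %in% c('non_split_sentence', 'no_endmark')
    ]
)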

y <- c("A valid sentence.", "yet another!")
check_text(y)

z <- rep("dfsdsd'nt", 120)
check_text(z)

## End(Not run)

textclean documentation built on March 5, 2026, 9:06 a.m.