knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(lexR)
The first set of functions from the lexR package are meant to speed up the text cleaning process by setting specific defaults and applying several distict commands at once.
The function clean_tags() removes all HTML tags from a string:
tagged <- "<div class='class'> <p> This is an example of <bold>text with HTML tags</bold> which is therefore <red>difficult to read</red> and full of useless information. </p> </div>" print(tagged)
Let's apply the clean_tags() function:
cleaned <- lexR::clean_tags(tagged) print(cleaned)
Another frequently occuring issue are improper spacing: spaces before punctuation or none after; dashes separated from the words they connect by spaces; spaces after parentheses or not before, etc. The function clean_spaces enforce proper spacing:
improper_spacing <- "In this text ,there are severial issues( all of them about spacing ) which require some auto - correction ." print(improper_spacing)
Let's apply the clean_tags() function:
proper_spacing <- lexR::clean_spaces(improper_spacing) print(proper_spacing)
Another frequent issues when texts come from external documents is that paragraphs may be broken. The function clean_paragraphs() identifies what might be a broken paragraph (e.g. an sentences does not end with a full stop and there is no line break afterwards):
broken <- c( "This paragraph has", "several sentences.", "This is", "one.", "And this is another.", "", "The next", "paragraph is different.", "And it should be", "recomposed and kept distinct", "from the previous paragraph.", "", "There were the two first", "paragrahps (deliberate typo)", "of this example." ) print(broken)
Let's apply the clean_paragraphs() function:
recomposed <- lexR::clean_paragraphs(texts = broken) print(recomposed)
Note that contrary to all the other functions presented here, clean_paragraphs() takes a character vector and not a single string of characters.
Sometimes, the analyst does not want to analyze complete texts, but the surroundings of a specific term. The function clean_windows() extracts and returns windows of n words around a regex pattern:
merged <- paste(recomposed, collapse = " ") windows <- lexR::clean_windows(text = merged, pattern = "paragraph", before = 2, after = 2) print(windows)
Note that the number of words before and after can be adjusted separately:
windows <- lexR::clean_windows(text = merged, pattern = "paragraph", before = 3, after = 1) print(windows)
Moreover, a regex pattern can be used to capture slight variations arounf a pattern:
windows <- lexR::clean_windows(text = merged, pattern = "paragra*", before = 3, after = 1) print(windows)
The function clean_replace() can be used to replace synonyms with a common term, or even to correct frequent mistakes:
corrections <- tibble::tibble( pattern = "paragrahps", replacement = "paragraphs" ) corrected <- clean_replace(text = merged, replacements = corrections)
As you can see, the typo is not corrected:
windows <- lexR::clean_windows(text = corrected, pattern = "paragraph", before = 3, after = 1) print(windows)
Note that this function requires exact matches: it does not rely on regular expressions.
Sometimes, the text is not encoded in ASCII and special charactes can results in bugs because the encoding is not supported. To mitigate this issue, the function clean_ascii() replaces non-ASCII characters by a close equivalent en force the ASCI encoding on the text:
non_ascii <- "Ils sont arrivés à 12:30 à l'hôpital où leur garçon, Gwenaël, devait être soigné, et ils y patientèrent 2 heures." validUTF8(non_ascii)
Let's apply the clean_tags() function:
ascii <- lexR::clean_ascii(non_ascii) validUTF8(ascii) print(ascii)
Finally, some analyses do not require much information beyond words. The function clean_letters removes any punctuation, number, etc. and convert the string to lowecase letters:
lowercase_letters <- lexR::clean_letters(ascii) print(lowercase_letters)
The function clean_spaces() can then buse to to remove all unnecessary spaces:
lowercase_letters <- lexR::clean_spaces(lowercase_letters) print(lowercase_letters)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.