preprocess.removeNonWordChars: Remove Non-Word Characters

View source: R/preprocess_pipeline.R

preprocess.removeNonWordChars    R Documentation

Remove Non-Word Characters

Description

This function preprocesses a character vector by removing non-word characters and reports the mean number of characters before and after preprocessing.
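The reported statistic can be pictured with a small sketch. The helper `report_mean_chars` below is hypothetical and not part of ktools; it only illustrates, under that assumption, how mean character counts before and after cleaning might be computed with base R.

```r
# Hypothetical helper (not from ktools): compares mean character counts
# of a character vector before and after some cleaning step.
report_mean_chars <- function(before, after) {
  stats <- c(
    before = mean(nchar(before)),
    after  = mean(nchar(after))
  )
  message(sprintf("Mean characters: %.1f before, %.1f after",
                  stats[["before"]], stats[["after"]]))
  invisible(stats)
}

raw     <- c("Hello, world!", "R is fun...")
cleaned <- gsub("[[:punct:]]", "", raw)   # stand-in for the real preprocessing
stats   <- report_mean_chars(raw, cleaned)
```

Here `gsub("[[:punct:]]", "", raw)` merely stands in for the actual preprocessing; the package function may report its statistics in a different form.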

Usage

preprocess.removeNonWordChars(
  text,
  rm.hashtags = FALSE,
  rm.mentions = FALSE,
  rm.emoji = FALSE,
  rm.digitwords = FALSE,
  join.hyphenation = FALSE
)

Arguments

text

A character vector that will be preprocessed.

rm.hashtags

A logical, defining if #hashtags should be removed.

rm.mentions

A logical, defining if @mentions should be removed.

rm.emoji

A logical, defining if emoji should be removed.

rm.digitwords

A logical, defining if all digits should be removed, including digitwords (e.g. 5G, T3).

join.hyphenation

A logical, defining if hyphenated words should be joined.
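To make the optional switches concrete, here is a rough sketch of what three of them could do. The function `strip_optional` and its regular expressions are assumptions made for illustration, not the package's implementation; in particular, "joining" hyphenated words is read here as dropping the hyphen between two word characters.

```r
# Hypothetical patterns for illustration; ktools may use different ones.
strip_optional <- function(x, rm.hashtags = FALSE, rm.mentions = FALSE,
                           join.hyphenation = FALSE) {
  if (rm.hashtags) x <- gsub("#\\w+", "", x, perl = TRUE)   # drop #hashtags
  if (rm.mentions) x <- gsub("@\\w+", "", x, perl = TRUE)   # drop @mentions
  if (join.hyphenation) {
    # remove a hyphen only when it sits between two word characters
    x <- gsub("(?<=\\w)-(?=\\w)", "", x, perl = TRUE)
  }
  # collapse the whitespace left behind by the removals
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

tweet <- "@rstats state-of-the-art tools #rstats"
strip_optional(tweet, rm.hashtags = TRUE, rm.mentions = TRUE,
               join.hyphenation = TRUE)
```

With all three switches left at FALSE, this sketch returns the input unchanged apart from whitespace normalization.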

Details

By default, URLs, HTML entities (e.g. &nbsp;), digit-words, apostrophized words, and all punctuation are removed.

Other preprocessing steps can be controlled via the arguments of the function.
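A minimal sketch of some of the default removals, written with base R `gsub`, might look as follows. The function name `strip_defaults` and every pattern in it are assumptions for illustration; the sketch covers only URLs, HTML entities, and punctuation, and does not reproduce the package's handling of digit-words or apostrophized words.

```r
# Illustrative only: these patterns are assumptions, not the ktools source.
strip_defaults <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)       # URLs
  x <- gsub("&#?[A-Za-z0-9]+;", " ", x, perl = TRUE)    # HTML entities such as &nbsp;
  x <- gsub("[[:punct:]]+", " ", x)                     # remaining punctuation
  x <- gsub("\\s+", " ", x, perl = TRUE)                # collapse whitespace
  trimws(x)
}

strip_defaults("Read this&nbsp;now: https://example.com/a?b=1 (really!)")
```

URLs are removed before punctuation so that the pattern can still recognize `://` and the path; running the steps in the opposite order would leave URL fragments behind as ordinary words.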

Value

A preprocessed character vector.

Examples

## Not run: 
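# Sample input; any character vector works
text <- c("Check https://example.com #rstats", "5G is here!")
preprocess.removeNonWordChars(
  text,
  rm.hashtags = FALSE,
  rm.mentions = FALSE,
  rm.emoji = FALSE,
  rm.digitwords = FALSE,
  join.hyphenation = FALSE
)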

## End(Not run)


Kudusch/ktools documentation built on Oct. 30, 2022, 10:13 p.m.