stri_process: A wrapper function for various preprocessing options for...
In manuelbickel/textility: Utility functions for text mining

stri_process

R Documentation

A wrapper function for various preprocessing options for strings

Description

A wrapper function for various preprocessing options for strings

Usage

stri_process(x, force_encoding = "UTF-8", alltolower = FALSE,
  erase_patterns = NULL, token_exclude_length = NULL,
  rm_diacritics = FALSE, replace_dashes_hyphens_by = NULL,
  rm_roman_numeral_listing = FALSE, replace_by_blank_regex = NULL,
  erase_regex = NULL, harmonize_blanks = FALSE)

Arguments

`x`	A `character` vector.
`force_encoding`	The encoding to be forced on the string.
`alltolower`	Turn all letters to lower case.
`erase_patterns`	Fixed (non-regex) patterns to be erased from the text as is. The `force_encoding` and `alltolower` settings are applied to these patterns before matching for removal.
`token_exclude_length`	Remove tokens, delimited by word boundaries, that have the specified number of characters or fewer.
`rm_diacritics`	Replace characters carrying diacritics with their ASCII counterparts.
`replace_dashes_hyphens_by`	Various forms of dashes and hyphens defined in the Unicode table (e.g., long dash, dash, hyphen) are replaced by the specified fixed pattern.
`rm_roman_numeral_listing`	Erase all brackets and their content if the content is a combination of i, v, and x. Higher Roman numerals also require M and C; however, this option targets the listings of lower numbers usually used in reports. More sophisticated replacements are possible with the regex parameters below.
`replace_by_blank_regex`	A regex pattern to be replaced by a blank. Use "|" to combine more than one pattern.
`erase_regex`	A regex pattern to be replaced by nothing, i.e., "". Use "|" to combine more than one pattern.
`harmonize_blanks`	Remove blanks at the beginning and end of a string and collapse sequences of multiple blanks into one.
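
Examples

The argument descriptions above can be illustrated with a minimal base-R sketch. It is not the package's implementation: `sketch_process` is a hypothetical helper, the order of the steps is an assumption, "brackets" are assumed to mean round brackets, and `force_encoding`, `rm_diacritics`, `replace_dashes_hyphens_by`, and `replace_by_blank_regex` are left out for brevity.

```r
# Hypothetical helper, NOT part of textility: approximates a few of the
# documented options with base R only. Step order is an assumption.
sketch_process <- function(x, alltolower = FALSE, erase_patterns = NULL,
                           token_exclude_length = NULL,
                           rm_roman_numeral_listing = FALSE,
                           erase_regex = NULL, harmonize_blanks = FALSE) {
  if (alltolower) {
    x <- tolower(x)
    # Per the docs, lower-casing is also applied to the fixed patterns.
    if (!is.null(erase_patterns)) erase_patterns <- tolower(erase_patterns)
  }
  # Fixed (non-regex) patterns are removed literally.
  for (p in erase_patterns) x <- gsub(p, "", x, fixed = TRUE)
  # Assumed reading: round brackets containing only i, v, x, e.g. (i), (iv).
  if (rm_roman_numeral_listing) {
    x <- gsub("\\([ivx]+\\)", "", x, ignore.case = TRUE)
  }
  # Drop whole tokens of at most n characters.
  if (!is.null(token_exclude_length)) {
    pattern <- sprintf("\\b\\w{1,%d}\\b", token_exclude_length)
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  # Regex matches are replaced by "", with "|" separating alternatives.
  if (!is.null(erase_regex)) x <- gsub(erase_regex, "", x, perl = TRUE)
  # Trim both ends, then collapse runs of blanks into one.
  if (harmonize_blanks) x <- gsub(" +", " ", trimws(x))
  x
}

sketch_process("  The (ii) Cats   of [REF] Rome ",
               alltolower = TRUE, erase_patterns = "[REF]",
               token_exclude_length = 2, rm_roman_numeral_listing = TRUE,
               harmonize_blanks = TRUE)
# "the cats rome"

sketch_process("a1b, c22", erase_regex = "[0-9]+|,")
# "ab c"
```

Removing the Roman numeral listings before the short tokens matters in this sketch: with `token_exclude_length = 2`, the reverse order would strip `ii` first and leave an empty `()` behind. Whether `stri_process` applies its steps in the same order is not stated in this page.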