stri_process: A wrapper function for various preprocessing options for...

View source: R/stri_process.R

stri_process    R Documentation

A wrapper function for various preprocessing options for strings

Description

A wrapper function for various preprocessing options for strings

Usage

stri_process(x, force_encoding = "UTF-8", alltolower = FALSE,
  erase_patterns = NULL, token_exclude_length = NULL,
  rm_diacritics = FALSE, replace_dashes_hyphens_by = NULL,
  rm_roman_numeral_listing = FALSE, replace_by_blank_regex = NULL,
  erase_regex = NULL, harmonize_blanks = FALSE)

Arguments

x

A character vector.

force_encoding

The encoding to be forced on the string.

alltolower

Turn all letters to lower case.

erase_patterns

Fixed (non-regex) patterns to be erased from the text as is. The force_encoding and alltolower settings are applied to these patterns before matching for removal.

token_exclude_length

Remove tokens that have the specified number of characters or fewer and are enclosed by word boundaries.
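
The idea can be illustrated in base R with a word-boundary regex. This is only a sketch, not the package's implementation: the helper name drop_short_tokens is hypothetical, and treating a token as a run of "\\w" characters is an assumption.

```r
# Hypothetical helper (not part of textility): erase tokens of max_len
# characters or fewer that are enclosed by word boundaries.
drop_short_tokens <- function(x, max_len) {
  pattern <- sprintf("\\b\\w{1,%d}\\b", max_len)
  gsub(pattern, "", x, perl = TRUE)
}

res_tokens <- drop_short_tokens("an ox ate the hay", 2)
# "an" and "ox" are erased; the blanks around them remain in the string
```

The leftover blanks are one reason to combine such a step with harmonize_blanks.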

rm_diacritics

Turn characters with diacritics into their ASCII equivalents.

replace_dashes_hyphens_by

Various forms of dashes and hyphens defined in the Unicode table (e.g., long dash, dash, hyphen) are replaced by the specified fixed pattern.

rm_roman_numeral_listing

Erase all brackets and their content if the bracket contains a combination of i, v, and x. Higher numerals also require M and C; however, the function targets the listings of lower numbers usually used in reports. More sophisticated regex replacements are possible with the regex parameters below.
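
A minimal base R sketch of this kind of removal follows. It assumes that "brackets" means round parentheses and that only lower-case numerals are matched; the helper name drop_roman_listing is hypothetical, and the pattern actually used by the package may differ.

```r
# Hypothetical helper: erase parenthesised listings such as (i), (iv), (xii).
# Assumption: round brackets containing only the letters i, v, and x.
drop_roman_listing <- function(x) {
  gsub("\\([ivx]+\\)", "", x)
}

res_roman <- drop_roman_listing("(i) scope, (ii) methods (see annex)")
# "(see annex)" is kept because it contains other characters
```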

replace_by_blank_regex

A regex pattern to be replaced by a blank. Use "|" to replace more than one pattern.

erase_regex

A regex pattern to be replaced by nothing, i.e., "". Use "|" to replace more than one pattern.
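
The difference between the two regex arguments can be sketched with base R's gsub(). The example string and patterns are made up for illustration, and the order in which stri_process applies the two steps is not stated here, so the order below is an assumption.

```r
raw <- "results_2020/2021 [draft]"

# Analogous to replace_by_blank_regex = "_|/": each match becomes a blank
blanked <- gsub("_|/", " ", raw)

# Analogous to erase_regex = "\\[|\\]": each match is replaced by ""
erased <- gsub("\\[|\\]", "", blanked)
```

Replacing by a blank keeps neighbouring tokens separated, whereas erasing joins whatever surrounded the match.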

harmonize_blanks

Remove blanks at the beginning and end of a string and collapse sequences of multiple blanks into one.
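
In base R, an equivalent effect can be sketched as follows. The helper name harmonize is hypothetical, and treating every whitespace character as a blank is an assumption.

```r
# Hypothetical helper: collapse runs of whitespace into a single blank,
# then strip leading and trailing blanks.
harmonize <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

res_blanks <- harmonize("  too   many    blanks  ")
```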

Value

The processed string.


manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.