clean_str: Replaces tokens and cleans a string, largely via regular-expression search and replace

View source: R/tokenify.R

clean_str	R Documentation

Description

Replaces tokens and cleans a string, largely via regular-expression search and replace. This is the default string cleaner applied before tokenization. It can be overridden in tokenizer_basic, tokenize_col, tokenize_df, etc. by passing a new function as pre_token_clean_str.
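As a rough sketch of the accent and punctuation steps described above (not the package's actual implementation), the same effect can be approximated in base R; the function name clean_sketch is an assumption:

```r
# A minimal base-R sketch of the accent and punctuation handling;
# clean_sketch is a hypothetical name, not the package's code.
clean_sketch <- function(x,
                         iconv_to = "ASCII//TRANSLIT",
                         punc_remove_patern = "[^[:alnum:][:cntrl:][:space:]_]",
                         punc_replace = " ") {
  x <- iconv(x, to = iconv_to)               # transliterate accents away
  gsub(punc_remove_patern, punc_replace, x)  # swap punctuation for spaces
}

clean_sketch("z. y. DO things montr\u00e8al")
```

The transliteration result for accented characters can vary by platform, which is why iconv_to is exposed as an argument.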

Usage

clean_str(
  x,
  ...,
  token_type,
  rep = read_replacements_token_type(token_type),
  remove_accents = TRUE,
  remove_punctuation = TRUE,
  iconv_to = "ASCII//TRANSLIT",
  punc_remove_patern = "[^[:alnum:][:cntrl:][:space:]_]",
  punc_replace = " ",
  new_token_wrapper = " "
)

Arguments

x

vector of strings

...

ignored; present only to force the following arguments to be passed by name

token_type

used to look up a default token-replacement table. No default.

rep

data frame with three columns specifying what to replace and what to replace it with. Default = read_replacements_token_type(token_type)

remove_accents

bool. If TRUE, accents are removed by transliterating with iconv. Default = TRUE.

remove_punctuation

bool. If TRUE, characters matching punc_remove_patern are replaced with punc_replace. Default = TRUE.

iconv_to

passed to iconv as the to argument when remove_accents is TRUE. Default = "ASCII//TRANSLIT".

punc_remove_patern

regular expression matching the punctuation to remove when remove_punctuation is TRUE. Default = "[^[:alnum:][:cntrl:][:space:]_]".

punc_replace

string that replaces each match of punc_remove_patern when remove_punctuation is TRUE. Default = " ".

new_token_wrapper

string. Placed on both sides of the new token. Default = " ".
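To illustrate how the three-column replacement table (rep) and new_token_wrapper might fit together, here is a hypothetical sketch; the column names (pattern, replacement, token_type) and the helper apply_replacements are assumptions, not the package's actual schema:

```r
# Hypothetical replacement table; the column names are assumed.
rep_tbl <- data.frame(
  pattern     = c("inc", "&"),
  replacement = c("incorporated", "and"),
  token_type  = "company_name"
)

# Apply each search/replace pair, padding the new token with the wrapper
# so it stays delimited from neighbouring tokens.
apply_replacements <- function(x, rep, new_token_wrapper = " ") {
  for (i in seq_len(nrow(rep))) {
    x <- gsub(rep$pattern[i],
              paste0(new_token_wrapper, rep$replacement[i], new_token_wrapper),
              x, fixed = TRUE)
  }
  x
}

apply_replacements("at&t", rep_tbl)
```

Wrapping each replacement in new_token_wrapper keeps the substituted token separated from adjacent text before tokenization.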

Examples

c('Z.Y. do things inc', 'z. y. DO things montrèal', 'at&t') |> clean_str(token_type = 'company_name')


csps-efpc/TokenLink documentation built on Feb. 10, 2023, 3:30 a.m.