clean_str | R Documentation
Replaces tokens and cleans a string, largely via regular-expression search and replace. This is the default string cleaner used before tokenization. It can be overridden in tokenizer_basic, tokenize_col, tokenize_df, etc. by passing a new function as pre_token_clean_str.
clean_str(
  x,
  ...,
  token_type,
  rep = read_replacements_token_type(token_type),
  remove_accents = TRUE,
  remove_punctuation = TRUE,
  iconv_to = "ASCII//TRANSLIT",
  punc_remove_patern = "[^[:alnum:][:cntrl:][:space:]_]",
  punc_replace = " ",
  new_token_wrapper = " "
)
x: Vector of strings to clean.

...: Ignored; used to ensure arguments are passed by keyword.

token_type: Used to try to load a default token replacement table. No default.

rep: Data frame with three columns indicating what to replace. Default: read_replacements_token_type(token_type).

remove_accents: Logical. Default TRUE.

remove_punctuation: Logical. Default TRUE.

iconv_to: Passed to iconv as the to argument when remove_accents is TRUE. Default "ASCII//TRANSLIT".

punc_remove_patern: Regex string matching the punctuation to remove when remove_punctuation is TRUE. Default "[^[:alnum:][:cntrl:][:space:]_]".

punc_replace: String that replaces all matched punctuation when remove_punctuation is TRUE. Default " ".

new_token_wrapper: String placed on both sides of the new token. Default " ".
c('Z.Y. do things inc', 'z. y. DO things montrèal', 'at&t') |> clean_str(token_type = 'company_name')