clean_str: Replaces tokens and cleans a string, largely via regular-expression search and replace

View source: R/tokenify.R

clean_str	R Documentation

Description

Replaces tokens and cleans a string, largely via regular-expression search and replace. This is the default string cleaner applied before tokenization. It can be overridden in tokenizer_basic, tokenize_col, tokenize_df, etc. by passing a new function as pre_token_clean_str.
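As a rough sketch of the accent and punctuation steps described above (not the package's actual implementation), the same effect can be approximated in base R; the function name clean_sketch is an assumption:

```r
# A minimal base-R sketch of the accent and punctuation handling;
# clean_sketch is a hypothetical name, not the package's code.
clean_sketch <- function(x,
                         iconv_to = "ASCII//TRANSLIT",
                         punc_remove_patern = "[^[:alnum:][:cntrl:][:space:]_]",
                         punc_replace = " ") {
  x <- iconv(x, to = iconv_to)               # transliterate accents away
  gsub(punc_remove_patern, punc_replace, x)  # swap punctuation for spaces
}

clean_sketch("z. y. DO things montr\u00e8al")
```

The transliteration result for accented characters can vary by platform, which is why iconv_to is exposed as an argument.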

Usage

clean_str(
  x,
  ...,
  token_type,
  rep = read_replacements_token_type(token_type),
  remove_accents = TRUE,
  remove_punctuation = TRUE,
  iconv_to = "ASCII//TRANSLIT",
  punc_remove_patern = "[^[:alnum:][:cntrl:][:space:]_]",
  punc_replace = " ",
  new_token_wrapper = " "
)

Arguments

x

vector of strings

...

ignored; present only to force the following arguments to be passed by name

token_type

used to look up a default token-replacement table. No default.

rep

data frame with three columns specifying what to replace and what to replace it with. Default = read_replacements_token_type(token_type)

remove_accents

bool. If TRUE, accents are removed by transliterating with iconv. Default = TRUE.

remove_punctuation

bool. If TRUE, characters matching punc_remove_patern are replaced with punc_replace. Default = TRUE.

iconv_to

passed to iconv as the to argument when remove_accents is TRUE. Default = "ASCII//TRANSLIT".

punc_remove_patern

regular expression matching the punctuation to remove when remove_punctuation is TRUE. Default = "[^[:alnum:][:cntrl:][:space:]_]".

punc_replace

string that replaces each match of punc_remove_patern when remove_punctuation is TRUE. Default = " ".

new_token_wrapper

string. Placed on both sides of the new token. Default = " ".
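To illustrate how the three-column replacement table (rep) and new_token_wrapper might fit together, here is a hypothetical sketch; the column names (pattern, replacement, token_type) and the helper apply_replacements are assumptions, not the package's actual schema:

```r
# Hypothetical replacement table; the column names are assumed.
rep_tbl <- data.frame(
  pattern     = c("inc", "&"),
  replacement = c("incorporated", "and"),
  token_type  = "company_name"
)

# Apply each search/replace pair, padding the new token with the wrapper
# so it stays delimited from neighbouring tokens.
apply_replacements <- function(x, rep, new_token_wrapper = " ") {
  for (i in seq_len(nrow(rep))) {
    x <- gsub(rep$pattern[i],
              paste0(new_token_wrapper, rep$replacement[i], new_token_wrapper),
              x, fixed = TRUE)
  }
  x
}

apply_replacements("at&t", rep_tbl)
```

Wrapping each replacement in new_token_wrapper keeps the substituted token separated from adjacent text before tokenization.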

Examples

c('Z.Y. do things inc', 'z. y. DO things montrèal', 'at&t') |> clean_str(token_type = 'company_name')


csps-efpc/TokenLink documentation built on Feb. 10, 2023, 3:30 a.m.