tokenizer_basic: tokenizes a column in a dataframe
In csps-efpc/TokenLink: Joins two dataframes using tokens or like words

tokenizer_basic

R Documentation

tokenizes a column in a dataframe

Description

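Tokenizes a column in a dataframe. Tokens are produced with tidytext::unnest_tokens, with one cleaning function applied before tokenization and another applied after it.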

Usage

tokenizer_basic(
  dat,
  ...,
  col_nm,
  row_name_nm,
  token_type = col_nm,
  token_col_nm = "token",
  drop_col = TRUE,
  token_index = "",
  pre_token_clean_str = clean_str,
  post_token_clean_Str = clean_str_2
)

Arguments

`dat`	Dataframe. No default.
`...`	Passed to both clean_str and tidytext::unnest_tokens.
`col_nm`	String. Name of the column to tokenize.
`row_name_nm`	String. Name of the column in which to store the row name.
`token_type`	String. The type of token for the given column. Default is col_nm.
`token_col_nm`	String. Name of the new column of tokens. Default is "token".
`drop_col`	Logical. If TRUE, drops the original column. Default is TRUE.
`token_index`	String. Name of the column that will hold the position of each token within the original column. Default is "".
`pre_token_clean_str`	Function that takes a vector of strings and ..., and returns the cleaned strings. Applied before tokenization. Default is clean_str.
`post_token_clean_Str`	Function that takes a vector of strings and ..., and returns the cleaned strings. Applied after tokenization. Default is clean_str_2.
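
To make the roles of these arguments concrete, here is a minimal base-R sketch of the "clean, split into tokens, clean again" flow they describe. It is not the TokenLink implementation: sketch_tokenize, pre_clean and post_clean are hypothetical names, the two cleaners are simple stand-ins for clean_str and clean_str_2 (whose actual behavior is not documented here), splitting on whitespace stands in for tidytext::unnest_tokens, and the ... and token_type arguments are left out.

```r
# Hypothetical stand-in, NOT the TokenLink implementation: a base-R sketch of
# "clean, split into tokens, clean again" for one column of a data frame.
sketch_tokenize <- function(dat, col_nm, row_name_nm,
                            token_col_nm = "token",
                            token_index = "",
                            drop_col = TRUE,
                            pre_clean = function(x) trimws(x),
                            post_clean = function(x) gsub("[[:punct:]]", "", x)) {
  # Record the source row so each token can be traced back to it
  dat[[row_name_nm]] <- seq_len(nrow(dat))

  # Clean before tokenization, then split on runs of whitespace
  cleaned <- pre_clean(as.character(dat[[col_nm]]))
  pieces <- strsplit(cleaned, "[[:space:]]+")
  n <- lengths(pieces)

  # One output row per token, repeating the other columns of the source row
  out <- dat[rep(seq_len(nrow(dat)), times = n), , drop = FALSE]
  rownames(out) <- NULL

  # Clean again after tokenization (here: strip punctuation from each token)
  out[[token_col_nm]] <- post_clean(unlist(pieces))

  # Optionally record each token's position within its original string
  if (nzchar(token_index)) {
    out[[token_index]] <- sequence(n)
  }

  # Optionally drop the column that was tokenized
  if (drop_col) {
    out[[col_nm]] <- NULL
  }
  out
}

# Made-up input data, used only for illustration
dat <- data.frame(exec_fullname = c("Mary T. Barra", "Tim Cook"),
                  stringsAsFactors = FALSE)
res <- sketch_tokenize(dat, col_nm = "exec_fullname", row_name_nm = "rn",
                       token_index = "token_idx", drop_col = FALSE)
print(res)
```

In this sketch each input row yields one output row per token, with rn pointing back to the source row and token_idx giving the token's position. The real function may differ in details such as case handling and how an empty token_index is treated.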

Examples

dat_ceo <- readr::read_csv('https://tinyurl.com/2p8etjr6')
dat_ceo |>
  tokenizer_basic(col_nm = 'exec_fullname', row_name_nm = 'rn', drop_col = FALSE) |>
  dplyr::select_at(c('token', 'exec_fullname'))

csps-efpc/TokenLink documentation built on Feb. 10, 2023, 3:30 a.m.