tokenizer_basic: tokenizes a column in a dataframe

View source: R/tokenify.R

tokenizer_basic    R Documentation

tokenizes a column in a dataframe

Description

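Tokenizes a column in a data frame. Strings are cleaned before and after tokenization, and additional arguments are passed on to the cleaning function and to tidytext::unnest_tokens.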

Usage

tokenizer_basic(
  dat,
  ...,
  col_nm,
  row_name_nm,
  token_type = col_nm,
  token_col_nm = "token",
  drop_col = TRUE,
  token_index = "",
  pre_token_clean_str = clean_str,
  post_token_clean_Str = clean_str_2
)

Arguments

dat

dataframe. No default.

...

Additional arguments passed to both clean_str and tidytext::unnest_tokens.

col_nm

String. Name of the column to tokenize.

row_name_nm

String. Name of the column in which to store the row name.

token_type

String. The type of token for the given column. Default is col_nm.

token_col_nm

String. Name of the new column that holds the tokens. Default "token".

drop_col

Boolean. If TRUE, drops the original column. Default TRUE.

token_index

String. Name of the column that will hold the position of each token within the original column. Default "".

pre_token_clean_str

Function that takes a vector of strings and ..., and returns the cleaned strings. Applied before tokenization. Default clean_str.

post_token_clean_Str

Function that takes a vector of strings and ..., and returns the cleaned strings. Applied after tokenization. Default clean_str_2.

Examples

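# Read the example data over the network
dat_ceo <- readr::read_csv('https://tinyurl.com/2p8etjr6')

# Tokenize the exec_fullname column, keeping the original column alongside the tokens
dat_ceo |>
  tokenizer_basic(col_nm = 'exec_fullname', row_name_nm = 'rn', drop_col = FALSE) |>
  dplyr::select_at(c('token', 'exec_fullname'))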
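
For readers without network access, the sketch below shows roughly what word tokenization of a column looks like when tidytext::unnest_tokens is called directly. It is not the package's implementation. The toy data, the token_index column name, and the choice of token = "words" are assumptions made for illustration, and the cleaning steps performed by clean_str and clean_str_2 are omitted.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data; not the data set used in the example above.
dat <- tibble::tibble(exec_fullname = c("Jane Q. Doe", "John Smith"))

tokens <- dat |>
  # Keep a row identifier so each token can be traced back to its source row.
  mutate(rn = row_number()) |>
  # One output row per word; drop = FALSE keeps the original column.
  unnest_tokens(output = token, input = exec_fullname,
                token = "words", drop = FALSE) |>
  # Position of each token within its source row (assumed meaning of token_index).
  group_by(rn) |>
  mutate(token_index = row_number()) |>
  ungroup()

print(tokens)
```

With the default word tokenizer, tokens are lower-cased and punctuation is stripped, so "Jane Q. Doe" yields jane, q, and doe.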



csps-efpc/TokenLink documentation built on Feb. 10, 2023, 3:30 a.m.