View source: R/utils_transformer.R
calc_tokenizer_statistics

Description
Function for estimating the tokenizer statistics described by Kaya & Tantuğ (2024).
Usage

calc_tokenizer_statistics(
  dataset,
  step = "creation",
  statistics_max_tokens_length = 512L
)
Arguments

dataset
    Object of class datasets.arrow_dataset.Dataset. The data set must contain a column

step

statistics_max_tokens_length
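A minimal call sketch, not taken from this page: it assumes the function belongs to the aifeducation package (suggested by the source file R/utils_transformer.R, hence the ::: access) and that the required column is named "text"; both names are assumptions. The Dataset is built through reticulate and the Python datasets library.

# Minimal sketch under the assumptions stated above
library(reticulate)

datasets <- import("datasets")

# Hypothetical toy corpus with two documents stored in a column named "text"
toy_dataset <- datasets$Dataset$from_dict(
  list(text = c("A first short document.", "A second, slightly longer document."))
)

stats <- aifeducation:::calc_tokenizer_statistics(
  dataset = toy_dataset,
  step = "creation",
  statistics_max_tokens_length = 512L
)

stats$mu_g  # mean number of tokens per word

Only step = "creation" is shown, since it is the default given in the usage above; other values of step are not described on this page.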
Value

Returns a list with the following entries:

n_sequences: Number of sequences
n_words: Number of words in the whole corpus
n_tokens: Number of tokens in the whole corpus
mu_t: n_tokens / n_sequences (mean number of tokens per sequence)
mu_w: n_words / n_sequences (mean number of words per sequence)
mu_g: n_tokens / n_words (mean number of tokens per word)
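The three ratios are plain arithmetic on the corpus-level counts; a short worked example with hypothetical counts:

# Hypothetical corpus counts, chosen only to illustrate the ratios
n_sequences <- 100
n_words     <- 2000
n_tokens    <- 3000

mu_t <- n_tokens / n_sequences  # 30 tokens per sequence on average
mu_w <- n_words  / n_sequences  # 20 words per sequence on average
mu_g <- n_tokens / n_words      # 1.5 tokens per word on average

A mu_g close to 1 means the tokenizer keeps most words as single tokens, while larger values mean words are split into more sub-word tokens.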
References

Kaya, Y. B., & Tantuğ, A. C. (2024). Effect of tokenization granularity for Turkish large language models. Intelligent Systems with Applications, 21, 200335. https://doi.org/10.1016/j.iswa.2024.200335