tok provides bindings to the 🤗tokenizers library. It uses the same Rust library that powers the Python implementation.
We don't yet provide the full tokenizers API. Please open an issue if there's a feature you are missing.
You can install tok from CRAN using:
install.packages("tok")
Installing tok from source requires a working Rust toolchain. We recommend using rustup.
On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:
rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
Once Rust is working, you can install this package via:
remotes::install_github("dfalbel/tok")
tok can be used to load and use tokenizers that have been previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.
To load a pre-trained tokenizer from a json file, use:
path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)
Use the encode method to tokenize sentences, and the decode method to transform ids back into text.
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
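Besides ids, the encoding object returned by encode also exposes the token strings, which is handy for inspecting how a sentence was split. A quick sketch, assuming the tokens field mirrors the 🤗tokenizers Encoding object:

```r
# Tokenize a sentence and inspect the pieces.
# `enc$ids` holds the vocabulary ids used by decode above;
# `enc$tokens` (assumed field) holds the corresponding token strings.
enc <- tok$encode("hello world")
enc$ids
enc$tokens
```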
You can also load any tokenizer available on the HuggingFace Hub using the from_pretrained static method. For example, let’s load the GPT-2 tokenizer with:
tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
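With a real pretrained vocabulary, the round trip also makes subword behavior visible: GPT-2 uses byte-level BPE, so rarer words are usually split into several pieces while decode still reconstructs the original string. A sketch, assuming the tokens field is available on the encoding:

```r
# GPT-2's byte-level BPE may split a rare word into subword pieces.
enc <- tok$encode("tokenization")
enc$tokens           # token strings (assumed field); likely several pieces
tok$decode(enc$ids)  # decoding the ids reconstructs the input text
```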