tok provides bindings to the 🤗tokenizers library. It uses the same Rust library that powers the Python implementation.
We don't yet provide the full tokenizers API. Please open an issue if there's a feature you are missing.
You can install tok from CRAN using:
install.packages("tok")
Installing tok from source requires a working Rust toolchain. We recommend using rustup.
On Windows, you’ll also have to add the i686-pc-windows-gnu and x86_64-pc-windows-gnu targets:
rustup target add x86_64-pc-windows-gnu
rustup target add i686-pc-windows-gnu
Once Rust is working, you can install this package via:
remotes::install_github("dfalbel/tok")
tok can be used to load and use tokenizers that have been previously serialized. For example, HuggingFace model weights are usually accompanied by a ‘tokenizer.json’ file that can be loaded with this library.
To load a pre-trained tokenizer from a json file, use:
path <- testthat::test_path("assets/tokenizer.json")
tok <- tok::tokenizer$from_file(path)
Use the encode method to tokenize sentences, and the decode method to transform ids back into text.
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
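Besides ids, the encoding object returned by encode also exposes the token strings, which is handy for inspecting how a sentence was split. A quick sketch, assuming the tokens field mirrors the 🤗tokenizers Encoding object:

```r
# Tokenize a sentence and inspect the pieces.
# `enc$ids` holds the vocabulary ids used by decode above;
# `enc$tokens` (assumed field) holds the corresponding token strings.
enc <- tok$encode("hello world")
enc$ids
enc$tokens
```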
You can also load any tokenizer available on the HuggingFace Hub using the from_pretrained static method. For example, let’s load the GPT-2 tokenizer with:
tok <- tok::tokenizer$from_pretrained("gpt2")
enc <- tok$encode("hello world")
tok$decode(enc$ids)
#> [1] "hello world"
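With a real pretrained vocabulary, the round trip also makes subword behavior visible: GPT-2 uses byte-level BPE, so rarer words are usually split into several pieces while decode still reconstructs the original string. A sketch, assuming the tokens field is available on the encoding:

```r
# GPT-2's byte-level BPE may split a rare word into subword pieces.
enc <- tok$encode("tokenization")
enc$tokens           # token strings (assumed field); likely several pieces
tok$decode(enc$ids)  # decoding the ids reconstructs the input text
```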