wordpiece.data: Data for Wordpiece-Style Tokenization

Provides vocabulary data used by the wordpiece algorithm to tokenize text into somewhat meaningful chunks. The included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
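To illustrate what these vocabularies are for: wordpiece tokenization greedily matches the longest vocabulary entry at each position, marking word-internal pieces with a "##" prefix. A minimal sketch, using a toy vocabulary rather than the packaged BERT vocabularies (the actual tokenizer lives in the companion wordpiece package, not here):

```r
# Greedy longest-match-first wordpiece tokenization of a single word.
# `vocab` is a character vector of pieces; continuation pieces start "##".
tokenize_word <- function(word, vocab, unk = "[UNK]") {
  tokens <- character(0)
  start <- 1
  n <- nchar(word)
  while (start <= n) {
    piece <- NULL
    # Try the longest remaining substring first, shrinking until a match.
    for (end in seq(n, start)) {
      candidate <- substr(word, start, end)
      if (start > 1) candidate <- paste0("##", candidate)
      if (candidate %in% vocab) {
        piece <- candidate
        break
      }
    }
    # No piece matches: the whole word becomes the unknown token.
    if (is.null(piece)) return(unk)
    tokens <- c(tokens, piece)
    start <- end + 1
  }
  tokens
}

# Toy vocabulary (hypothetical; the real vocab.txt files have ~30,000 entries).
toy_vocab <- c("un", "##aff", "##able")
tokenize_word("unaffable", toy_vocab)
# "un" "##aff" "##able"
```

Because matching is greedy and longest-first, a richer vocabulary yields fewer, longer pieces for the same word.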


Package details

Author: Jonathan Bratt [aut] (<https://orcid.org/0000-0003-2859-0076>), Jon Harmon [aut, cre] (<https://orcid.org/0000-0003-4781-4346>), Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies)
Maintainer: Jon Harmon <jonthegeek@gmail.com>
License: Apache License (>= 2)
Version: 2.0.0
URL: https://github.com/macmillancontentscience/wordpiece.data
Package repository: CRAN
Installation

Install the latest version of this package by entering the following in R:
install.packages("wordpiece.data")
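Once installed, the packaged vocabularies can be loaded directly. A sketch of typical usage, assuming the package's exported accessor is named wordpiece_vocab() with a cased argument (check the package reference for the authoritative API):

```r
# Load the uncased BERT vocabulary shipped with the package.
# wordpiece_vocab() and its `cased` argument are assumed here;
# consult help(package = "wordpiece.data") for the actual exports.
library(wordpiece.data)
vocab <- wordpiece_vocab(cased = FALSE)
length(vocab)  # number of entries in the uncased vocabulary
```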


wordpiece.data documentation built on March 18, 2022, 7:26 p.m.