check_vocab: Check Vocabulary

View source: R/tokenization.R

check_vocab R Documentation

Check Vocabulary

Description

Given a character vector of words and a wordpiece vocabulary, checks whether each word is in the vocabulary.

Usage

check_vocab(words, ckpt_dir = NULL, vocab_file = find_vocab(ckpt_dir))

Arguments

words

Character vector; words to check.

ckpt_dir

Character; path to checkpoint directory. If specified, any other checkpoint files required by this function (vocab_file, bert_config_file, or init_checkpoint) will default to standard filenames within ckpt_dir.

vocab_file

Character; path to the vocabulary file. The file is assumed to be a text file with one token per line, where the line number corresponds to the index of that token in the vocabulary.
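
To make the expected file layout concrete, the following base-R sketch writes a tiny made-up vocabulary to a temporary file and reads it back. The tokens and the names vocab_path, toy_vocab, and line_of are illustrative assumptions, not part of RBERT. Whether the package counts token indices from 0 or from 1 is not stated here, so the sketch only reports R's 1-based line positions.

```r
# Hypothetical toy vocabulary; a real BERT vocabulary file is far larger.
vocab_path <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[UNK]", "apple", "##s"), vocab_path)

# One token per line: the position in the vector is the line number in the file.
toy_vocab <- readLines(vocab_path)

# Named integer vector mapping each token to its line number.
line_of <- setNames(seq_along(toy_vocab), toy_vocab)

line_of[["apple"]]  # "apple" sits on line 3 of the toy file
```

A file in this shape could be passed as vocab_file instead of relying on the default derived from ckpt_dir.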

Value

A logical vector with one element per input word: TRUE if the corresponding word was found verbatim in the vocabulary, FALSE otherwise.
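
As a rough illustration of what a verbatim lookup means, the base-R sketch below tests each word against an in-memory vocabulary with exact string matching. It is not RBERT's implementation: toy_vocab is a hypothetical stand-in for the contents of a vocabulary file, and the sketch assumes a case-sensitive comparison with no wordpiece splitting, which may differ from how the package normalizes input.

```r
# Hypothetical stand-in for a vocabulary that has already been read into memory.
toy_vocab <- c("[UNK]", "apple", "##s", "app")
to_check <- c("apple", "appl", "Apple")

# Exact string matching: "appl" is not split into pieces, and "Apple" is not lowercased.
found <- to_check %in% toy_vocab
found  # TRUE FALSE FALSE
```

Under these assumptions, a word that could still be tokenized into known pieces is reported as FALSE, because only whole-token matches count.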

Examples

## Not run: 
BERT_PRETRAINED_DIR <- download_BERT_checkpoint("bert_base_uncased")
to_check <- c("apple", "appl")
check_vocab(words = to_check, ckpt_dir = BERT_PRETRAINED_DIR) # TRUE, FALSE

## End(Not run)

jonathanbratt/RBERT documentation built on Jan. 26, 2023, 4:15 p.m.