<!-- Copyright 2019 Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
RBERT is an implementation of Google Research's [BERT](https://github.com/google-research/bert) in R. BERT is a powerful general-purpose language model, introduced in the paper ["BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805) and covered in many helpful blog posts. BERT is written in Python, using TensorFlow. An R package for TensorFlow already exists, so the goal of this project is to fully implement BERT in R, down to the level of the TensorFlow API.
Generally speaking, there are three levels at which BERT could be used:

1. Use a pre-trained BERT model as a feature extractor, converting text into contextual vectors that can be fed into downstream models.
2. Fine-tune a pre-trained BERT model on a specific task.
3. Train a BERT model from scratch.

Currently, RBERT is functional at the first level, and possibly functional at the second level (speed becomes a significant consideration at this level).
RBERT requires the tensorflow package to be installed and working; a rough setup sketch is given below. Once that requirement is met, using RBERT at the first level is fairly straightforward.
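The following is a minimal sketch, not official installation instructions: the TensorFlow version pin and the GitHub repository path are assumptions, so check the project page for the current steps.

```r
# Install the R interface to TensorFlow from CRAN.
install.packages("tensorflow")

# Install a TensorFlow backend. RBERT targets the TensorFlow 1.x API;
# the exact version pinned here is an assumption.
tensorflow::install_tensorflow(version = "1.13.1")

# Install RBERT from GitHub (repository path assumed, not confirmed here).
remotes::install_github("jonathanbratt/RBERT")

# Confirm that TensorFlow is visible from R.
tensorflow::tf_config()
```

With TensorFlow and RBERT installed, feature extraction looks like this: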
```r
library(RBERT)
library(dplyr)

# Download pre-trained BERT model. This will go to an appropriate cache
# directory by default.
BERT_PRETRAINED_DIR <- RBERT::download_BERT_checkpoint(
  model = "bert_base_uncased"
)

text_to_process <- c("Impulse is equal to the change in momentum.",
                     "Changing momentum requires an impulse.",
                     "An impulse is like a push.",
                     "Impulse is force times time.")

# Or make two-segment examples:
text_to_process2 <- list(c("Impulse is equal to the change in momentum.",
                           "Changing momentum requires an impulse."),
                         c("An impulse is like a push.",
                           "Impulse is force times time."))

BERT_feats <- extract_features(
  examples = text_to_process2,
  ckpt_dir = BERT_PRETRAINED_DIR,
  layer_indexes = 1:12
)

# Extract the final layer output vector for the "[CLS]" token of the first
# sentence.
output_vector1 <- BERT_feats$output %>%
  dplyr::filter(
    sequence_index == 1,
    token == "[CLS]",
    layer_index == 12
  ) %>%
  dplyr::select(dplyr::starts_with("V")) %>%
  unlist()
output_vector1

# Extract output vectors for all sentences...
# These vectors can be used as input features for downstream models.
# Convenience functions for doing this extraction will be added to the
# package in the near future.
output_vectors <- BERT_feats$output %>%
  dplyr::filter(token_index == 1, layer_index == 12)
output_vectors
```
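As a quick illustration of the "downstream features" idea, here is a minimal sketch (assuming the `BERT_feats` object created above; the cosine-similarity code is illustrative and not part of RBERT) that compares the final-layer `[CLS]` vectors directly:

```r
# Pull the final-layer [CLS] vector for each sequence.
cls_vectors <- BERT_feats$output %>%
  dplyr::filter(token == "[CLS]", layer_index == 12) %>%
  dplyr::select(sequence_index, dplyr::starts_with("V"))

vec_mat <- as.matrix(cls_vectors[, -1])
rownames(vec_mat) <- cls_vectors$sequence_index

# Cosine similarity between each pair of [CLS] vectors.
norms <- sqrt(rowSums(vec_mat^2))
cosine_sim <- (vec_mat %*% t(vec_mat)) / (norms %o% norms)
round(cosine_sim, 3)
```

Sequences with related meanings should tend to have higher cosine similarity, though raw `[CLS]` vectors are only a rough sentence representation.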
RBERT exports a couple of functions that may be helpful when exploring BERT.
```r
# Both of the functions below require a vocabulary (or a checkpoint containing
# a vocab.txt file) to be specified.
BERT_PRETRAINED_DIR <- download_BERT_checkpoint("bert_base_uncased")

# `tokenize_text` is a quick way to see the wordpiece tokenization of some text.
tokens <- tokenize_text(text = "Who doesn't like tacos?",
                        ckpt_dir = BERT_PRETRAINED_DIR)
tokens
# [[1]]
#  [1] "[CLS]" "who"   "doesn" "'"     "t"     "like"  "ta"    "##cos"
#  [9] "?"     "[SEP]"

# `check_vocab` checks whether the given words are found in the vocabulary.
check_vocab(words = c("positron", "electron"), ckpt_dir = BERT_PRETRAINED_DIR)
# [1] FALSE  TRUE
```
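These two functions work well together: a word that fails `check_vocab` is not dropped, but broken into wordpieces by `tokenize_text`. A quick sketch (the wordpieces shown in the comment are a guess; the actual split depends on the checkpoint's vocab.txt):

```r
# "positron" is not in the vocabulary (check_vocab returned FALSE above),
# so the tokenizer falls back to subword pieces.
tokenize_text(text = "positron", ckpt_dir = BERT_PRETRAINED_DIR)
# Expect something like "[CLS]" "pos" "##it" "##ron" "[SEP]"; the exact
# pieces depend on the vocabulary.
```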
There's still a lot to do! Check out the issues board on the GitHub page.