<!-- Copyright 2019 Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->


RBERT is an implementation of Google Research's BERT in R.

BERT is a powerful general-purpose language model (introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al., which has a helpful accompanying blog post from Google AI). BERT is written in Python, using TensorFlow. An R package for TensorFlow already exists, so the goal of this project is to fully implement BERT in R down to the level of the TensorFlow API.

Generally speaking, there are three levels at which BERT could be used:

  1. Using the output of a pre-trained BERT model as features for a downstream model
  2. Fine-tuning on top of a pre-trained BERT model
  3. Training a BERT model from scratch

Currently, RBERT is functional at the first level, and possibly at the second (speed becomes a significant consideration at that level).

Getting started with RBERT

RBERT requires the tensorflow package to be installed and working. If that requirement is met, using RBERT at the first level is fairly straightforward.
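
If you still need to set these up, something like the following should work (a minimal sketch only; it assumes installation from the jonathanbratt/RBERT GitHub repository and a default TensorFlow install, so adjust to your environment):

# Sketch of a first-time setup; the repository reference and TensorFlow
# version may need adjusting for your system.
# install.packages(c("remotes", "tensorflow"))
# remotes::install_github("jonathanbratt/RBERT")
# tensorflow::install_tensorflow()
library(tensorflow)
tf$constant("hello")  # quick check that the TensorFlow backend loads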

library(RBERT)
library(dplyr)

# Download pre-trained BERT model. This will go to an appropriate cache
# directory by default.
BERT_PRETRAINED_DIR <- RBERT::download_BERT_checkpoint(
  model = "bert_base_uncased"
)

text_to_process <- c("Impulse is equal to the change in momentum.",
                     "Changing momentum requires an impulse.",
                     "An impulse is like a push.",
                     "Impulse is force times time.")

# Or make two-segment examples:
text_to_process2 <- list(c("Impulse is equal to the change in momentum.",
                           "Changing momentum requires an impulse."),
                         c("An impulse is like a push.",
                           "Impulse is force times time."))

BERT_feats <- extract_features(
  examples = text_to_process2,
  ckpt_dir = BERT_PRETRAINED_DIR,
  layer_indexes = 1:12
)
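
# BERT_feats$output is a long tibble with one row per token per layer,
# including columns such as sequence_index, token_index, token, layer_index,
# and the feature values in columns V1, V2, ... (one per hidden unit).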

# Extract the final layer output vector for the "[CLS]" token of the first
# sentence. 
output_vector1 <- BERT_feats$output %>%
  dplyr::filter(
    sequence_index == 1, 
    token == "[CLS]", 
    layer_index == 12
  ) %>% 
  dplyr::select(dplyr::starts_with("V")) %>% 
  unlist()
output_vector1

# Extract output vectors for all sentences...
# These vectors can be used as input features for downstream models.
# Convenience functions for doing this extraction will be added to the
# package in the near future.
output_vectors <- BERT_feats$output %>% 
  dplyr::filter(token_index == 1, layer_index == 12)
output_vectors
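
As a purely illustrative sketch of downstream use (the labels below are invented for the example), these vectors can be paired with labels to form a data frame ready for any standard modeling function:

# Illustrative only: attach a hypothetical label to each sequence's [CLS]
# vector, giving a feature table for a downstream classifier.
features <- output_vectors %>%
  dplyr::select(dplyr::starts_with("V"))
labels <- factor(c("definition", "example"))  # hypothetical, one per sequence
downstream_data <- dplyr::mutate(features, label = labels)
# downstream_data could now be passed to, e.g., glmnet, ranger, or keras.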

Other Functions

RBERT exports a couple of functions that may be helpful when exploring BERT.

# Both of the functions below require a vocabulary (or a checkpoint containing a
# vocab.txt file) to be specified.
BERT_PRETRAINED_DIR <- download_BERT_checkpoint("bert_base_uncased")

# `tokenize_text` is a quick way to see the wordpiece tokenization of some text.
tokens <- tokenize_text(text = "Who doesn't like tacos?",
                        ckpt_dir = BERT_PRETRAINED_DIR)
# [[1]]
#  [1] "[CLS]" "who"   "doesn" "'"     "t"     "like"  "ta"    "##cos"
#  [9] "?"     "[SEP]"

# `check_vocab` checks whether the given words are found in the vocabulary.
check_vocab(words = c("positron", "electron"), ckpt_dir = BERT_PRETRAINED_DIR)
# [1] FALSE  TRUE
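
Words that are missing from the vocabulary (like "positron" above) are not dropped; the wordpiece tokenizer splits them into smaller pieces. For example (the exact split depends on the vocab.txt shipped with the checkpoint):

# Out-of-vocabulary words are broken into wordpieces rather than discarded.
tokenize_text(text = "A positron is an antiparticle.",
              ckpt_dir = BERT_PRETRAINED_DIR)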

Future work

There's still a lot to do! Check out the issues board on the GitHub page.


