knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Aside from wrapping Hugging Face's transformers library, huggingfaceR also wraps the sentence_transformers library, giving access to state-of-the-art (SOTA) document embeddings.
library(huggingfaceR)
library(dplyr)
library(stringr)
It's incredibly simple to use sentence_transformers in huggingfaceR. You just need one function, hf_load_sentence_model(), and a model_id. To begin with, we'll use "paraphrase-MiniLM-L6-v2" as our model:
model_id <- "paraphrase-MiniLM-L6-v2" minilm_l6 <- hf_load_sentence_model(model_id)
The first, and for many the only, thing a user will want to do is feed in some document(s) and receive the embedding(s). We can do that by using R's $ syntax to access our sentence transformer's encode() class method - if you're unfamiliar with OOP or Python, just think of class methods as functions.
But first, we'll need a document:
document <- c("Many people think Lionel Messi is the greatest footballer to have ever played the Beautiful Game.")

embedding <- minilm_l6$encode(document)
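If you want a quick sanity check of what came back, you can inspect the length of the returned vector:

length(embedding)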
Calling our model on one document returns a vector of length 384. We can tidy it into a tibble by first transposing it:
embedding %>% t() %>% as_tibble()
Most likely we're not going to want to embed just one document, but many. Let's use stringr::sentences as a test run. There are 720 sentences, which we'll embed in one go. Because we're embedding a number of sentences, we'll set show_progress_bar = TRUE, and we'll also change the batch_size to 64L - although the default setting of 32L would be fine too.
start <- proc.time()
sentences_embeddings <- minilm_l6$encode(stringr::sentences, show_progress_bar = TRUE, batch_size = 64L)
proc.time() - start
The process took about 2 seconds from start to finish - not bad! We can tidy up our results just as before:
sentences_embeddings %>% t() %>% as_tibble()
However, when dealing with more and longer documents, or when using a larger model, the process can take significantly longer. In such cases, it's a good idea to use a GPU if you have access to one. The author is currently using a MacBook with an M1 chip. There is ongoing work by the PyTorch team on the MPS backend for Apple Silicon; at present some models can be sent to the GPU and some cannot - most Hugging Face models require the aten::cumsum.out operator, for which the MPS backend integration has not yet been written. The process ought to be similar, and simpler, with an NVIDIA/CUDA GPU.
You will need to have the appropriate torch version installed for your GPU.
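If you're unsure whether your torch installation can see an accelerated backend, you can check from R via reticulate. These are standard PyTorch attributes rather than huggingfaceR functions (and the MPS check requires a reasonably recent PyTorch):

reticulate::py_run_string("import torch; print(torch.cuda.is_available())")         # NVIDIA/CUDA
reticulate::py_run_string("import torch; print(torch.backends.mps.is_available())") # Apple Silicon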
reticulate::py_run_string("import torch")

device <- hf_set_device()

minilm_gpu <- hf_load_sentence_model(model_id)
minilm_gpu$to(device)
Now we're ready to call our model with GPU acceleration:
sentences_embeddings_gpu <- minilm_gpu$encode(stringr::sentences, device = device, show_progress_bar = TRUE, batch_size = 64L)
You'll notice that it actually takes longer with the GPU - that's because of the set-up costs involved. With a larger dataset or model you should expect to see at least a 3x speed-up when using an M1 chip, and more with an NVIDIA/CUDA GPU (depending on hardware, of course).
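As a quick sanity check (a sketch using only base R), you can confirm that the GPU run produced essentially the same embeddings as the CPU run - expect a tiny maximum difference from floating point noise rather than exact zeros:

# Maximum absolute difference between the CPU and GPU embedding matrices
max(abs(sentences_embeddings - sentences_embeddings_gpu))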
Let's say you wanted a different model from "paraphrase-MiniLM-L6-v2". You could use huggingfaceR::models_with_downloads to select a model based on downloads:
models_with_downloads %>% filter(task == "sentence-similarity")
model_id <- "sentence-transformers/stsb-distilbert-base" st_distilbert <- hf_load_sentence_model(model_id)
And you could extract embeddings using the same methods as before. Note that if you're using RStudio, you can press Tab when calling model$encode() to see the available arguments.
st_embeddings <- st_distilbert$encode(stringr::sentences, show_progress_bar = TRUE)
Voila, for most use cases you should be covered.
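As one illustration of what you might do next (a sketch - the query sentence and the cosine_sim helper below are made up for this example, not part of huggingfaceR), you can use cosine similarity to find the sentence in stringr::sentences that is closest in meaning to a new query:

# A made-up query sentence for illustration
query <- "The boat sailed slowly across the calm lake."
query_embedding <- st_distilbert$encode(query)

# Simple cosine similarity helper
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity between the query and each of the 720 sentence embeddings
similarities <- apply(st_embeddings, 1, cosine_sim, b = query_embedding)

# The most semantically similar sentence
stringr::sentences[which.max(similarities)]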
Let's take a step back: we can use R's familiar $ syntax to access our model's class methods and config.
names(minilm_l6)
Our original model outputs 384-dimensional embeddings and accepts up to 128 tokens:
c(minilm_l6$get_sentence_embedding_dimension(), minilm_l6$max_seq_length)
Our next model outputs 768-dimensional embeddings and also accepts up to 128 tokens:
c(st_distilbert$get_sentence_embedding_dimension(), st_distilbert$max_seq_length)
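Documents longer than the limit are truncated. If you need longer inputs, sentence_transformers lets you raise the limit by assigning to max_seq_length - this is the upstream Python API rather than a huggingfaceR function, and the underlying transformer's own maximum (typically 512 for BERT-style models) still applies, so treat this as a sketch to verify for your model:

# Raise the truncation limit for longer documents
st_distilbert$max_seq_length <- 256L
st_distilbert$max_seq_length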
There's a lot you can do and look at - for example, you can get your model's tokenizer's full vocabulary:
model_vocab <- st_distilbert$tokenizer$vocab

model_vocab_df <- tibble(
  token = names(model_vocab),
  id = unlist(model_vocab, recursive = FALSE)
)

model_vocab_df
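For example (purely illustrative), you could check the vocabulary size or look up a particular token - common whole words usually have their own entry, while rarer words are split into sub-word pieces:

# How many tokens are in the vocabulary?
nrow(model_vocab_df)

# Look up a single common word
model_vocab_df %>%
  filter(token == "football")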
The possibilities are (almost) endless.