knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(tidyverse)
library(huggingfaceR)

Introduction to EZ Functions

huggingface models are trained to perform one of several tasks. For a list of tasks, you can run hf_list_tasks().

tasks <- hf_list_tasks()

tasks_len <- length(tasks)

next_multiple_of_3 <- tasks_len + (0:2)[(tasks_len + 0:2) %% 3 == 0]

tasks_tbl <- tasks %>% append(rep('', next_multiple_of_3 - tasks_len)) %>%  matrix(ncol = 3, byrow = TRUE) %>% as.data.frame()

knitr::kable(tasks_tbl, row.names = NA, col.names = NULL, caption = '**Huggingface Tasks**')

EZ functions are a simple way to get started with models for a given task. More advanced users familiar with huggingface models can load pipelines and models directly.

  1. Beginners --> hf_ez_{task}
  2. Intermediate Users --> hf_load_pipeline()
  3. Advanced Users --> hf_load_tokenizer() + hf_load_model() / hf_load_AutoModel_for_task()

So far huggingfaceR has ez functions for NLP tasks only.

tasks <- ls('package:huggingfaceR', pattern = "hf_ez_")

tasks_len <- length(tasks)

next_multiple_of_3 <- tasks_len + (0:2)[(tasks_len + 0:2) %% 3 == 0]

tasks_tbl <- tasks %>% append(rep('', next_multiple_of_3 - tasks_len)) %>%  matrix(ncol = 3, byrow = TRUE) %>% as.data.frame()

knitr::kable(tasks_tbl, row.names = NA, col.names = NULL, caption = '**Implemented EZ Functions**')

Zero Shot Sentiment Analysis

Let's say you'd like to automagically label the sentiment of a text as positive or negative. While you could use a text classification pre-trained model to do this, you can instead use a more generally useful zero-shot-classification model which can label text based on any labels you feed it. Then if you change your mind and decide that you'd like to change the labels to happy, ambivalent, and sad, you won't need to find a model explicitly trained on those sentiments!

For some sample data, we can search hugginface's datasets for a suitable set of texts.

datasets <- hf_list_datasets(pattern = "sentiment")

datasets

The sentiment140 dataset sounds like a good candidate. It consists of "Twitter messages with emoticons, which are used as noisy labels for sentiment classification". Let's load 50 rows from its training set, 25 with label 0 (negative) and 25 with label 4 (positive).

dataset <- hf_load_dataset(dataset = "sentiment140", split="train", label_conversion = NULL)

set.seed(1234)
(dataset <- dataset %>%
  slice_sample(n = 50))

We can use hf_ez_zero_shot_classification() to classify these texts as positive or negative and see how that matches up with the labels provided.

classifier <- hf_ez_zero_shot_classification(use_api = FALSE)

zero_shot_classifications <- 
  df %>%
  # Relabel sentiments with words
  mutate(sentiment = case_when(sentiment == 0 ~ 'negative', sentiment == 4 ~ 'positive'),
         results = map(text, function(txt){
           classifier$infer(string = txt, candidate_labels = c("positive sentiment", "negative sentiment")) %>% 
             select(-sequence) %>% 
             rename_with(.fn = ~ .x %>% str_remove(" sentiment"))
         })) %>% 
  unnest(results) %>%
  mutate(match = case_when((sentiment == 'positive' & positive < negative) | (sentiment == 'negative' & negative < positive) ~ FALSE, TRUE ~ TRUE))


knitr::kable(zero_shot_classifications %>% head(10), row.names = NA, caption = '**Zero Shot Classifications**', digits = 2)

The default model used by hf_ez_zero_shot_classification(), facebook/bart-large-mnli, does a great job of classifying these texts as being positive or negative (remember, it wasn't trained to do this explicitly!) with r zero_shot_classifications %>% count(match) %>% filter(match) %>% pull(n) of 50 matching the given label. Impressive!

Some of the misclassified texts look more ambivalent than positive or negative. We can easily check!

misclassified_texts <- 
  zero_shot_classifications %>%
  filter(!match) %>%
  # Relabel sentiments with words
  mutate(results = map(text, function(txt){
    classifier$infer(string = txt, candidate_labels = c("happy", "ambivalent", "sad")) %>% 
      select(-sequence) %>% 
      mutate(ambivalent = ambivalent > happy & ambivalent > sad)
  })) %>% 
  unnest(results) %>%
  select(-match, -happy, -sad)

kableExtra::kable(misclassified_texts, row.names = NA, caption = '**Zero Shot Misclassifications**', digits = 2)

It turns out that once ambivalent is added as an option, r misclassified_texts %>% filter(ambivalent) %>% nrow() of r misclassified_texts %>% nrow() misclassified texts are actually ambivalent. Note how we changed the labels to happy, ambivalent, and sad easily. If we were using traditional text classification we would have had to relabel the data and retrain the model. Zero shot classifiers are amazing!



farach/huggingfaceR documentation built on Feb. 4, 2023, 10:31 p.m.