batchLLM: Batch Process LLM Text Completions Using a Data Frame

batchLLM    R Documentation

Description

Batch process large language model (LLM) text completions by looping over the rows of a data frame column. The package currently supports OpenAI's GPT, Anthropic's Claude, and Google's Gemini models, with built-in delays between batches to respect API rate limits. Features include automatic logging of batches and metadata to local files, side-by-side comparison of outputs from different LLMs, and a Shiny App Addin. Use cases include natural language processing tasks such as sentiment analysis, thematic analysis, classification, labeling or tagging, and language translation.

Usage

batchLLM(
  df,
  df_name = NULL,
  col,
  prompt,
  LLM = "openai",
  model = "gpt-4o-mini",
  temperature = 0.5,
  max_tokens = 500,
  batch_delay = "random",
  batch_size = 10,
  case_convert = NULL,
  sanitize = FALSE,
  attempts = 1,
  log_name = "batchLLM-log",
  hash_algo = "crc32c",
  ...
)

Arguments

df

A data frame that contains the input data.

df_name

An optional string specifying the name of the data frame to log. This is particularly useful in Shiny applications or when the data frame is passed programmatically rather than explicitly. Default is NULL.

col

The name of the column in the data frame to process.

prompt

A system prompt for the LLM.

LLM

A string naming the LLM provider. Options are "openai", "anthropic", and "google". Default is "openai".

model

A string for the name of the model from the LLM. Default is "gpt-4o-mini".

temperature

The sampling temperature for the LLM. Default is 0.5.

max_tokens

The maximum number of tokens to generate before stopping. Default is 500.

batch_delay

A string for the delay between batches. Options are "random", or a number followed by "min" or "sec" (for example, "1min" or "30sec"). Default is "random", which averaged 10.86 seconds across 1,000 simulations.
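
To make the "min" and "sec" formats concrete, here is a minimal sketch of how such a string could be turned into a number of seconds. The helper parse_batch_delay() is hypothetical and is not part of batchLLM; the package's own parsing may differ. The "random" option is left out because its distribution is not described here.

```r
# Hypothetical helper (not part of batchLLM): convert "1min" or "30sec"
# into a number of seconds.
parse_batch_delay <- function(batch_delay) {
  pattern <- "^([0-9]+\\.?[0-9]*)(min|sec)$"
  m <- regmatches(batch_delay, regexec(pattern, batch_delay))[[1]]
  # A successful match returns the full match plus two capture groups.
  if (length(m) != 3) {
    stop("Expected a value such as \"1min\" or \"30sec\".")
  }
  value <- as.numeric(m[2])
  if (m[3] == "min") value * 60 else value
}

parse_batch_delay("1min")   # 60
parse_batch_delay("30sec")  # 30
```

A value in seconds like this could then be passed to Sys.sleep() between batches.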

batch_size

The number of rows to process in each batch. Default is 10.
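
As an illustration of what processing rows in batches means, the sketch below splits the row indices of a small data frame into groups of batch_size using base R. The data frame is made up for the example, and this is not necessarily how batchLLM partitions rows internally.

```r
# Made-up input data: 25 rows to be processed in batches of 10.
df <- data.frame(statement = paste("statement", 1:25))
batch_size <- 10

# Assign each row index to a batch number, then group the indices.
row_ids <- seq_len(nrow(df))
batches <- split(row_ids, ceiling(row_ids / batch_size))

# Number of rows in each batch: 10, 10, and 5.
lengths(batches)
```

The final batch is smaller whenever the number of rows is not a multiple of batch_size.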

case_convert

A string for the case conversion of the output with the options: "upper", "lower", or NULL (no change). Default is NULL.

sanitize

A logical. If TRUE, extract the text completion from the model's response by returning only the content inside <result> XML tags, and remove all punctuation. This prevents unwanted text (e.g., a preamble) or punctuation from appearing in the output. Default is FALSE.
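
The effect described above can be approximated with base R regular expressions. The function sanitize_completion() below is a hypothetical stand-in, not the package's implementation, and the sample response is invented for the example.

```r
# Hypothetical stand-in for the sanitize step (not batchLLM's own code).
sanitize_completion <- function(text) {
  # Keep only the content between the first <result> and </result> tags.
  # If no tags are present, the text is passed through unchanged.
  inner <- sub("(?s).*?<result>(.*?)</result>.*", "\\1", text, perl = TRUE)
  # Remove punctuation, then trim surrounding whitespace.
  trimws(gsub("[[:punct:]]", "", inner))
}

raw <- "Sure! Here is the label: <result>Misinformation.</result>"
sanitize_completion(raw)  # "Misinformation"
```

Combined with case_convert = "lower", output cleaned this way is easier to compare across models.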

attempts

The maximum number of loop retry attempts. Default is 1.
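
For readers unfamiliar with retry limits, the sketch below shows one common pattern in base R: call a function up to a fixed number of times and stop at the first success. Both with_retries() and flaky_call() are hypothetical names used only for illustration; batchLLM's internal retry logic may be structured differently.

```r
# Hypothetical retry wrapper (not part of batchLLM).
with_retries <- function(fn, attempts = 1) {
  for (i in seq_len(attempts)) {
    # Capture an error as a value instead of aborting.
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result))
  }
  stop("All ", attempts, " attempt(s) failed.")
}

# Stand-in for an API call that fails twice before succeeding.
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}

result <- with_retries(flaky_call, attempts = 3)
result  # "ok"
```

With the default of attempts = 1, a pattern like this makes a single call and reports the failure without retrying.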

log_name

A string for the name of the log without the .rds file extension. Default is "batchLLM-log".

hash_algo

A string for a hashing algorithm from the 'digest' package. Default is "crc32c".

...

Additional arguments to pass on to the LLM API function.

Value

Returns the input data frame with an additional column containing the text completion output. The function also writes the output and metadata to the log file after each batch in a nested list format.

Examples

## Not run: 
library(batchLLM)

# Set API keys
Sys.setenv(OPENAI_API_KEY = "your_openai_api_key")
Sys.setenv(ANTHROPIC_API_KEY = "your_anthropic_api_key")
Sys.setenv(GEMINI_API_KEY = "your_gemini_api_key")

# Define LLM configurations
llm_configs <- list(
  list(LLM = "openai", model = "gpt-4o-mini"),
  list(LLM = "anthropic", model = "claude-3-haiku-20240307"),
  list(LLM = "google", model = "1.5-flash")
)

# Apply batchLLM function to each configuration
beliefs <- lapply(llm_configs, function(config) {
  batchLLM(
    df = beliefs,
    col = statement,
    prompt = "classify as a fact or misinformation in one word",
    LLM = config$LLM,
    model = config$model,
    batch_size = 10,
    batch_delay = "1min",
    case_convert = "lower"
  )
})[[length(llm_configs)]]

# Print the updated data frame
print(beliefs)

## End(Not run)

batchLLM documentation built on Oct. 14, 2024, 5:09 p.m.