llama_generate_batch: Generate completions for multiple prompts in parallel

View source: R/llama.R

llama_generate_batch    R Documentation

Generate completions for multiple prompts in parallel

Description

Runs continuous batching: all prompts share one decode loop, and each iteration dispatches a single matmul over every sequence that is still running. On the GPU this turns decoding from memory-bound vector operations into compute-bound matrix operations, which raises throughput compared with calling llama_generate once per prompt in a loop.

Usage

llama_generate_batch(
  ctx,
  prompts,
  max_new_tokens = 256L,
  temp = 0.8,
  top_k = 50L,
  top_p = 0.9,
  seed = 42L,
  min_p = 0,
  typical_p = 1,
  repeat_penalty = 1,
  repeat_last_n = 64L,
  frequency_penalty = 0,
  presence_penalty = 0,
  grammar = NULL
)

Arguments

ctx

Context handle returned by llama_new_context, created with sufficient n_seq_max and n_ctx (see Details).

prompts

Character vector of prompts, one per parallel sequence.

max_new_tokens, temp, top_k, top_p, seed, min_p, typical_p, repeat_penalty, repeat_last_n, frequency_penalty, presence_penalty, grammar

Sampling parameters; see llama_generate. The same values are shared across all sequences, except that seed is offset per sequence (seed + s, where s is the sequence index).

Details

The context must be created with n_seq_max >= length(prompts) and n_ctx large enough to hold every prompt plus its generated tokens simultaneously. As a rule of thumb: n_ctx >= sum(prompt_lengths) + length(prompts) * max_new_tokens.
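The rule of thumb can be sketched as a small helper. This is only an illustration: required_n_ctx is a hypothetical name, not part of llamaR, and it assumes the prompt lengths in tokens are already known (for example, from the model's tokenizer). The lengths used below are made up.

```r
# Hypothetical helper (not part of llamaR): lower bound on n_ctx from the
# rule of thumb  n_ctx >= sum(prompt_lengths) + length(prompts) * max_new_tokens.
# prompt_lengths: prompt lengths in tokens, one entry per prompt.
required_n_ctx <- function(prompt_lengths, max_new_tokens = 256L) {
  stopifnot(is.numeric(prompt_lengths), all(prompt_lengths >= 0),
            max_new_tokens >= 0)
  as.integer(sum(prompt_lengths) + length(prompt_lengths) * max_new_tokens)
}

# Four prompts of 12, 11, 10 and 13 tokens (made-up lengths), 16 new tokens each
required_n_ctx(c(12L, 11L, 10L, 13L), max_new_tokens = 16L)  # 110
```

Rounding the result up to a convenient size when creating the context leaves some headroom if the token counts are only estimates.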

Each sequence gets its own sampler chain seeded with seed + seq_index, so identical prompts still produce diverse outputs at temp > 0 (useful for self-consistency sampling). Sampler hyperparameters are shared across sequences in this version.
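For self-consistency sampling, one common way to use the diverse outputs is a majority vote over the sampled answers. The sketch below runs on mock data shaped like the documented return value, so it needs no model; majority_answer is a hypothetical helper, and the normalisation step (trimming whitespace and trailing punctuation) is an assumption that suits short answers only.

```r
# Mock outputs shaped like the documented return value; with a real model
# these would come from llama_generate_batch(ctx, rep(prompt, 4L), temp = 0.7).
samples <- list(
  list(text = " 4",  n_tokens = 1L, finished_reason = "eos"),
  list(text = " 4.", n_tokens = 2L, finished_reason = "eos"),
  list(text = " 5",  n_tokens = 1L, finished_reason = "eos"),
  list(text = "4",   n_tokens = 1L, finished_reason = "eos")
)

# Hypothetical helper: normalise each answer, then return the most frequent one.
majority_answer <- function(samples) {
  answers <- vapply(samples, `[[`, character(1), "text")
  # Strip surrounding whitespace, then any trailing spaces or punctuation
  answers <- gsub("[[:space:][:punct:]]+$", "", trimws(answers))
  votes <- table(answers)
  names(votes)[which.max(votes)]
}

majority_answer(samples)  # "4"
```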

Each sequence stops independently, either when the model emits its end-of-generation token or when max_new_tokens is reached. Mirostat and with_timings are not supported here yet; use llama_generate for those.

Value

A list with one element per prompt (length(prompts) elements in total), in the same order as the input. Each element is a list with fields:

  • text: character scalar with the generated text

  • n_tokens: integer count of tokens generated

  • finished_reason: "eos" or "max_tokens"
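A minimal sketch of post-processing this structure, using mock data in place of a real call. It assumes n_tokens is stored as an R integer, as the field description suggests; the texts and counts are invented for illustration.

```r
# Mock result shaped like the documented return value (made-up contents).
out <- list(
  list(text = "positive", n_tokens = 2L, finished_reason = "eos"),
  list(text = "negative", n_tokens = 2L, finished_reason = "eos"),
  list(text = "It is hard to say whether", n_tokens = 16L,
       finished_reason = "max_tokens")
)

# Flatten the per-sequence lists into one row per prompt.
# integer(1) assumes n_tokens is an integer scalar.
res <- data.frame(
  text            = vapply(out, `[[`, character(1), "text"),
  n_tokens        = vapply(out, `[[`, integer(1), "n_tokens"),
  finished_reason = vapply(out, `[[`, character(1), "finished_reason"),
  stringsAsFactors = FALSE
)

# Indices of prompts whose generation was cut off by max_new_tokens
truncated <- which(res$finished_reason == "max_tokens")
truncated  # 3
```

Prompts listed in truncated are candidates for a rerun with a larger max_new_tokens, provided n_ctx still has room.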

Examples

## Not run: 
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)
# 4 parallel sequences, up to 256 new tokens each
ctx <- llama_new_context(model, n_ctx = 4096L, n_seq_max = 4L,
                         flash_attn = "on")

# Batch classification
prompts <- c("Classify: 'great movie' as positive/negative.",
             "Classify: 'awful service' as positive/negative.",
             "Classify: 'just okay' as positive/negative.",
             "Classify: 'loved every minute' as positive/negative.")
out <- llama_generate_batch(ctx, prompts, max_new_tokens = 16L, temp = 0)
vapply(out, `[[`, character(1), "text")

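# Self-consistency sampling: the same prompt repeated. Each sequence gets its
# own seed (seed + sequence index), so the four samples can differ at temp > 0.
samples <- llama_generate_batch(ctx, rep("2 + 2 =", 4L),
                                max_new_tokens = 8L, temp = 0.7)
vapply(samples, `[[`, character(1), "text")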

## End(Not run)

llamaR documentation built on May 28, 2026, 1:06 a.m.