llama_generate_batch: Generate completions for multiple prompts in parallel
In llamaR: Interface for Large Language Models via 'llama.cpp'

Description

Runs continuous batching: all prompts share one decode loop, so each iteration runs a single batched forward pass over all still-running sequences. On the GPU this replaces the memory-bound matrix-vector products of single-sequence decoding with matrix-matrix products that make better use of compute, which raises throughput compared to calling llama_generate in a loop.

Usage

llama_generate_batch(
  ctx,
  prompts,
  max_new_tokens = 256L,
  temp = 0.8,
  top_k = 50L,
  top_p = 0.9,
  seed = 42L,
  min_p = 0,
  typical_p = 1,
  repeat_penalty = 1,
  repeat_last_n = 64L,
  frequency_penalty = 0,
  presence_penalty = 0,
  grammar = NULL
)

Arguments

`ctx`	Context handle returned by `llama_new_context`, created with sufficient `n_seq_max` and `n_ctx` (see Details).
`prompts`	Character vector of prompts, one per parallel sequence.
`max_new_tokens`, `temp`, `top_k`, `top_p`, `seed`, `min_p`, `typical_p`, `repeat_penalty`, `repeat_last_n`, `frequency_penalty`, `presence_penalty`, `grammar`	Sampling parameters; see `llama_generate`. The same values apply to every sequence, except that `seed` is offset per sequence (`seed + s`, where `s` is the sequence index).

Details

The context must be created with n_seq_max >= length(prompts) and n_ctx large enough to hold every prompt plus its generated tokens simultaneously. As a rule of thumb: n_ctx >= sum(prompt_lengths) + length(prompts) * max_new_tokens, where prompt_lengths are the token counts (not character counts) of the prompts.
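
The rule of thumb can be turned into a small pre-flight check. The sketch below is plain R and assumes the token count of each prompt is already known; `required_n_ctx` is a hypothetical helper for illustration, not a function exported by llamaR.

```r
# Hypothetical helper (not part of llamaR): lower bound on n_ctx from the
# rule of thumb n_ctx >= sum(prompt_lengths) + length(prompts) * max_new_tokens.
# `prompt_lengths` holds token counts, one per prompt.
required_n_ctx <- function(prompt_lengths, max_new_tokens = 256L) {
  stopifnot(is.numeric(prompt_lengths), length(prompt_lengths) > 0,
            all(prompt_lengths >= 0), max_new_tokens >= 0)
  sum(prompt_lengths) + length(prompt_lengths) * max_new_tokens
}

# Four prompts of 12, 15, 9 and 14 tokens, 256 new tokens each
prompt_lengths <- c(12L, 15L, 9L, 14L)
needed <- required_n_ctx(prompt_lengths, max_new_tokens = 256L)
needed           # 50 + 4 * 256 = 1074
needed <= 4096L  # TRUE: n_ctx = 4096 would be large enough
```

If the check fails, either raise n_ctx when creating the context or split the prompts into smaller batches.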

Each sequence gets its own sampler chain seeded with seed + seq_index, so identical prompts still produce diverse outputs at temp > 0 (useful for self-consistency sampling). Sampler hyperparameters are shared across sequences in this version.
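
For self-consistency sampling, the diverse outputs are typically reduced to one answer by majority vote. The following is a minimal sketch of that step, run on mocked results shaped like the documented return value; `majority_answer` is a hypothetical helper, not part of llamaR, and the mock texts are not real model output.

```r
# Hypothetical helper (not part of llamaR): majority vote over the `text`
# fields of a result list shaped like the return value of llama_generate_batch.
majority_answer <- function(results) {
  answers <- trimws(vapply(results, `[[`, character(1), "text"))
  counts <- table(answers)
  # which.max() returns the first maximum, so ties go to the answer
  # that sorts first
  names(counts)[which.max(counts)]
}

# Mocked results standing in for real model output
mock <- list(
  list(text = " 4", n_tokens = 1L, finished_reason = "eos"),
  list(text = "4",  n_tokens = 1L, finished_reason = "eos"),
  list(text = " 5", n_tokens = 1L, finished_reason = "eos"),
  list(text = "4 ", n_tokens = 1L, finished_reason = "eos")
)
majority_answer(mock)  # "4"
```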

Each sequence stops independently, either when the model emits its end-of-generation token or when max_new_tokens is reached. Mirostat sampling and with_timings are not yet supported here; use llama_generate for those.

Value

A list with one element per prompt (length(prompts) elements in total), in the same order as the input. Each element is a list with fields:

text: character scalar with the generated text
n_tokens: integer count of tokens generated
finished_reason: "eos" or "max_tokens"
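
Because the result is a list of lists, it is often convenient to flatten it into one row per prompt. The sketch below does this with base R on a mocked return value that carries the three documented fields; the texts and counts are made up for illustration.

```r
# Mocked return value with the documented fields (not real model output)
out <- list(
  list(text = "positive", n_tokens = 2L, finished_reason = "eos"),
  list(text = "negative, because", n_tokens = 16L,
       finished_reason = "max_tokens")
)

# Flatten the list of lists into a data frame, one row per prompt
res <- data.frame(
  text            = vapply(out, `[[`, character(1), "text"),
  n_tokens        = vapply(out, `[[`, integer(1), "n_tokens"),
  finished_reason = vapply(out, `[[`, character(1), "finished_reason"),
  stringsAsFactors = FALSE
)

# Indices of prompts whose output was cut off by max_new_tokens
truncated <- which(res$finished_reason == "max_tokens")
truncated  # 2
```

Rows with finished_reason equal to "max_tokens" may hold incomplete text and are candidates for a rerun with a larger max_new_tokens.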

Examples

## Not run: 
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)
# Context sized for 4 parallel sequences, up to 256 new tokens each
ctx <- llama_new_context(model, n_ctx = 4096L, n_seq_max = 4L,
                         flash_attn = "on")

# Batch classification
prompts <- c("Classify: 'great movie' as positive/negative.",
             "Classify: 'awful service' as positive/negative.",
             "Classify: 'just okay' as positive/negative.",
             "Classify: 'loved every minute' as positive/negative.")
out <- llama_generate_batch(ctx, prompts, max_new_tokens = 16L, temp = 0)
vapply(out, `[[`, character(1), "text")

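# Self-consistency sampling: same prompt repeated; each sequence gets its
# own seed offset, so the outputs can differ at temp > 0
samples <- llama_generate_batch(ctx, rep("2 + 2 =", 4L),
                                max_new_tokens = 8L, temp = 0.7)
vapply(samples, `[[`, character(1), "text")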

## End(Not run)

llamaR documentation built on May 28, 2026, 1:06 a.m.