| llama_generate_batch | R Documentation |
Runs continuous batching: all prompts share the same decode loop, so each
iteration dispatches one matmul over all still-running sequences. This
converts decode from memory-bound vector ops into compute-bound matrix ops
on the GPU and lifts throughput compared to calling llama_generate
in a loop.
llama_generate_batch(
  ctx,
  prompts,
  max_new_tokens = 256L,
  temp = 0.8,
  top_k = 50L,
  top_p = 0.9,
  seed = 42L,
  min_p = 0,
  typical_p = 1,
  repeat_penalty = 1,
  repeat_last_n = 64L,
  frequency_penalty = 0,
  presence_penalty = 0,
  grammar = NULL
)
| ctx | Context handle returned by [llama_new_context], created with sufficient |
| prompts | Character vector of prompts, one per parallel sequence. |
| max_new_tokens, temp, top_k, top_p, seed, min_p, typical_p, repeat_penalty, repeat_last_n, frequency_penalty, presence_penalty, grammar | Sampling parameters; see |
The context must be created with n_seq_max >= length(prompts) and
n_ctx large enough to hold every prompt plus its generated tokens
simultaneously. As a rule of thumb:
n_ctx >= sum(prompt_lengths) + length(prompts) * max_new_tokens.
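The rule of thumb above can be sketched as a small helper. This is only an illustration: required_n_ctx is a hypothetical name, not a function from the package, and the token counts are made-up numbers that would in practice come from tokenizing each prompt.

```r
# Hypothetical helper (not part of the package): lower bound on n_ctx from
# the rule of thumb sum(prompt_lengths) + length(prompts) * max_new_tokens.
required_n_ctx <- function(prompt_lengths, max_new_tokens = 256L) {
  stopifnot(is.numeric(prompt_lengths), all(prompt_lengths >= 0))
  sum(prompt_lengths) + length(prompt_lengths) * max_new_tokens
}

# Illustrative token counts for four prompts (assumed, not measured)
prompt_lengths <- c(12L, 11L, 11L, 13L)
n_ctx_needed <- required_n_ctx(prompt_lengths, max_new_tokens = 16L)
n_ctx_needed
```

Any n_ctx at or above this value satisfies the rule of thumb; rounding up leaves some headroom if the prompts change.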
Each sequence gets its own sampler chain seeded with seed + seq_index,
so identical prompts still produce diverse outputs at temp > 0
(useful for self-consistency sampling). Sampler hyperparameters are shared
across sequences in this version.
Each sequence stops independently, either on the model-defined
end-of-generation token or when max_new_tokens is reached. Mirostat and
with_timings are not supported here yet; use llama_generate for those.
A list of length length(prompts), in the same order as the
input. Each element is a list with fields:
text: character scalar with the generated text
n_tokens: integer count of tokens generated
finished_reason: "eos" or "max_tokens"
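As a rough sketch of working with this return value, the list can be flattened into a data frame using the three documented fields. The out object below is a mock with made-up values standing in for a real call, so the snippet runs without a model.

```r
# Mock result with the documented fields; values are invented for illustration.
out <- list(
  list(text = "positive", n_tokens = 2L, finished_reason = "eos"),
  list(text = "negative, because the", n_tokens = 16L,
       finished_reason = "max_tokens")
)

# One row per sequence, in the same order as the input prompts
res <- data.frame(
  text = vapply(out, `[[`, character(1), "text"),
  n_tokens = vapply(out, `[[`, integer(1), "n_tokens"),
  finished_reason = vapply(out, `[[`, character(1), "finished_reason"),
  stringsAsFactors = FALSE
)

# Sequences that hit the token limit may have been cut off mid-answer
truncated <- which(res$finished_reason == "max_tokens")
```

Checking finished_reason this way is one simple means of spotting outputs that might need a larger max_new_tokens.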
## Not run:
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)
# 4 parallel sequences, up to 256 new tokens each
ctx <- llama_new_context(model, n_ctx = 4096L, n_seq_max = 4L,
                         flash_attn = "on")
# Batch classification
prompts <- c("Classify: 'great movie' as positive/negative.",
             "Classify: 'awful service' as positive/negative.",
             "Classify: 'just okay' as positive/negative.",
             "Classify: 'loved every minute' as positive/negative.")
out <- llama_generate_batch(ctx, prompts, max_new_tokens = 16L, temp = 0)
vapply(out, `[[`, character(1), "text")
# Self-consistency sampling: same prompt repeated
samples <- llama_generate_batch(ctx, rep("2 + 2 =", 4L),
                                max_new_tokens = 8L, temp = 0.7)
## End(Not run)
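For the self-consistency case, one possible way to combine the samples is a majority vote over the extracted answers. The sketch below uses made-up sample texts in place of real model output, and the digit extraction is an assumed normalization step chosen for this arithmetic prompt, not something the package provides.

```r
# Made-up texts standing in for vapply(samples, `[[`, character(1), "text")
answers <- c(" 4", "4", " 4.", "5")

# Keep the first run of digits from each sample (assumed normalization;
# samples without any digits are dropped by regmatches)
digits <- regmatches(answers, regexpr("[0-9]+", answers))

# Count votes and pick the most frequent answer
votes <- table(digits)
majority <- names(votes)[which.max(votes)]
majority
```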