llama_gen_begin: Begin a streaming (token-by-token) generation

View source: R/llama.R

llama_gen_begin R Documentation

Description

Sets up sampling and prefills the prompt, returning an opaque state handle from which text is pulled one chunk at a time with [llama_gen_next]. This is the streaming counterpart to [llama_generate]: it uses the same sampler chain and produces the same output for a given seed, but the text arrives incrementally, so it can be pushed into a Server-Sent Events (SSE) stream as it is produced.
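Because the text is intended for an SSE stream, each chunk has to be framed as an SSE event before it is written to the client. A minimal sketch of that framing follows. `sse_event` is a hypothetical helper and not part of llamaR. It handles only "\n" line breaks inside a chunk.

```r
# Hypothetical helper (not part of llamaR): frame one text chunk as a
# Server-Sent Events message. Every line of the chunk gets its own
# "data: " field, and a blank line terminates the event.
sse_event <- function(chunk) {
  # Appending "\n" makes strsplit() keep a trailing empty line, so a chunk
  # that ends in a newline is reproduced exactly on the receiving side.
  lines <- strsplit(paste0(chunk, "\n"), "\n", fixed = TRUE)[[1]]
  paste0(paste0("data: ", lines, collapse = "\n"), "\n\n")
}
```

Inside a streaming loop, the result of `sse_event(chunk)` would be written to whatever connection the web framework in use provides.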

Usage

llama_gen_begin(
  ctx,
  prompt,
  max_new_tokens = 256L,
  temp = 0.8,
  top_k = 50L,
  top_p = 0.9,
  seed = 42L,
  min_p = 0,
  typical_p = 1,
  repeat_penalty = 1,
  repeat_last_n = 64L,
  frequency_penalty = 0,
  presence_penalty = 0,
  mirostat = 0L,
  mirostat_tau = 5,
  mirostat_eta = 0.1,
  grammar = NULL
)

Arguments

ctx

Context handle returned by [llama_new_context]

prompt

Character string prompt

max_new_tokens

Maximum number of tokens to generate

temp

Sampling temperature. 0 = greedy decoding.

top_k

Top-K filtering (0 = disabled)

top_p

Top-P (nucleus) filtering (1.0 = disabled)

seed

Random seed for sampling

min_p

Min-P filtering threshold (0.0 = disabled)

typical_p

Locally typical sampling threshold (1.0 = disabled)

repeat_penalty

Repetition penalty (1.0 = disabled)

repeat_last_n

Number of most recent tokens to penalize (0 = disabled, -1 = context size)

frequency_penalty

Frequency penalty (0.0 = disabled)

presence_penalty

Presence penalty (0.0 = disabled)

mirostat

Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)

mirostat_tau

Mirostat target entropy (tau parameter)

mirostat_eta

Mirostat learning rate (eta parameter)

grammar

GBNF grammar string for constrained generation (NULL = disabled)
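As an illustration of how the arguments combine, the sketch below starts a greedy (temp = 0), grammar-constrained generation that can only answer "yes" or "no". `begin_yes_no` is a hypothetical wrapper. The sketch assumes llamaR is attached and that `ctx` comes from [llama_new_context]. Only arguments listed above are used.

```r
# Sketch only; assumes library(llamaR) has been called and `ctx` is a
# context handle from llama_new_context().
# begin_yes_no() is a hypothetical wrapper, not part of llamaR.
begin_yes_no <- function(ctx, question) {
  # GBNF grammar: the whole output must be the literal "yes" or "no".
  grammar <- 'root ::= "yes" | "no"'
  llama_gen_begin(
    ctx, question,
    max_new_tokens = 4L,  # the answer is short, so cap generation early
    temp = 0,             # 0 = greedy decoding
    grammar = grammar
  )
}
```

The returned handle is consumed with [llama_gen_next] and [llama_gen_end] like any other streaming generation.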

Details

Typical loop:

st <- llama_gen_begin(ctx, prompt)
repeat {
  chunk <- llama_gen_next(st)
  if (is.null(chunk)) break
  cat(chunk)
}
cat(llama_gen_end(st))  # flush any held-back trailing bytes
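The same loop can be wrapped in a function that forwards each chunk to a callback and also returns the full text. This is a sketch, not part of the package: `stream_collect` and `on_chunk` are hypothetical names, and it assumes llamaR is attached.

```r
# Sketch only; assumes library(llamaR) has been called.
# stream_collect() and its on_chunk argument are hypothetical names.
stream_collect <- function(ctx, prompt, on_chunk = cat, ...) {
  st <- llama_gen_begin(ctx, prompt, ...)  # extra args go to the sampler setup
  pieces <- character()
  repeat {
    chunk <- llama_gen_next(st)
    if (is.null(chunk)) break              # NULL signals the end of generation
    on_chunk(chunk)                        # e.g. print or push to a client
    pieces <- c(pieces, chunk)
  }
  tail <- llama_gen_end(st)                # held-back trailing bytes, if any
  on_chunk(tail)
  paste0(c(pieces, tail), collapse = "")   # full generated text
}
```

Calling `stream_collect(ctx, prompt, temp = 0)` would print the text as it arrives and return it as a single string.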

Only one streaming generation may be active per context at a time: each call to llama_gen_begin clears the context's KV cache, which invalidates any generation still in progress on that context.
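Since generations on one context must not overlap, any comparison between the batch and streaming paths has to run them one after the other. The sketch below checks the claim that both give the same output for a given seed. `same_as_batch` is a hypothetical helper. It assumes llamaR is attached, that [llama_generate] accepts `ctx`, `prompt` and `seed` and returns a single character string, and that its other sampling defaults match those of llama_gen_begin. None of these assumptions is confirmed by this page.

```r
# Sketch only; assumes library(llamaR) has been called.
# ASSUMPTION: llama_generate(ctx, prompt, seed = ) exists with this shape,
# returns one character string, and shares llama_gen_begin's defaults.
same_as_batch <- function(ctx, prompt, seed = 42L) {
  # Run the batch call to completion first, so the two never overlap.
  batch <- llama_generate(ctx, prompt, seed = seed)

  # Then stream the same prompt with the same seed on the same context.
  st <- llama_gen_begin(ctx, prompt, seed = seed)
  pieces <- character()
  repeat {
    chunk <- llama_gen_next(st)
    if (is.null(chunk)) break
    pieces <- c(pieces, chunk)
  }
  streamed <- paste0(c(pieces, llama_gen_end(st)), collapse = "")

  identical(batch, streamed)
}
```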

Value

An external pointer holding the generation state. Pass it to [llama_gen_next] and [llama_gen_end]. The underlying sampler is freed automatically by the garbage collector.

See Also

[llama_gen_next], [llama_gen_end], [llama_generate]


llamaR documentation built on May 28, 2026, 1:06 a.m.