llama_generate: Generate text from a prompt
In llamaR: Interface for Large Language Models via 'llama.cpp'

Description

Tokenizes the prompt, runs the full autoregressive decode loop with sampling, and returns the generated text (excluding the original prompt).

Usage

llama_generate(
  ctx,
  prompt,
  max_new_tokens = 256L,
  temp = 0.8,
  top_k = 50L,
  top_p = 0.9,
  seed = 42L,
  min_p = 0,
  typical_p = 1,
  repeat_penalty = 1,
  repeat_last_n = 64L,
  frequency_penalty = 0,
  presence_penalty = 0,
  mirostat = 0L,
  mirostat_tau = 5,
  mirostat_eta = 0.1,
  grammar = NULL,
  with_timings = FALSE
)

Arguments

`ctx`	Context handle returned by `llama_new_context()`
`prompt`	Character string prompt
`max_new_tokens`	Maximum number of tokens to generate
`temp`	Sampling temperature. 0 = greedy decoding.
`top_k`	Top-K filtering (0 = disabled)
`top_p`	Top-P (nucleus) filtering (1.0 = disabled)
`seed`	Random seed for sampling
`min_p`	Min-P filtering threshold (0.0 = disabled)
`typical_p`	Locally typical sampling threshold (1.0 = disabled)
`repeat_penalty`	Repetition penalty (1.0 = disabled)
`repeat_last_n`	Number of most recent tokens considered when applying penalties (0 = disabled, -1 = context size)
`frequency_penalty`	Frequency penalty (0.0 = disabled)
`presence_penalty`	Presence penalty (0.0 = disabled)
`mirostat`	Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
`mirostat_tau`	Mirostat target entropy (tau parameter)
`mirostat_eta`	Mirostat learning rate (eta parameter)
`grammar`	GBNF grammar string for constrained generation (NULL = disabled)
`with_timings`	If TRUE, attach a named numeric vector of per-stage timings (in ms) as attribute "timings" of the returned text. Stages: tokenize, build_sampler, kv_clear, prefill_dispatch, prefill_sync, gpu_sync (cumulative across decode-loop iterations), sample (cumulative), decode_dispatch (cumulative), detokenize, plus n_iterations and t_total_ms. Adds llama_synchronize calls inside the loop, so it is intended for profiling and may slightly slow generation.

Value

A character scalar containing the generated text (excluding the original prompt). If `with_timings = TRUE`, the per-stage timings are attached to it as attribute "timings".
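
The following is a minimal sketch of how the "timings" attribute might be read from the returned value. It does not call `llama_generate()`: `result` is a hand-built stand-in and every number in it is made up for illustration, not measured. Only a few of the documented stage names are included, and `per_token_ms` and `text_only` are hypothetical helper names.

```r
# Stand-in for the value of llama_generate(ctx, prompt, with_timings = TRUE).
# The numbers below are invented for illustration, not real measurements.
result <- structure(
  "generated text",
  timings = c(
    tokenize        = 0.4,
    sample          = 12.5,
    decode_dispatch = 80.0,
    n_iterations    = 32,
    t_total_ms      = 120.0
  )
)

# The timings are an ordinary named numeric vector stored as an attribute.
timings <- attr(result, "timings")

# Rough average cost per decode-loop iteration, in milliseconds.
per_token_ms <- timings[["t_total_ms"]] / timings[["n_iterations"]]

# as.character() drops attributes, leaving only the generated text.
text_only <- as.character(result)
```

Stripping the attribute before passing the text on avoids carrying profiling data into downstream string handling.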

Examples

## Not run: 
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)
ctx <- llama_new_context(model, n_ctx = 2048L)

# Basic generation
result <- llama_generate(ctx, "Once upon a time")
cat(result)

# Greedy decoding (deterministic)
result <- llama_generate(ctx, "The answer is", temp = 0)

# More creative output
result <- llama_generate(ctx, "Write a poem about R:",
                         max_new_tokens = 100L,
                         temp = 1.0, top_p = 0.95)

# With repetition penalty
result <- llama_generate(ctx, "List items:",
                         repeat_penalty = 1.1, repeat_last_n = 64L)

# Grammar-constrained output (this toy grammar only permits the empty JSON object "{}")
result <- llama_generate(ctx, "Output JSON:",
                         grammar = 'root ::= "{" "}" ')

## End(Not run)

llamaR documentation built on May 28, 2026, 1:06 a.m.