llama_generate: Generate text from a prompt


llama_generate    R Documentation

Generate text from a prompt

Description

Tokenizes the prompt, runs the full autoregressive decode loop with sampling, and returns the generated text (excluding the original prompt).

Usage

llama_generate(
  ctx,
  prompt,
  max_new_tokens = 256L,
  temp = 0.8,
  top_k = 50L,
  top_p = 0.9,
  seed = 42L,
  min_p = 0,
  typical_p = 1,
  repeat_penalty = 1,
  repeat_last_n = 64L,
  frequency_penalty = 0,
  presence_penalty = 0,
  mirostat = 0L,
  mirostat_tau = 5,
  mirostat_eta = 0.1,
  grammar = NULL,
  with_timings = FALSE
)

Arguments

ctx

Context handle returned by llama_new_context()

prompt

Character string prompt

max_new_tokens

Maximum number of tokens to generate

temp

Sampling temperature. 0 = greedy decoding.

top_k

Top-K filtering (0 = disabled)

top_p

Top-P (nucleus) filtering (1.0 = disabled)

seed

Random seed for sampling

min_p

Min-P filtering threshold (0.0 = disabled)

typical_p

Locally typical sampling threshold (1.0 = disabled)

repeat_penalty

Repetition penalty (1.0 = disabled)

repeat_last_n

Number of most recent tokens to penalize (0 = disabled, -1 = context size)

frequency_penalty

Frequency penalty (0.0 = disabled)

presence_penalty

Presence penalty (0.0 = disabled)

mirostat

Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)

mirostat_tau

Mirostat target entropy (tau parameter)

mirostat_eta

Mirostat learning rate (eta parameter)

grammar

GBNF grammar string for constrained generation (NULL = disabled)

with_timings

If TRUE, attach a named numeric vector of per-stage timings (in ms) to the returned text as the attribute "timings". Stages: tokenize, build_sampler, kv_clear, prefill_dispatch, prefill_sync, gpu_sync (cumulative across decode-loop iterations), sample (cumulative), decode_dispatch (cumulative), and detokenize. The vector also contains n_iterations and t_total_ms. Enabling this option adds llama_synchronize calls inside the decode loop, so it is intended for profiling and may slightly slow generation.
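
The "timings" attribute described above can be read with attr(). The following is a minimal sketch, not part of the package: summarize_timings() is a hypothetical helper, and the mock object with made-up numbers stands in for a real llama_generate(..., with_timings = TRUE) result so the code runs without a model. It assumes only the element names listed in the with_timings description.

```r
# Hypothetical helper (not part of llamaR): summarize the "timings" attribute.
# Assumes a named numeric vector that includes n_iterations and t_total_ms,
# as listed in the with_timings argument description.
summarize_timings <- function(text) {
  timings <- attr(text, "timings")
  if (is.null(timings)) {
    stop("no 'timings' attribute; call llama_generate(..., with_timings = TRUE)")
  }
  meta <- c("n_iterations", "t_total_ms")
  stages <- timings[setdiff(names(timings), meta)]
  list(
    # Per-stage times in ms, slowest stage first
    stages_ms = sort(stages, decreasing = TRUE),
    # Total wall time divided by decode-loop iterations (includes prefill cost)
    ms_per_iteration = unname(timings["t_total_ms"] / timings["n_iterations"])
  )
}

# Mock value standing in for a real result; all numbers are invented.
mock <- structure(
  "generated text",
  timings = c(tokenize = 1, build_sampler = 2, kv_clear = 0.5,
              prefill_dispatch = 10, prefill_sync = 40, gpu_sync = 300,
              sample = 20, decode_dispatch = 25, detokenize = 1.5,
              n_iterations = 100, t_total_ms = 400)
)

s <- summarize_timings(mock)
print(s$stages_ms)
print(s$ms_per_iteration)
```

With a real context, the same helper would be applied to the value returned by llama_generate(ctx, prompt, with_timings = TRUE).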

Value

A character scalar containing the generated text (excluding the original prompt).

Examples

## Not run: 
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)
ctx <- llama_new_context(model, n_ctx = 2048L)

# Basic generation
result <- llama_generate(ctx, "Once upon a time")
cat(result)

# Greedy decoding (deterministic)
result <- llama_generate(ctx, "The answer is", temp = 0)

# More creative output
result <- llama_generate(ctx, "Write a poem about R:",
                         max_new_tokens = 100L,
                         temp = 1.0, top_p = 0.95)

# With repetition penalty
result <- llama_generate(ctx, "List items:",
                         repeat_penalty = 1.1, repeat_last_n = 64L)

# JSON output with grammar
result <- llama_generate(ctx, "Output JSON:",
                         grammar = 'root ::= "{" "}" ')

## End(Not run)
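
To make the temp, top_k, and top_p arguments more concrete, here is a toy sketch in plain R that applies the three filters to a small logit vector. It is illustrative only: filter_probs() is a hypothetical function, the logits are invented, and the real sampler used by the package may order or implement these steps differently.

```r
# Illustrative only: toy temperature, top-k, and top-p (nucleus) filtering.
# filter_probs() is hypothetical and is NOT the package's implementation.
filter_probs <- function(logits, temp = 0.8, top_k = 50L, top_p = 0.9) {
  if (temp <= 0) {
    # temp = 0 means greedy decoding: all mass on the highest logit
    p <- numeric(length(logits))
    p[which.max(logits)] <- 1
    names(p) <- names(logits)
    return(p)
  }
  # Temperature-scaled softmax (subtract the max for numerical stability)
  z <- logits / temp
  p <- exp(z - max(z))
  p <- p / sum(p)

  # Work on probabilities sorted from most to least likely
  ord <- order(p, decreasing = TRUE)
  ps <- p[ord]

  # Top-K: keep only the top_k most likely tokens (0 = disabled)
  if (top_k > 0L) ps[seq_along(ps) > top_k] <- 0
  ps <- ps / sum(ps)

  # Top-P: keep the smallest prefix whose cumulative mass reaches top_p
  if (top_p < 1) {
    cum_before <- cumsum(ps) - ps
    ps[cum_before >= top_p] <- 0
    ps <- ps / sum(ps)
  }

  # Restore the original token order
  out <- numeric(length(p))
  out[ord] <- ps
  names(out) <- names(logits)
  out
}

logits <- c(a = 2.0, b = 1.0, c = 0.5, d = -1.0)  # invented values
probs  <- filter_probs(logits, temp = 1, top_k = 3L, top_p = 0.9)
print(round(probs, 3))

# Draw one token from the filtered distribution
set.seed(42)
sample(names(probs), size = 1, prob = probs)
```

Lower temp sharpens the distribution before filtering, while smaller top_k or top_p values drop more of the low-probability tail.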

llamaR documentation built on May 28, 2026, 1:06 a.m.