| llama_generate | R Documentation |
Tokenizes the prompt, runs the full autoregressive decode loop with sampling, and returns the generated text (excluding the original prompt).
llama_generate(
ctx,
prompt,
max_new_tokens = 256L,
temp = 0.8,
top_k = 50L,
top_p = 0.9,
seed = 42L,
min_p = 0,
typical_p = 1,
repeat_penalty = 1,
repeat_last_n = 64L,
frequency_penalty = 0,
presence_penalty = 0,
mirostat = 0L,
mirostat_tau = 5,
mirostat_eta = 0.1,
grammar = NULL,
with_timings = FALSE
)
ctx |
Context handle returned by llama_new_context() |
prompt |
Character string prompt |
max_new_tokens |
Maximum number of tokens to generate |
temp |
Sampling temperature. 0 = greedy decoding. |
top_k |
Top-K filtering (0 = disabled) |
top_p |
Top-P (nucleus) filtering (1.0 = disabled) |
seed |
Random seed for sampling |
min_p |
Min-P filtering threshold (0.0 = disabled) |
typical_p |
Locally typical sampling threshold (1.0 = disabled) |
repeat_penalty |
Repetition penalty (1.0 = disabled) |
repeat_last_n |
Number of most recent tokens considered for the repetition penalty (0 = disabled, -1 = context size) |
frequency_penalty |
Frequency penalty (0.0 = disabled) |
presence_penalty |
Presence penalty (0.0 = disabled) |
mirostat |
Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
mirostat_tau |
Mirostat target entropy (tau parameter) |
mirostat_eta |
Mirostat learning rate (eta parameter) |
grammar |
GBNF grammar string for constrained generation (NULL = disabled) |
with_timings |
If TRUE, attach a named numeric vector of per-stage timings (in ms) as the "timings" attribute of the returned text. Stages: tokenize, build_sampler, kv_clear, prefill_dispatch, prefill_sync, gpu_sync (cumulative across decode-loop iterations), sample (cumulative), decode_dispatch (cumulative), and detokenize. The vector also contains the iteration count n_iterations and the total time t_total_ms. Enabling this option adds llama_synchronize calls inside the decode loop, so it is intended for profiling and may slightly slow generation. |
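The interaction of temp, top_k, top_p, and min_p can be easier to follow on a toy distribution. The sketch below is illustrative only: toy_filter is a hypothetical helper, not part of the package, and the order in which it applies the filters is an assumption that may differ from what llama_generate does internally.

```r
# Illustrative sketch, NOT the package's sampler. `toy_filter` is a
# hypothetical helper; the filter order (top-k, top-p, min-p) is an assumption.
toy_filter <- function(logits, temp = 0.8, top_k = 50L, top_p = 0.9, min_p = 0) {
  n <- length(logits)

  # temp = 0: greedy decoding, all mass on the highest logit
  if (temp <= 0) {
    probs <- numeric(n)
    probs[which.max(logits)] <- 1
    return(probs)
  }

  # Temperature-scaled softmax (subtract the max for numerical stability)
  scaled <- logits / temp
  probs <- exp(scaled - max(scaled))
  probs <- probs / sum(probs)
  ord <- order(probs, decreasing = TRUE)

  # Top-K: keep only the K most probable tokens (0 = disabled)
  if (top_k > 0L && top_k < n) {
    probs[ord[-seq_len(top_k)]] <- 0
    probs <- probs / sum(probs)
  }

  # Top-P: keep the smallest set of tokens whose cumulative mass reaches top_p
  # (1.0 = disabled)
  if (top_p < 1) {
    cum <- cumsum(probs[ord])
    n_keep <- which(cum >= top_p)[1]
    if (is.na(n_keep)) n_keep <- n
    probs[ord[-seq_len(n_keep)]] <- 0
    probs <- probs / sum(probs)
  }

  # Min-P: drop tokens below min_p times the top probability (0 = disabled)
  if (min_p > 0) {
    probs[probs < min_p * max(probs)] <- 0
    probs <- probs / sum(probs)
  }

  probs
}

# Four candidate tokens with probabilities 0.5, 0.3, 0.15, 0.05
toy_filter(log(c(0.5, 0.3, 0.15, 0.05)), temp = 1, top_k = 3L, top_p = 0.8)
```

With these settings, top_k = 3 removes the least likely token and top_p = 0.8 then keeps only the first two, so the remaining mass is renormalized over two tokens.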
A character scalar containing the generated text (excluding the original prompt).
## Not run:
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)
ctx <- llama_new_context(model, n_ctx = 2048L)
# Basic generation
result <- llama_generate(ctx, "Once upon a time")
cat(result)
# Greedy decoding (deterministic)
result <- llama_generate(ctx, "The answer is", temp = 0)
# More creative output
result <- llama_generate(ctx, "Write a poem about R:",
                         max_new_tokens = 100L,
                         temp = 1.0, top_p = 0.95)
# With repetition penalty
result <- llama_generate(ctx, "List items:",
                         repeat_penalty = 1.1, repeat_last_n = 64L)
# JSON output with grammar
result <- llama_generate(ctx, "Output JSON:",
                         grammar = 'root ::= "{" "}" ')
## End(Not run)
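When with_timings = TRUE, the stages marked as cumulative can be averaged per decode-loop iteration. The following is a minimal sketch: per_iteration_timings is a hypothetical helper, not part of the package. It assumes the names of the "timings" vector match the stage names listed under with_timings.

```r
# Hypothetical helper: average the cumulative decode-loop stages per iteration.
# Assumes attr(result, "timings") is a named numeric vector whose names match
# the stage names documented for with_timings.
per_iteration_timings <- function(result) {
  timings <- attr(result, "timings")
  if (is.null(timings)) {
    stop("no \"timings\" attribute; call llama_generate(..., with_timings = TRUE)")
  }
  cumulative <- c("gpu_sync", "sample", "decode_dispatch")
  n_iter <- timings[["n_iterations"]]
  if (n_iter <= 0) {
    # Nothing was generated, so there is no meaningful average
    return(setNames(rep(NA_real_, length(cumulative)), cumulative))
  }
  timings[cumulative] / n_iter
}

## Not run:
# result <- llama_generate(ctx, "Once upon a time", with_timings = TRUE)
# attr(result, "timings")        # raw per-stage timings in ms
# per_iteration_timings(result)  # mean ms per iteration for cumulative stages
## End(Not run)
```

Because profiling mode inserts extra synchronization, these averages are better suited to comparing stages against each other than to estimating unprofiled throughput.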