llama_new_context: Create an inference context

View source: R/llama.R

llama_new_context    R Documentation

Create an inference context

Description

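Creates an inference context for a loaded model. The context carries the settings listed under Arguments (context window, threading, batching, attention mode) and is the handle that generation, tokenization, and embedding functions operate on.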

Usage

llama_new_context(
  model,
  n_ctx = 2048L,
  n_threads = NULL,
  n_threads_batch = NULL,
  n_batch = 2048L,
  n_ubatch = 512L,
  n_seq_max = 1L,
  flash_attn = "auto",
  embedding = FALSE
)

Arguments

model

Model handle returned by llama_load_model.

n_ctx

Context window size (number of tokens). 0 means use the model's trained value.

n_threads

Number of CPU threads for single-token decode. NULL (default) picks 2L when a GPU backend is available, otherwise 4L.

n_threads_batch

Number of CPU threads for batch (prompt) processing. NULL (default) inherits from n_threads.

n_batch

Logical maximum batch size submitted to a single decode call (tokens). Default 2048L matches llama.cpp.

n_ubatch

Physical micro-batch size used inside decode. Larger values improve prefill throughput on GPU at the cost of memory. Default 512L.

n_seq_max

Maximum number of parallel sequences the context can hold simultaneously (the KV cache is partitioned across them). Default 1L for single-prompt use; raise to N when using llama_generate_batch with N prompts. Increasing this does not by itself enlarge the context, so raise n_ctx as well to keep enough room per sequence.

flash_attn

One of "auto" (let llama.cpp decide, default), "on" (force enable Flash Attention), or "off" (disable).

embedding

Logical; if TRUE, create context in embedding mode. This enables embedding output and disables causal attention, suitable for embedding models (e.g. nomic-embed, bge). When TRUE, llama_embed_batch uses efficient pooled batch decode.
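
The defaulting rules for n_threads and n_threads_batch described above can be sketched in plain R. resolve_threads and its gpu_available argument are hypothetical names used only for illustration; they are not part of the package, and the package's own GPU detection is not shown here.

```r
# Hypothetical helper (not part of llamaR): mirrors the documented defaults.
# gpu_available is an assumed flag standing in for the package's own
# GPU-backend detection.
resolve_threads <- function(n_threads = NULL, n_threads_batch = NULL,
                            gpu_available = FALSE) {
  # NULL n_threads: 2L when a GPU backend is available, otherwise 4L
  if (is.null(n_threads)) {
    n_threads <- if (gpu_available) 2L else 4L
  }
  # NULL n_threads_batch: inherit from n_threads
  if (is.null(n_threads_batch)) {
    n_threads_batch <- n_threads
  }
  list(n_threads = as.integer(n_threads),
       n_threads_batch = as.integer(n_threads_batch))
}

resolve_threads(gpu_available = TRUE)                   # both default to 2L
resolve_threads(n_threads = 8L)                         # batch inherits 8L
resolve_threads(n_threads = 8L, n_threads_batch = 16L)  # explicit values kept
```

Setting n_threads_batch separately is mainly useful when prompt processing should use more cores than single-token decode.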

Value

An external pointer (class externalptr) wrapping the inference context. This handle is required by generation, tokenization, and embedding functions. The context is freed automatically by the garbage collector, or it can be freed manually via llama_free_context.
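
When a context should be released at a predictable point rather than at garbage collection, one option is to pair creation and freeing with on.exit. The sketch below is only an illustration: with_context is a hypothetical helper, not a package function, and it takes the constructor and the free function as arguments so that it runs here with stand-ins instead of a real model.

```r
# Hypothetical helper (not part of llamaR): create a context, run fn on it,
# and free it on exit even if fn signals an error.
with_context <- function(new_context, free_context, fn) {
  ctx <- new_context()
  on.exit(free_context(ctx), add = TRUE)
  fn(ctx)
}

# Stand-ins so the sketch runs without a model file
state <- new.env()
state$freed <- FALSE
result <- with_context(
  new_context  = function() "fake-context",
  free_context = function(ctx) state$freed <- TRUE,
  fn           = function(ctx) toupper(ctx)
)

# With the real package the call would look like this (not run):
# with_context(
#   new_context  = function() llama_new_context(model, n_ctx = 4096L),
#   free_context = llama_free_context,
#   fn           = function(ctx) NULL  # use ctx for generation here
# )
```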

Examples

## Not run: 
model <- llama_load_model("model.gguf")
ctx <- llama_new_context(model, n_ctx = 4096L, n_threads = 8L)
# ... use context for generation ...
llama_free_context(ctx)

# Tune for GPU prefill throughput
ctx <- llama_new_context(model, n_ctx = 4096L,
                         n_ubatch = 2048L, flash_attn = "on")
llama_free_context(ctx)

# Embedding mode
emb_ctx <- llama_new_context(model, n_ctx = 512L, embedding = TRUE)
mat <- llama_embed_batch(emb_ctx, c("hello", "world"))
llama_free_context(emb_ctx)

# Free the model last, once no context created from it is still needed
llama_free_model(model)

## End(Not run)
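
As a further illustration of the n_seq_max note under Arguments, the sketch below sizes n_ctx for several parallel prompts. It assumes the context window is shared evenly across sequences, so each sequence gets roughly n_ctx / n_seq_max tokens; that even split is an assumption of this sketch, and size_context is a hypothetical helper, not a package function.

```r
# Hypothetical helper (not part of llamaR).
# Assumption: n_ctx is divided evenly across the n_seq_max sequences.
size_context <- function(n_prompts, tokens_per_seq) {
  stopifnot(n_prompts >= 1, tokens_per_seq >= 1)
  list(n_seq_max = as.integer(n_prompts),
       n_ctx     = as.integer(n_prompts * tokens_per_seq))
}

# Four prompts, each allowed about 1024 tokens of prompt plus output
cfg <- size_context(n_prompts = 4L, tokens_per_seq = 1024L)

# With a loaded model (not run):
# ctx <- llama_new_context(model, n_ctx = cfg$n_ctx,
#                          n_seq_max = cfg$n_seq_max)
```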

llamaR documentation built on May 28, 2026, 1:06 a.m.