llama_new_context: Create an inference context
In llamaR: Interface for Large Language Models via 'llama.cpp'

llama_new_context

R Documentation

Create an inference context

Description

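Creates an inference context for a loaded model. The context holds the runtime settings (context size, thread counts, batch sizes) and the KV cache used by generation and embedding functions.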

Usage

llama_new_context(
  model,
  n_ctx = 2048L,
  n_threads = NULL,
  n_threads_batch = NULL,
  n_batch = 2048L,
  n_ubatch = 512L,
  n_seq_max = 1L,
  flash_attn = "auto",
  embedding = FALSE
)

Arguments

`model`	Model handle returned by `llama_load_model`.
`n_ctx`	Context window size (number of tokens). `0` means use the context length the model was trained with. Default `2048L`.
`n_threads`	Number of CPU threads for single-token decode. `NULL` (default) picks `2L` when a GPU backend is available, otherwise `4L`.
`n_threads_batch`	Number of CPU threads for batch (prompt) processing. `NULL` (default) inherits from `n_threads`.
`n_batch`	Logical maximum batch size submitted to a single decode call (tokens). Default `2048L` matches llama.cpp.
`n_ubatch`	Physical micro-batch size used inside decode. Larger values improve prefill throughput on GPU at the cost of memory. Default `512L`.
`n_seq_max`	Maximum number of parallel sequences the context can hold simultaneously (the KV cache is partitioned across them). Default `1L` for single-prompt use; raise to `N` when using `llama_generate_batch` with `N` prompts. Increasing this does not by itself enlarge the context, so size `n_ctx` accordingly.
`flash_attn`	One of `"auto"` (let llama.cpp decide, default), `"on"` (force enable Flash Attention), or `"off"` (disable).
`embedding`	Logical; if `TRUE`, create the context in embedding mode. This enables embedding output and disables causal attention, which suits embedding models (e.g. nomic-embed, bge). When `TRUE`, `llama_embed_batch` uses an efficient pooled batch decode.
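The default and sizing rules described above can be sketched in plain R. This is only an illustration: `resolve_threads()` and `size_context()` are hypothetical helpers, not part of llamaR, the `gpu_available` flag stands in for however the package detects a GPU backend, and the sizing helper assumes the KV cache is split evenly so that each sequence gets `n_ctx / n_seq_max` tokens.

```r
# Hypothetical helpers (NOT part of llamaR) mirroring the documented rules.

# Thread defaults: NULL n_threads -> 2L with a GPU backend, otherwise 4L;
# NULL n_threads_batch inherits from n_threads.
# `gpu_available` is an assumed flag; the package's real detection is not shown here.
resolve_threads <- function(n_threads = NULL, n_threads_batch = NULL,
                            gpu_available = FALSE) {
  if (is.null(n_threads)) {
    n_threads <- if (gpu_available) 2L else 4L
  }
  if (is.null(n_threads_batch)) {
    n_threads_batch <- n_threads
  }
  list(n_threads = as.integer(n_threads),
       n_threads_batch = as.integer(n_threads_batch))
}

# Context sizing for N parallel prompts, ASSUMING an even split of the
# KV cache, i.e. a per-sequence budget of n_ctx / n_seq_max tokens.
size_context <- function(n_prompts, tokens_per_seq) {
  stopifnot(n_prompts >= 1, tokens_per_seq >= 1)
  list(n_ctx = as.integer(n_prompts * tokens_per_seq),
       n_seq_max = as.integer(n_prompts))
}

threads <- resolve_threads(gpu_available = TRUE)
sizes <- size_context(n_prompts = 4L, tokens_per_seq = 1024L)
str(c(threads, sizes))
```

Under these assumptions, the computed values could be passed on as `llama_new_context(model, n_ctx = sizes$n_ctx, n_seq_max = sizes$n_seq_max)`.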

Value

An external pointer (class `externalptr`) wrapping the inference context. This handle is required by generation, tokenization, and embedding functions. The context is freed automatically by the garbage collector, or manually via `llama_free_context`.
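Because the handle can be freed manually, one way to make sure that happens even when an error occurs is to scope it with `on.exit()`. The sketch below is a generic pattern, not a llamaR function: `with_context()` is a hypothetical helper, and the constructor and destructor are passed in as arguments so the pattern can be exercised here with stand-ins instead of a real model.

```r
# Hypothetical helper (NOT part of llamaR): run fn(ctx) and always free ctx.
# `new_ctx` creates the handle, `free_ctx` releases it.
with_context <- function(new_ctx, free_ctx, fn) {
  ctx <- new_ctx()
  # on.exit runs whether fn() returns normally or signals an error
  on.exit(free_ctx(ctx), add = TRUE)
  fn(ctx)
}

# Stand-ins that record calls, used in place of a real model and context
calls <- character()
fake_new <- function() {
  calls <<- c(calls, "new")
  "ctx-handle"
}
fake_free <- function(ctx) {
  calls <<- c(calls, "free")
  invisible(NULL)
}

# Simulate a failure during generation; the context is still released
result <- tryCatch(
  with_context(fake_new, fake_free, function(ctx) stop("generation failed")),
  error = function(e) conditionMessage(e)
)

# With llamaR, the same pattern would presumably be called as:
# with_context(function() llama_new_context(model, n_ctx = 4096L),
#              llama_free_context,
#              function(ctx) { ... })
```

Relying on the garbage collector alone also works according to the description above; explicit scoping simply makes the release point predictable.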

Examples

## Not run: 
model <- llama_load_model("model.gguf")
ctx <- llama_new_context(model, n_ctx = 4096L, n_threads = 8L)
# ... use context for generation ...
llama_free_context(ctx)

# Tune for GPU prefill throughput
ctx <- llama_new_context(model, n_ctx = 4096L,
                         n_ubatch = 2048L, flash_attn = "on")
llama_free_context(ctx)

# Embedding mode
emb_ctx <- llama_new_context(model, n_ctx = 512L, embedding = TRUE)
mat <- llama_embed_batch(emb_ctx, c("hello", "world"))
llama_free_context(emb_ctx)

# Free the model last, after the contexts created from it
llama_free_model(model)

## End(Not run)

llamaR documentation built on May 28, 2026, 1:06 a.m.