| llama_new_context | R Documentation |
Create an inference context
llama_new_context(
model,
n_ctx = 2048L,
n_threads = NULL,
n_threads_batch = NULL,
n_batch = 2048L,
n_ubatch = 512L,
n_seq_max = 1L,
flash_attn = "auto",
embedding = FALSE
)
model |
Model handle returned by llama_load_model. |
n_ctx |
Context window size (number of tokens). 0 means use the model's trained value. |
n_threads |
Number of CPU threads for single-token decode. |
n_threads_batch |
Number of CPU threads for batch (prompt) processing. |
n_batch |
Logical maximum batch size, in tokens, submitted to a single decode
call. Default 2048L. |
n_ubatch |
Physical micro-batch size used inside decode. Larger values
improve prefill throughput on GPU at the cost of memory. Default 512L. |
n_seq_max |
Maximum number of parallel sequences the context can hold
simultaneously (the KV cache is partitioned across them). Default 1L. |
flash_attn |
One of |
embedding |
Logical; if |
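The relationship between n_batch, n_ubatch, n_seq_max, and n_ctx described above can be illustrated with a little arithmetic. The sketch below is not part of the package: estimate_context_plan is a hypothetical helper, and it assumes that each decode call is split into micro-batches of at most n_ubatch tokens, that a micro-batch never exceeds n_batch, and that the KV cache is divided evenly across sequences. The actual backend may behave differently.

```r
# Hypothetical helper (not provided by the package): rough estimate of how a
# prompt of n_prompt tokens is processed for a given context configuration.
estimate_context_plan <- function(n_prompt, n_ctx = 2048L, n_batch = 2048L,
                                  n_ubatch = 512L, n_seq_max = 1L) {
  # n_ctx = 0 ("use the model's trained value") is not handled in this sketch.
  stopifnot(n_prompt >= 0, n_ctx > 0, n_batch > 0, n_ubatch > 0, n_seq_max > 0)

  # Assumption: the physical micro-batch is capped by the logical batch.
  ubatch <- min(n_ubatch, n_batch)

  # Decode calls needed when the prompt is submitted n_batch tokens at a time.
  full_calls <- n_prompt %/% n_batch
  remainder  <- n_prompt %% n_batch
  n_decode_calls <- full_calls + as.integer(remainder > 0)

  # Each decode call is split into micro-batches independently.
  n_micro_batches <- full_calls * ceiling(n_batch / ubatch) +
    ceiling(remainder / ubatch)

  # Assumption: the KV cache is shared out evenly across sequences.
  tokens_per_seq <- n_ctx %/% n_seq_max

  list(n_decode_calls  = n_decode_calls,
       n_micro_batches = n_micro_batches,
       tokens_per_seq  = tokens_per_seq)
}

# A 3000-token prompt with the default batch sizes, four parallel sequences.
plan <- estimate_context_plan(3000L, n_ctx = 4096L, n_seq_max = 4L)
plan$n_decode_calls   # 2
plan$n_micro_batches  # 6
plan$tokens_per_seq   # 1024
```

Under these assumptions, raising n_ubatch reduces the number of micro-batches per decode call, while raising n_seq_max without raising n_ctx shrinks the token budget available to each sequence.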
An external pointer (class externalptr) wrapping the inference
context. This handle is required by the generation, tokenization, and
embedding functions. It is freed automatically by the garbage collector, or
manually via llama_free_context.
## Not run:
model <- llama_load_model("model.gguf")
ctx <- llama_new_context(model, n_ctx = 4096L, n_threads = 8L)
# ... use context for generation ...
llama_free_context(ctx)
# Tune for GPU prefill throughput (reuses the loaded model)
ctx <- llama_new_context(model, n_ctx = 4096L,
                         n_ubatch = 2048L, flash_attn = "on")
llama_free_context(ctx)
# Embedding mode
emb_ctx <- llama_new_context(model, n_ctx = 512L, embedding = TRUE)
mat <- llama_embed_batch(emb_ctx, c("hello", "world"))
llama_free_context(emb_ctx)
# Free the model last, once no context needs it
llama_free_model(model)
## End(Not run)