chat_llamar: Chat with a local model through an ellmer::Chat object
In llamaR: Interface for Large Language Models via 'llama.cpp'

chat_llamar

R Documentation

Description

Returns an ellmer Chat object backed by a local GGUF model, so that the ellmer / ragnar toolchain (turns, tools, streaming, structured output, ragnar_register_tool_retrieve(), …) can run against local inference, within the limits noted under Tool calls. Transport is the OpenAI-compatible HTTP API served by llama_serve_openai; this function is a thin wrapper around ellmer's chat_vllm. The vLLM provider is used because it speaks /v1/chat/completions, the de facto standard that our server implements, whereas ellmer's chat_openai / chat_openai_compatible target OpenAI's newer /v1/responses.

Usage

chat_llamar(
  model_path = NULL,
  base_url = NULL,
  port = 11434L,
  n_ctx = 4096L,
  n_gpu_layers = -1L,
  model_id = NULL,
  system_prompt = NULL,
  timeout = 180,
  ...
)

Arguments

`model_path`	Path to a GGUF model file. Spawns a server (mode A). Mutually exclusive with `base_url`.
`base_url`	Base URL of a running OpenAI-compatible server, e.g. `"http://127.0.0.1:11434/v1"`. Connects to it (mode B). Mutually exclusive with `model_path`.
`port`	Port for the spawned server (mode A only). Default `11434`.
`n_ctx`, `n_gpu_layers`	Passed to `llama_serve_openai` when spawning (mode A only).
`model_id`	Model identifier reported to ellmer. Defaults to the model file's base name in mode A; `"llamar"` in mode B.
`system_prompt`	Optional system prompt for the chat.
`timeout`	Seconds to wait for a spawned server to accept connections before erroring (mode A only). Default `180` — large models (e.g. a 14B at Q8) can take a couple of minutes to load from disk.
`...`	Passed on to `chat_vllm`.
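
For `timeout`, one way to picture "wait for a spawned server to accept connections" is a polling loop with a deadline. The helpers wait_until and port_open below are hypothetical illustrations, not llamaR functions, and the package may probe readiness differently (for example over HTTP).

```r
# Hypothetical sketch: poll `probe()` until it returns TRUE or `timeout` seconds pass.
wait_until <- function(probe, timeout = 180, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(probe())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# One possible probe (an assumption): can a TCP connection be opened on host:port?
port_open <- function(port, host = "127.0.0.1") {
  con <- suppressWarnings(tryCatch(
    socketConnection(host, port, open = "r+", blocking = TRUE, timeout = 1),
    error = function(e) NULL
  ))
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

# Usage (needs a server starting up on this port):
# if (!wait_until(function() port_open(11434L), timeout = 180)) {
#   stop("server did not accept connections in time")
# }
```

Passing the probe as a function keeps the deadline logic separate from how readiness is detected.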

Details

Two modes, selected by which argument you pass (DBI-style, like DBI::dbConnect() accepting either connection parameters or a ready connection):

Mode A (model_path): Spin up llama_serve_openai() in a background R process (via callr), wait up to timeout seconds for it to accept connections, and return a Chat pointed at it. The server process's lifetime is tied to the returned object: when the object is garbage-collected (or R exits), the process is killed. Stop it eagerly with chat_llamar_stop.
Mode B (base_url): Connect to a server you already started (e.g. llama_serve_openai() in another process, or a worker pool). No process is spawned.

Exactly one of base_url or model_path must be supplied.
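
The "exactly one of" rule can be sketched in a few lines of base R. The helper name resolve_mode is hypothetical and not part of llamaR, and the assumption that a spawned server is reached at http://127.0.0.1:<port>/v1 is inferred from the example URL on this page; the actual implementation may differ.

```r
# Hypothetical helper, not part of llamaR: decide which mode applies.
resolve_mode <- function(model_path = NULL, base_url = NULL, port = 11434L) {
  has_path <- !is.null(model_path)
  has_url  <- !is.null(base_url)
  # Both supplied or both missing is an error: exactly one is required.
  if (has_path == has_url) {
    stop("Supply exactly one of `model_path` or `base_url`.", call. = FALSE)
  }
  if (has_path) {
    # Spawn mode: the URL shape is an assumption based on the documented example.
    list(mode = "spawn",
         base_url = sprintf("http://127.0.0.1:%d/v1", as.integer(port)))
  } else {
    # Connect mode: use the caller's URL as given.
    list(mode = "connect", base_url = base_url)
  }
}

resolve_mode(model_path = "model.gguf")$mode                 # "spawn"
resolve_mode(base_url = "http://127.0.0.1:11434/v1")$mode    # "connect"
```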

Value

An ellmer Chat object. In mode A it additionally carries the background process handle (see chat_llamar_stop).
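
The lifetime coupling described under Details (killed on garbage collection or R exit, or eagerly on request) is commonly built in R with reg.finalizer. The sketch below shows that general pattern only and is not llamaR's actual code; new_handle and its kill argument are hypothetical, with kill standing in for whatever terminates the background process.

```r
# Hypothetical sketch: tie a cleanup action to an object's lifetime.
new_handle <- function(kill) {
  handle <- new.env()
  handle$alive <- TRUE
  # Eager, idempotent stop: runs `kill()` at most once.
  handle$stop <- function() {
    if (handle$alive) {
      handle$alive <- FALSE
      kill()
    }
    invisible(NULL)
  }
  # Also run the cleanup when the handle is garbage-collected or R exits.
  reg.finalizer(handle, function(h) h$stop(), onexit = TRUE)
  handle
}

# Usage with a stand-in for "kill the server process":
h <- new_handle(function() message("server stopped"))
h$stop()   # stops now; the finalizer later finds nothing left to do
```

Making the stop action idempotent means an eager stop and the later finalizer cannot both try to kill the process.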

Concurrency

The server is single-sequence: it handles one request at a time (see llama_serve_openai). For parallel sessions, run a pool of servers on different ports and create one chat per worker with chat_llamar(base_url = ...).
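
A worker pool along these lines might be wired up as follows. The port range and the round-robin assignment are illustrative assumptions, and connect_pool and assign_worker are hypothetical helpers; only the chat_llamar(base_url = ) call comes from this page. The servers must already be running on those ports before connect_pool is called.

```r
# Assumed: three servers already listening on consecutive local ports.
ports <- 11434L + 0:2
base_urls <- sprintf("http://127.0.0.1:%d/v1", ports)

# Hypothetical helper: one Chat per server (requires llamaR and running servers).
connect_pool <- function(urls) {
  lapply(urls, function(u) llamaR::chat_llamar(base_url = u))
}

# Round-robin assignment of jobs to workers. A server never sees two requests
# at once as long as each worker processes its own jobs sequentially.
assign_worker <- function(n_jobs, n_workers) {
  (seq_len(n_jobs) - 1L) %% n_workers + 1L
}

# chats <- connect_pool(base_urls)   # not run here: needs live servers
```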

Tool calls

Tool calling and structured output are mediated by the OpenAI protocol, so they work only as far as the server implements them. The current server does not emit tool_calls yet (see TODO), so ellmer tools registered on the returned chat will not be invoked by the model.
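
In the OpenAI chat-completions format, a tool invocation appears as a tool_calls entry on the assistant message. One way to check what a given server returns is to inspect a parsed response. The helper has_tool_calls and the two sample lists below are hypothetical, hand-written stand-ins for parsed JSON, not output from llamaR.

```r
# Hypothetical helper: does a parsed chat-completions response contain tool calls?
has_tool_calls <- function(resp) {
  msg <- resp$choices[[1]]$message
  length(msg$tool_calls) > 0
}

# Hand-written sample: a plain text reply, as a server without tool support sends.
plain <- list(choices = list(list(
  message = list(role = "assistant", content = "Hi")
)))

# Hand-written sample: a reply that requests a tool invocation.
with_tool <- list(choices = list(list(
  message = list(
    role = "assistant",
    tool_calls = list(list(
      id = "call_1",
      type = "function",
      `function` = list(name = "get_time", arguments = "{}")
    ))
  )
)))

has_tool_calls(plain)       # FALSE
has_tool_calls(with_tool)   # TRUE
```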

Examples

## Not run: 
# Mode A: spawn a server for this model and chat with it.
chat <- chat_llamar(model_path = "model.gguf")
chat$chat("Why is the sky blue?")
chat_llamar_stop(chat)            # or let GC do it

# Mode B: connect to a server you already run.
llama_serve_openai("model.gguf", port = 11434L)   # in another process
chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1")
chat$chat("Hello!")

## End(Not run)

llamaR documentation built on May 28, 2026, 1:06 a.m.