chat_llamar: Chat with a local model through an ellmer::Chat object

chat_llamar    R Documentation

Description

Returns an ellmer Chat object backed by a local GGUF model, so the whole ellmer / ragnar toolchain (turns, tools, streaming, structured output, ragnar_register_tool_retrieve(), …) works against local inference, within the limits described under Tool calls. Transport is the OpenAI-compatible HTTP API served by llama_serve_openai(); this function is a thin wrapper around ellmer::chat_vllm() pointed at that server. (We use the vLLM provider because it speaks /v1/chat/completions, the de facto standard our server implements, whereas ellmer's chat_openai() and chat_openai_compatible() target OpenAI's newer /v1/responses.)

Usage

chat_llamar(
  model_path = NULL,
  base_url = NULL,
  port = 11434L,
  n_ctx = 4096L,
  n_gpu_layers = -1L,
  model_id = NULL,
  system_prompt = NULL,
  timeout = 180,
  ...
)

Arguments

model_path

Path to a GGUF model file. Spawns a server (mode A). Mutually exclusive with base_url.

base_url

Base URL of a running OpenAI-compatible server, e.g. "http://127.0.0.1:11434/v1". Connects to it (mode B). Mutually exclusive with model_path.

port

Port for the spawned server (mode A only). Default 11434.

n_ctx, n_gpu_layers

Passed to llama_serve_openai() when spawning (mode A only).

model_id

Model identifier reported to ellmer. Defaults to the model file's base name in mode A; "llamar" in mode B.

system_prompt

Optional system prompt for the chat.

timeout

Seconds to wait for a spawned server to accept connections before erroring (mode A only). Default 180, because large models (e.g. a 14B model at Q8) can take a couple of minutes to load from disk.

...

Passed on to ellmer::chat_vllm().
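The timeout argument bounds a wait-for-readiness loop. A minimal sketch of such a loop follows, assuming "ready" means "the TCP port accepts a connection"; wait_for_port() and its interval argument are hypothetical, not part of llamaR, and the real implementation may poll differently (for example, an HTTP endpoint).

```r
# Hypothetical helper (not part of llamaR): poll a TCP port until something
# accepts a connection or `timeout` seconds elapse. Returns TRUE or FALSE.
wait_for_port <- function(host = "127.0.0.1", port = 11434L,
                          timeout = 180, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    ok <- tryCatch({
      # socketConnection() warns and then errors if nothing is listening;
      # the warning is muffled and the error is turned into FALSE.
      con <- suppressWarnings(socketConnection(host = host, port = port,
                                               open = "r+", blocking = TRUE,
                                               timeout = 1))
      close(con)
      TRUE
    }, error = function(e) FALSE)
    if (ok) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)  # back off briefly before the next attempt
  }
}
```

Polling with a short sleep keeps the wait cheap while a large model loads, and the deadline turns a server that never comes up into a clean FALSE the caller can report as an error.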

Details

There are two modes, selected by which argument you pass (DBI-style, much as DBI::dbConnect() accepts either connection parameters or a preconfigured DBIConnector object):

base_url (mode B)

Connect to a server you already started (e.g. llama_serve_openai() in another process, or a worker pool). No process is spawned.

model_path (mode A)

Spin up llama_serve_openai() in a background R process (via callr), wait up to timeout seconds for it to accept connections, and return a Chat pointed at it. The server process's lifetime is tied to the returned object: when the object is garbage-collected (or R exits), the process is killed. To stop it sooner, call chat_llamar_stop().

Exactly one of base_url or model_path must be supplied.
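The mode selection can be sketched as a small argument check. resolve_mode() is hypothetical (not llamaR's code), and the 127.0.0.1 host and /v1 path used for mode A are assumptions modeled on the base_url example under Arguments.

```r
# Hypothetical sketch (not llamaR's code) of the exactly-one-of check.
resolve_mode <- function(model_path = NULL, base_url = NULL, port = 11434L) {
  # Both NULL (TRUE == TRUE) or both given (FALSE == FALSE) is an error.
  if (is.null(model_path) == is.null(base_url)) {
    stop("Supply exactly one of `model_path` or `base_url`.", call. = FALSE)
  }
  if (!is.null(base_url)) {
    # Mode B: connect to the server the caller already runs.
    return(list(mode = "B", base_url = base_url))
  }
  # Mode A: a server will be spawned locally on `port` (host/path assumed).
  list(mode = "A",
       base_url = sprintf("http://127.0.0.1:%d/v1", as.integer(port)))
}
```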

Value

An ellmer Chat object. In mode A it additionally carries the background process handle (see chat_llamar_stop()).
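One common way to tie a process's lifetime to an R object is a finalizer registered with reg.finalizer(..., onexit = TRUE), paired with an idempotent eager stop. The sketch below is an assumption about the approach, not llamaR's code: make_handle() is hypothetical, and kill stands in for the process handle's kill method (processes started with callr::r_bg() have a $kill() method).

```r
# Hypothetical sketch (not llamaR's code): a handle whose `kill` callback
# runs exactly once, either on an eager stop() or when the handle is
# garbage-collected / R exits. `kill` stands in for a process's $kill().
make_handle <- function(kill) {
  h <- new.env()
  h$alive <- TRUE
  h$stop <- function() {
    if (h$alive) {        # idempotent: only the first call kills
      h$alive <- FALSE
      kill()
    }
    invisible(NULL)
  }
  # Finalizer runs when `h` is collected, or at session end (onexit = TRUE).
  reg.finalizer(h, function(e) e$stop(), onexit = TRUE)
  h
}

# Usage with a real background process might look like (not run):
# proc <- callr::r_bg(function() Sys.sleep(3600))
# handle <- make_handle(function() proc$kill())
```

Routing both paths through the same stop() means an eager stop followed later by garbage collection does not try to kill the process twice.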

Concurrency

The server is single-sequence (one request at a time); see llama_serve_openai(). For parallel sessions, run a pool of servers on different ports and create one chat_llamar(base_url = ...) per worker, each pointing at its own server.
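A sketch of the one-chat-per-server pattern follows. make_chat_pool() is hypothetical (not part of llamaR); it assumes each port already has its own llama_serve_openai() server running in a separate process, as in the Mode B example, and that each server is reachable at http://127.0.0.1:<port>/v1. The connect argument exists only so the function can be tried with a stub instead of live servers.

```r
# Hypothetical helper (not part of llamaR): build one chat per server in a
# pool, so no two workers share a single-sequence server.
make_chat_pool <- function(ports,
                           connect = function(url) chat_llamar(base_url = url)) {
  # One OpenAI-compatible base URL per server (host and /v1 path assumed).
  urls <- sprintf("http://127.0.0.1:%d/v1", as.integer(ports))
  chats <- lapply(urls, connect)
  names(chats) <- paste0("worker", seq_along(chats))
  chats
}

# Usage (not run): three servers on ports 11434-11436, one chat each.
# pool <- make_chat_pool(11434:11436)
# pool$worker2$chat("Hello!")
```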

Tool calls

Tool calling and structured output are mediated by the OpenAI protocol, so they work only as far as the server implements them. The current server does not emit tool_calls yet (see TODO), so ellmer tools registered on the returned chat will not be invoked by the model.
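For reference, an OpenAI-style chat completion signals a tool call through choices[[1]]$message$tool_calls: a list of entries with an id, type = "function", and a function element holding the tool name and JSON-encoded arguments. The hypothetical helper below (not part of llamaR or ellmer) checks a parsed response, as a nested list, for that field. It is one way to confirm whether a given server emits tool calls at all.

```r
# Hypothetical helper: does a parsed /v1/chat/completions response (a nested
# list, e.g. from jsonlite::fromJSON(txt, simplifyVector = FALSE)) carry
# OpenAI-style tool calls in its first choice?
has_tool_calls <- function(resp) {
  calls <- resp$choices[[1]]$message$tool_calls
  !is.null(calls) && length(calls) > 0
}
```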

See Also

llama_serve_openai(), chat_llamar_stop()

Examples

## Not run: 
# Mode A: spawn a server for this model and chat with it.
chat <- chat_llamar(model_path = "model.gguf")
chat$chat("Why is the sky blue?")
chat_llamar_stop(chat)            # or let GC do it

# Mode B: connect to a server you already run.
llama_serve_openai("model.gguf", port = 11434L)   # in another process
chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1")
chat$chat("Hello!")

## End(Not run)

llamaR documentation built on May 28, 2026, 1:06 a.m.