README.md
In llamaR: Interface for Large Language Models via 'llama.cpp'

llamaR

R interface to llama.cpp for running local inference of large language models (LLMs) directly from R.

The package supports GPU acceleration via Vulkan, and automatically falls back to CPU when no GPU is available.

- Load and unload models in GGUF format (llama_load_model, llama_free_model)
- Create and free contexts (llama_new_context, llama_free_context)
- Tokenization, detokenization and text generation (llama_tokenize, llama_detokenize, llama_generate)
- Streaming (token-by-token) generation (llama_gen_begin, llama_gen_next, llama_gen_end)
- OpenAI-compatible HTTP server for local models (llama_serve_openai): connect OpenCode, ellmer, the openai SDK, etc.
- ellmer Chat objects backed by local models (chat_llamar): use the ellmer/ragnar toolchain against local inference
- Embedding extraction: single (llama_embeddings), batch (llama_embed_batch), ragnar-compatible (embed_llamar)
- Hugging Face integration: download and cache models (llama_hf_download, llama_load_model_hf, etc.)
- Encoder-decoder model support (T5, BART) via llama_encode
- Explicit backend/device selection and multi-GPU split (llama_load_model(devices = ...))
- NUMA optimization (llama_numa_init)

The package uses ggmlR as the low-level backend. If ggmlR was built with Vulkan support enabled, llamaR automatically uses the GPU for computation. On systems without a GPU, all code runs on CPU with no additional configuration required.

Vulkan support is compiled entirely within ggmlR — llamaR does not compile any Vulkan code itself. However, since llamaR links against libggml.a (from ggmlR) using --whole-archive, the Vulkan symbols (e.g. vkCmdCopyBuffer, vkGetInstanceProcAddr) need to be resolved at link time.

The llamaR configure script handles this automatically:

- Linux: checks pkg-config --exists vulkan and adds -lvulkan to the linker flags
- Windows: checks for the VULKAN_SDK environment variable and adds -lvulkan-1

If Vulkan is not found on the system, the build proceeds without it — the Vulkan backend in libggml.a will simply remain unused, and inference runs on CPU only.
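To get a rough idea, before building, of whether the configure script is likely to find Vulkan, the two checks described above can be approximated from R. This is only a sketch of those checks, not the configure script itself, and the helper name `vulkan_visible` is made up for illustration.

```r
# Rough approximation of the configure-time checks described above.
# `vulkan_visible` is a hypothetical helper, not part of llamaR.
vulkan_visible <- function() {
  if (.Platform$OS.type == "windows") {
    # Windows: the configure script looks for the VULKAN_SDK variable
    return(nzchar(Sys.getenv("VULKAN_SDK")))
  }
  # Linux: the configure script runs `pkg-config --exists vulkan`
  if (!nzchar(Sys.which("pkg-config"))) return(FALSE)
  status <- system2("pkg-config", c("--exists", "vulkan"),
                    stdout = FALSE, stderr = FALSE)
  identical(as.integer(status), 0L)
}

vulkan_visible()
```

A `FALSE` result here only suggests that the build would fall back to CPU-only inference; it does not inspect how ggmlR itself was built.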

Measured on AMD Ryzen 5 5600 + AMD RX 9070, model Ministral-3-3B-Instruct-2512-Q8_0, 50 tokens, avg of 3 runs:

| Backend | Speed (tokens/sec) | Speedup |
|---|---:|---:|
| CPU (8 threads) | 8.5 | 1.0x |
| GPU (Vulkan) | 108.0 | 12.7x |
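The speedup column is simply the ratio of the two throughput figures. The short calculation below reproduces it and, as a back-of-the-envelope extra, estimates how long the 50-token run takes on each backend; that estimate assumes a constant generation rate and ignores model loading and prompt processing.

```r
cpu_tps <- 8.5    # tokens/sec on CPU (8 threads), from the table
gpu_tps <- 108.0  # tokens/sec on GPU (Vulkan), from the table

speedup <- gpu_tps / cpu_tps
round(speedup, 1)   # 12.7

# Approximate generation time for the 50-token benchmark on each backend
n_tokens <- 50
seconds <- c(cpu = n_tokens / cpu_tps, gpu = n_tokens / gpu_tps)
round(seconds, 2)
```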

Requires ggmlR >= 0.5.4:

# Install ggmlR first
remotes::install_github("Zabis13/ggmlR")

# Then llamaR
remotes::install_github("Zabis13/llamaR")

R >= 4.1.0
C++17 compiler
GNU make

library(llamaR)

# Load model
model <- llama_load_model("path/to/model.gguf")

# Create context
ctx <- llama_new_context(model, n_ctx = 2048L, n_threads = 8L)

# Generate text
result <- llama_generate(ctx, "Once upon a time", max_new_tokens = 100L)
cat(result)

# Free resources (optional, GC handles this automatically)
llama_free_context(ctx)
llama_free_model(model)
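If deterministic cleanup is preferred over waiting for the garbage collector, the same calls can be wrapped in a function that frees resources on exit, even when generation fails. The wrapper below is a sketch: `generate_once` is a hypothetical name, and it returns `NULL` when llamaR or the model file is not available so that it can be run anywhere.

```r
# Hypothetical convenience wrapper; only the llama_* calls come from llamaR.
generate_once <- function(model_path, prompt, max_new_tokens = 100L) {
  if (!requireNamespace("llamaR", quietly = TRUE) || !file.exists(model_path)) {
    return(NULL)  # nothing to do without the package or the model file
  }
  model <- llamaR::llama_load_model(model_path)
  on.exit(llamaR::llama_free_model(model), add = TRUE)

  ctx <- llamaR::llama_new_context(model, n_ctx = 2048L, n_threads = 8L)
  # after = FALSE runs this handler first, so the context is freed before the model
  on.exit(llamaR::llama_free_context(ctx), add = TRUE, after = FALSE)

  llamaR::llama_generate(ctx, prompt, max_new_tokens = max_new_tokens)
}

result <- generate_once("path/to/model.gguf", "Once upon a time")
```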

Two guides walk through the package in depth:

Getting Started — loading models, generation, chat templates, tokenization, and embeddings.
Chat and Agents — chat_llamar(), the OpenAI-compatible server, connecting OpenCode/ellmer, and retrieval-augmented chat with ragnar.

browseVignettes("llamaR")
vignette("getting-started", package = "llamaR")
vignette("chat-and-agents", package = "llamaR")

Download GGUF models directly from Hugging Face with automatic caching:

library(llamaR)

# List available GGUF files in a repository
files <- llama_hf_list("TheBloke/Llama-2-7B-GGUF")
print(files)

# Download a specific quantization
path <- llama_hf_download("TheBloke/Llama-2-7B-GGUF", pattern = "*q4_k_m*")

# Or download and load in one step
model <- llama_load_model_hf("TheBloke/Llama-2-7B-GGUF",
                              pattern = "*q4_k_m*",
                              n_gpu_layers = -1L)

# Manage cache
llama_hf_cache_info()
llama_hf_cache_clear()

For private repositories, set the HF_TOKEN environment variable or pass the token directly.
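One way to combine the two options is to resolve the token in a single place: use an explicit value if one is given, otherwise fall back to HF_TOKEN. The helper below is a sketch; `resolve_hf_token` is a made-up name and is not part of llamaR.

```r
# Hypothetical helper: prefer an explicit token, then fall back to HF_TOKEN.
resolve_hf_token <- function(token = NULL) {
  if (!is.null(token) && nzchar(token)) return(token)
  env <- Sys.getenv("HF_TOKEN", unset = "")
  if (nzchar(env)) env else NULL
}

# TRUE if a token is available from either source (the value is not printed)
!is.null(resolve_hf_token())
```

The resolved value could then be handed to the download functions. The argument name `token` is inferred from the sentence above, so check the function help before relying on it.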

# CPU only
model <- llama_load_model("model.gguf")

# With GPU acceleration (all layers)
model <- llama_load_model("model.gguf", n_gpu_layers = -1L)

# Partial GPU offload (first 20 layers)
model <- llama_load_model("model.gguf", n_gpu_layers = 20L)

# Explicit device selection (see llama_backend_devices())
model <- llama_load_model("model.gguf", n_gpu_layers = -1L, devices = "Vulkan0")

# Check GPU availability
if (llama_supports_gpu()) {
  message("GPU available")
}

info <- llama_model_info(model)
cat("Model:", info$desc, "\n")
cat("Layers:", info$n_layer, "\n")
cat("Context:", info$n_ctx_train, "\n")
cat("Embedding size:", info$n_embd, "\n")

ctx <- llama_new_context(model, n_ctx = 4096L)

# Basic generation
result <- llama_generate(ctx, "The meaning of life is")

# Greedy decoding (deterministic)
result <- llama_generate(ctx, "2 + 2 =", temp = 0)

# Creative output
result <- llama_generate(ctx,
  prompt = "Write a haiku about R:",
  max_new_tokens = 50L,
  temp = 1.0,
  top_p = 0.95,
  top_k = 40L
)
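For intuition about what temp, top_k and top_p control, the toy function below applies temperature scaling, top-k filtering and top-p (nucleus) filtering to a handful of made-up logits. It is a generic illustration of these sampling ideas in plain R, not the code path llamaR or llama.cpp actually uses; `sample_filter` and the logits are invented for the example.

```r
# Toy illustration on made-up logits; not llamaR's actual sampler.
sample_filter <- function(logits, temp = 1.0, top_k = 40L, top_p = 0.95) {
  if (temp <= 0) {
    # Greedy decoding: all probability on the highest-scoring token
    probs <- numeric(length(logits))
    probs[which.max(logits)] <- 1
    return(probs)
  }
  # Temperature-scaled softmax (subtracting the max for numerical stability)
  scaled <- logits / temp
  probs <- exp(scaled - max(scaled))
  probs <- probs / sum(probs)

  # Top-k: keep the k most likely tokens and renormalise
  keep <- order(probs, decreasing = TRUE)[seq_len(min(top_k, length(probs)))]
  kept <- probs[keep] / sum(probs[keep])

  # Top-p: keep the smallest prefix whose cumulative mass reaches top_p
  n_keep <- which(cumsum(kept) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(keep)
  keep <- keep[seq_len(n_keep)]

  # Final distribution over the surviving tokens
  out <- numeric(length(logits))
  out[keep] <- probs[keep] / sum(probs[keep])
  out
}

toy_logits <- c(2.0, 1.0, 0.0, -1.0)   # made-up scores for four tokens
sample_filter(toy_logits, temp = 0)     # greedy: one token gets probability 1
sample_filter(toy_logits, temp = 1.0, top_k = 2L, top_p = 1)
```

Lower temperatures concentrate probability on the top token, while smaller top_k or top_p values shrink the set of tokens that can be sampled at all.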

model <- llama_load_model("llama-3.2-instruct.gguf", n_gpu_layers = -1L)
ctx <- llama_new_context(model)

# Get template from model
tmpl <- llama_chat_template(model)

# Build conversation
messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user", content = "What is R?")
)

# Apply template
prompt <- llama_chat_apply_template(messages, template = tmpl)

# Generate response
response <- llama_generate(ctx, prompt, max_new_tokens = 200L)
cat(response)
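For a multi-turn conversation, a common pattern is to append each reply to the message list and re-apply the template before the next call. The sketch below only shows the bookkeeping: `add_message` is a hypothetical helper, the assistant text is a placeholder rather than real model output, and the "assistant" role name is assumed to be what the model's template expects.

```r
# Hypothetical helper for building the message list used above.
add_message <- function(messages, role, content) {
  stopifnot(role %in% c("system", "user", "assistant"),
            is.character(content), length(content) == 1L)
  c(messages, list(list(role = role, content = content)))
}

messages <- list()
messages <- add_message(messages, "system", "You are a helpful assistant.")
messages <- add_message(messages, "user", "What is R?")
# After generating, store the reply so that the next prompt includes it
messages <- add_message(messages, "assistant", "R is a language for statistics.")
messages <- add_message(messages, "user", "How do I install a package?")

vapply(messages, function(m) m$role, character(1))
```

The extended list can be passed to llama_chat_apply_template() in the same way as the two-message example.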

Pull tokens one at a time instead of waiting for the full result — useful for live output or feeding a stream. Concatenating every chunk reproduces the llama_generate() result for the same seed.

st <- llama_gen_begin(ctx, "Once upon a time", max_new_tokens = 100L)
repeat {
  chunk <- llama_gen_next(st)   # next piece of text, or NULL when done
  if (is.null(chunk)) break
  cat(chunk)
}
cat(llama_gen_end(st))          # flush any held-back trailing bytes
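When the streamed text is also needed as a single string, the pull loop can be factored into a small collector. The sketch below uses a stand-in stream built from made-up chunks so that it runs without a model; `collect_stream` and `make_mock_stream` are hypothetical names.

```r
# Generic pull loop: `next_fn` returns a chunk or NULL, `end_fn` returns the tail.
collect_stream <- function(next_fn, end_fn,
                           on_chunk = function(x) invisible(NULL)) {
  pieces <- character()
  repeat {
    chunk <- next_fn()
    if (is.null(chunk)) break
    on_chunk(chunk)               # e.g. cat() for live output
    pieces <- c(pieces, chunk)
  }
  paste0(c(pieces, end_fn()), collapse = "")
}

# Stand-in stream so the sketch runs without a model (made-up chunks)
make_mock_stream <- function(chunks) {
  i <- 0L
  list(
    next_fn = function() {
      if (i >= length(chunks)) return(NULL)
      i <<- i + 1L
      chunks[[i]]
    },
    end_fn = function() ""
  )
}

mock <- make_mock_stream(c("Once", " upon", " a", " time"))
text <- collect_stream(mock$next_fn, mock$end_fn)
text
```

With a real stream, the two callbacks would presumably be `function() llama_gen_next(st)` and `function() llama_gen_end(st)`, with `on_chunk = cat` for live output.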

Serve a local model over an OpenAI-compatible HTTP API so any OpenAI client can talk to it. Requires the optional drogonR package (install.packages("drogonR")).

# Blocks, serving GET /v1/models and POST /v1/chat/completions
# (both blocking and stream = true). Default port 11434.
llama_serve_openai("model.gguf", port = 11434L)

Point any OpenAI client at http://127.0.0.1:11434/v1, e.g.:

curl http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","messages":[{"role":"user","content":"Hello"}]}'
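Before pointing a client at the server, it can be handy to check from R that something is answering on the models endpoint. The base-R sketch below does only that: `server_is_up` is a hypothetical helper, it treats any successful read of /v1/models as "up", and it does not validate the response body.

```r
# Hypothetical reachability check using only base R.
server_is_up <- function(base_url = "http://127.0.0.1:11434/v1", timeout = 5) {
  old <- options(timeout = timeout)       # keep a dead host from stalling R
  on.exit(options(old), add = TRUE)

  con <- url(paste0(base_url, "/models"))
  on.exit(close(con), add = TRUE)

  body <- tryCatch(
    suppressWarnings(readLines(con, warn = FALSE)),
    error = function(e) NULL              # refused or timed out
  )
  !is.null(body)
}

server_is_up()
```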

A runnable example lives at inst/examples/serve_openai.R:

# Just serve:  args are <model.gguf> [port] [n_ctx]
Rscript inst/examples/serve_openai.R model.gguf 11434 16384

# Or self-test both endpoints end-to-end (needs callr + curl):
Rscript inst/examples/serve_openai.R model.gguf --selftest

To connect OpenCode, add an OpenAI-compatible provider in opencode.json (see the one in this repo) pointing baseURL at http://127.0.0.1:11434/v1, with the model id matching what /v1/models reports.

chat_llamar() returns an ellmer Chat object backed by a local model, so the whole ellmer / ragnar toolchain works against local inference. Requires the optional ellmer package (and callr when spawning a server).

# Spawn a server for this model and chat with it; the background process is
# tied to the returned object (stop it with chat_llamar_stop(), or let the
# garbage collector clean it up).
chat <- chat_llamar(model_path = "model.gguf")
chat$chat("Why is the sky blue?")
chat_llamar_stop(chat)

# Or connect to a server you already started with llama_serve_openai():
chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1")
chat$chat("Hello!")

It wraps ellmer::chat_vllm(), talking to the server's /v1/chat/completions endpoint.

# Text -> tokens
tokens <- llama_tokenize(ctx, "Hello, world!")

# Tokens -> text
text <- llama_detokenize(ctx, tokens)

# Single text embedding
emb1 <- llama_embeddings(ctx, "machine learning")
emb2 <- llama_embeddings(ctx, "artificial intelligence")

# Cosine similarity
similarity <- sum(emb1 * emb2) / (sqrt(sum(emb1^2)) * sqrt(sum(emb2^2)))
cat("Similarity:", similarity, "\n")

# Batch embeddings (matrix output)
ctx <- llama_new_context(model, n_ctx = 512L, embedding = TRUE)
mat <- llama_embed_batch(ctx, c("hello world", "foo bar", "test"))
# mat is a 3 x n_embd matrix
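The cosine-similarity formula above extends naturally to the batch matrix: normalise each row to unit length, then take the cross-product to get all pairwise similarities at once. The sketch below uses a small made-up matrix in place of real embeddings, and both helper names are hypothetical.

```r
# Same formula as above, wrapped in a function
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# All pairwise similarities between the rows of an embedding matrix
pairwise_cosine <- function(mat) {
  unit <- mat / sqrt(rowSums(mat^2))   # scale each row to unit length
  tcrossprod(unit)                     # unit %*% t(unit)
}

# Made-up 3 x 3 matrix standing in for llama_embed_batch() output
toy <- rbind(c(1, 0, 0), c(0, 2, 0), c(3, 3, 0))
sim <- pairwise_cosine(toy)
round(sim, 3)
```

Entry [i, j] of the result is the cosine similarity between rows i and j, so the diagonal is 1 and the matrix is symmetric.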

Use embed_llamar() as an embedding provider for ragnar:

library(ragnar)

# Create store with local embedding model
store <- ragnar_store_create(
  "my_store",
  embed = embed_llamar(
    model = "nomic-embed-text-v1.5.Q8_0.gguf",
    n_gpu_layers = -1L,
    embedding = TRUE
  )
)

# Insert and retrieve documents as usual
ragnar_store_insert(store, documents)
ragnar_retrieve(store, "search query")

# List available devices
llama_backend_devices()
#>   name     description              type
#> 1 CPU      CPU (threads: 16)        cpu
#> 2 Vulkan0  NVIDIA GeForce RTX 4090  gpu

# Load model on specific device
model <- llama_load_model("model.gguf", n_gpu_layers = -1L, devices = "Vulkan0")

# CPU-only (even if GPU is available)
model <- llama_load_model("model.gguf", devices = "cpu")

# Multi-GPU split
model <- llama_load_model("model.gguf", n_gpu_layers = -1L,
                          devices = c("Vulkan0", "Vulkan1"))
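The device list can also drive the devices argument programmatically. The sketch below assumes that llama_backend_devices() returns a data frame with name, description and type columns, as the printed output above suggests. It works on a mock copy of that output so it runs without the package, and `pick_devices` is a hypothetical helper.

```r
# Mock of the device table shown above (values copied from that example output)
devices <- data.frame(
  name = c("CPU", "Vulkan0"),
  description = c("CPU (threads: 16)", "NVIDIA GeForce RTX 4090"),
  type = c("cpu", "gpu"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of all GPU devices, or "cpu" if there are none
pick_devices <- function(devices) {
  gpus <- devices$name[devices$type == "gpu"]
  if (length(gpus) > 0L) gpus else "cpu"
}

pick_devices(devices)
```

The result could then be passed as `devices = pick_devices(llama_backend_devices())` when loading a model.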

model <- llama_load_model("base-model.gguf")
ctx <- llama_new_context(model)

# Load and apply adapter
lora <- llama_lora_load(model, "fine-tuned.gguf")
llama_lora_apply(ctx, lora, scale = 1.0)

# Generate with LoRA
result <- llama_generate(ctx, "prompt")

# Remove all LoRA adapters
llama_lora_clear(ctx)

# Levels: 0 = silent, 1 = errors only, 2 = normal, 3 = verbose
llama_set_verbosity(0)  # Suppress all output
llama_set_verbosity(3)  # Debug mode

Supports all llama.cpp-compatible architectures (100+), including:

LLaMA 1/2/3
Mistral / Mixtral
Qwen / Qwen2
Gemma / Gemma 2
Phi-2 / Phi-3
DeepSeek
Command-R
and many more

Models must be in GGUF format. Convert models using llama.cpp tools.

License: MIT

Author: Yuri Baramykov

GitHub
Bug Reports
ggmlR - tensor operations dependency
llama.cpp - inference backend


llamaR documentation built on May 28, 2026, 1:06 a.m.
