# Every chunk needs a GGUF model (and usually a GPU), so this vignette is
# static: the code is shown but not run at build time.
knitr::opts_chunk$set(eval = FALSE, purl = FALSE)
llamaR provides R bindings to llama.cpp
for running Large Language Models locally, with optional Vulkan GPU acceleration
via ggmlR. This vignette walks through the
core workflow: get a model, load it, generate text, tokenize, and extract
embeddings. For the chat/server side see vignette("chat-and-agents").
library(llamaR)
llamaR works with GGUF files. Download one from the Hugging Face Hub (cached
under ~/.cache/llamaR/ by default):
# List the GGUF files in a repo
llama_hf_list("TheBloke/Mistral-7B-Instruct-v0.2-GGUF")

# Download one (by filename or by quantization pattern)
path <- llama_hf_download(
  "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
  pattern = "Q4_K_M"
)
Or point at any GGUF file you already have on disk.
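Before loading a file from disk, it can help to confirm that it really is a GGUF file. The sketch below relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`. `is_gguf()` is a hypothetical helper written for this example, not a llamaR function, and it checks nothing beyond the magic bytes.

```r
# Hypothetical helper (not part of llamaR): TRUE if `path` exists and
# starts with the 4-byte GGUF magic.
is_gguf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  con <- file(path, open = "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 4L)
  identical(magic, charToRaw("GGUF"))
}

# Demonstration on a throwaway file that only contains the magic bytes
tmp <- tempfile(fileext = ".gguf")
writeBin(charToRaw("GGUF"), tmp)
is_gguf(tmp)
```

A check like this catches truncated downloads or HTML error pages saved under a `.gguf` name before they reach the loader.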
A model holds the weights; a context holds the working state (KV cache) for one generation session. Both are external pointers with GC finalizers, so explicit freeing is optional.
model <- llama_load_model(path, n_gpu_layers = -1L)  # -1 = offload all layers
ctx <- llama_new_context(model, n_ctx = 4096L)

llama_model_info(model)  # size, n_params, context length, heads, ...
n_gpu_layers = -1L offloads every layer to the GPU when Vulkan is available,
and falls back to CPU otherwise.
llama_generate(ctx, "The capital of France is", max_new_tokens = 32L)
Sampling is controlled by arguments (set temp = 0 for greedy decoding):
llama_generate(
  ctx,
  "Write a haiku about autumn.",
  max_new_tokens = 64L,
  temp = 0.7,
  top_p = 0.9,
  top_k = 40L,
  repeat_penalty = 1.1
)
Pass with_timings = TRUE to get token throughput alongside the text.
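To build intuition for what `temp`, `top_k` and `top_p` do, here is a toy sketch in plain R that applies them to a small vector of logits. `sample_filter()` is a hypothetical helper for illustration. It assumes the common order of top-k first, then top-p on the renormalized probabilities, and is not a description of how llamaR or llama.cpp implement sampling internally.

```r
# Hypothetical helper: turn logits into the filtered probability vector
# that a token would be sampled from.
sample_filter <- function(logits, temp = 0.7, top_k = 40L, top_p = 0.9) {
  if (temp <= 0) {
    # Greedy decoding: all probability mass on the highest logit
    p <- numeric(length(logits))
    p[which.max(logits)] <- 1
    return(p)
  }

  # Temperature-scaled softmax (subtract the max for numerical stability)
  scaled <- logits / temp
  p <- exp(scaled - max(scaled))
  p <- p / sum(p)

  # top-k: keep the k most probable tokens, then renormalize
  keep <- order(p, decreasing = TRUE)[seq_len(min(top_k, length(p)))]
  pk <- p[keep] / sum(p[keep])

  # top-p: keep the smallest prefix whose cumulative mass reaches top_p
  n_keep <- which(cumsum(pk) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(keep)
  keep <- keep[seq_len(n_keep)]

  out <- numeric(length(p))
  out[keep] <- p[keep] / sum(p[keep])
  out
}

logits <- c(2, 1, 0, -1)
sample_filter(logits, temp = 0)                        # greedy
sample_filter(logits, temp = 1, top_k = 2L, top_p = 1) # two candidates remain
```

Lower temperatures sharpen the distribution, while smaller `top_k` or `top_p` values shrink the set of candidate tokens.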
Instruction-tuned models expect their prompt wrapped in a chat template
([INST]…[/INST], <|im_start|>…, etc.). llama_chat_apply_template() builds
that prompt from a list of role/content messages:
messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user", content = "Name three primary colors.")
)
prompt <- llama_chat_apply_template(messages)  # uses the model's built-in template
llama_generate(ctx, prompt, max_new_tokens = 64L)
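To make the idea of a chat template concrete, the following toy function formats the same kind of message list in the ChatML style (the `<|im_start|>` family mentioned above). `format_chatml()` is a hypothetical helper for illustration only. Each model ships its own template, so in practice the prompt should come from llama_chat_apply_template().

```r
# Hypothetical helper: wrap role/content messages in ChatML markers.
format_chatml <- function(messages, add_assistant = TRUE) {
  turns <- vapply(
    messages,
    function(m) sprintf("<|im_start|>%s\n%s<|im_end|>\n", m$role, m$content),
    character(1)
  )
  prompt <- paste(turns, collapse = "")
  # Open an assistant turn so the model knows it should answer next
  if (add_assistant) prompt <- paste0(prompt, "<|im_start|>assistant\n")
  prompt
}

messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user", content = "Name three primary colors.")
)
cat(format_chatml(messages))
```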
For multi-turn chat with history management, use chat_llamar() instead — see
vignette("chat-and-agents").
tokens <- llama_tokenize(ctx, "Hello, world!")
tokens
llama_detokenize(ctx, tokens)
When tokenizing a prompt that already contains role markers from a chat
template, set parse_special = TRUE so markers like [INST] become single
control tokens rather than literal characters:
prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi")))
llama_tokenize(ctx, prompt, parse_special = TRUE)
Create the context in embedding mode, then extract vectors. Single text:
emb_model <- llama_load_model("embedding-model.gguf")
emb_ctx <- llama_new_context(emb_model, embedding = TRUE)

v <- llama_embeddings(emb_ctx, "The quick brown fox")
length(v)
A batch of texts in one call:
m <- llama_embed_batch(emb_ctx, c("first text", "second text", "third text"))
dim(m)  # one row per input
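A common next step with a batch of embeddings is to compare the texts pairwise. The sketch below computes cosine similarity between the rows of a matrix laid out with one row per input, as described above. `cosine_sim()` is a hypothetical helper, and the matrix is a made-up stand-in so the example runs without a model.

```r
# Hypothetical helper: cosine similarity between all pairs of rows.
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  unit <- m / norms   # recycling divides each row by its own norm
  tcrossprod(unit)    # unit %*% t(unit)
}

# Stand-in for llama_embed_batch() output: 3 texts, 4 dimensions (made-up values)
m <- rbind(
  c(1, 0, 0, 0),
  c(1, 1, 0, 0),
  c(0, 0, 2, 0)
)
sim <- cosine_sim(m)
round(sim, 3)
```

Values near 1 indicate texts pointing in nearly the same direction in embedding space, and values near 0 indicate unrelated texts.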
embed_llamar() is a higher-level helper that loads the model for you and
returns a provider suitable for ragnar_store_create(embed = ...). Called with
a model only, it returns a closure (partial application); called with text, it
returns a matrix.
library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1L)
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)
ragnar_retrieve(store, "search query")
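The partial-application convention described for embed_llamar() can be sketched with a toy embedder in plain R. `embed_toy()` and its `scale` argument are hypothetical and unrelated to llamaR's internals. The function only mimics the calling pattern: without text it returns a closure, with text it returns a matrix with one row per input.

```r
# Hypothetical toy embedder illustrating the calling convention only.
embed_toy <- function(x = NULL, scale = 1) {
  if (is.null(x)) {
    # No text supplied: return a function that remembers `scale`
    return(function(x) embed_toy(x, scale = scale))
  }
  # Text supplied: one row per input, built from simple character statistics
  rows <- lapply(x, function(s) {
    codes <- utf8ToInt(s)
    scale * c(length(codes), mean(codes))
  })
  do.call(rbind, rows)
}

embed <- embed_toy(scale = 0.5)  # closure, suitable for an `embed =` argument
embed(c("a", "bc"))              # matrix with two rows
```

Returning a closure lets the store call the embedder later on whatever text it needs, with the configuration already fixed.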
Combine this with a local chat_llamar() for a fully local RAG stack — see
vignette("chat-and-agents").
To talk to a model over HTTP, or to use it through the ellmer/ragnar toolchain,
see vignette("chat-and-agents"):
- llama_serve_openai(): OpenAI-compatible HTTP server.
- chat_llamar(): an ellmer::Chat backed by a local model.
- vignette("chat-and-agents"): server, ellmer, ragnar, OpenCode.
- ?llama_generate, ?llama_chat_apply_template, ?embed_llamar