NEWS.md
In llamaR: Interface for Large Language Models via 'llama.cpp'

llamaR 0.2.4

llama_gen_begin() / llama_gen_next() / llama_gen_end() — token-by-token generation matching llama_generate() output, with valid-UTF-8 chunks.
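
Tokens do not always end on a character boundary, so a streaming API has to hold back any trailing bytes that form an incomplete UTF-8 sequence. The sketch below illustrates that buffering idea in base R. It is not llamaR's implementation, and the helper names utf8_complete_prefix_len() and stream_utf8() are hypothetical.

```r
# Length of the longest prefix of `buf` (a raw vector) that does not end in
# the middle of a multi-byte UTF-8 sequence.
utf8_complete_prefix_len <- function(buf) {
  n <- length(buf)
  if (n == 0L) return(0L)
  b <- as.integer(buf)
  # Walk back over at most 3 continuation bytes (10xxxxxx) to the lead byte.
  i <- n
  while (i > 1L && n - i < 3L && bitwAnd(b[i], 0xC0L) == 0x80L) i <- i - 1L
  lead <- b[i]
  need <- if (lead < 0x80L) {
    1L                                   # ASCII
  } else if (bitwAnd(lead, 0xE0L) == 0xC0L) {
    2L                                   # 110xxxxx
  } else if (bitwAnd(lead, 0xF0L) == 0xE0L) {
    3L                                   # 1110xxxx
  } else if (bitwAnd(lead, 0xF8L) == 0xF0L) {
    4L                                   # 11110xxx
  } else {
    1L                                   # invalid byte: pass it through
  }
  have <- n - i + 1L
  if (have < need) i - 1L else n
}

# Feed byte pieces one at a time and emit only complete UTF-8 chunks.
stream_utf8 <- function(pieces) {
  pending <- raw(0)
  out <- character(0)
  for (p in pieces) {
    pending <- c(pending, p)
    k <- utf8_complete_prefix_len(pending)
    if (k > 0L) {
      chunk <- rawToChar(pending[seq_len(k)])
      Encoding(chunk) <- "UTF-8"
      out <- c(out, chunk)
      pending <- pending[-seq_len(k)]
    }
  }
  out
}

# "h", then the two bytes of e-acute split across pieces, then "!".
bytes  <- as.raw(c(0x68, 0xc3, 0xa9, 0x21))
pieces <- list(bytes[1:2], bytes[3:4])
chunks <- stream_utf8(pieces)
```

Here the first piece yields only "h"; the dangling lead byte 0xc3 is held until its continuation byte arrives with the second piece.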

llama_serve_openai() — serve a local GGUF model over an OpenAI-compatible HTTP API (/v1/models, /v1/chat/completions, streaming and blocking) via the optional drogonR package.
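
For orientation, a client of an OpenAI-compatible /v1/chat/completions endpoint sends a JSON body shaped like the one built below. This is a sketch of the request format only, using jsonlite (already in the package's Imports). The model id "local-model" is a placeholder, and the host and port of the server are not stated in these notes, so no request is actually sent.

```r
library(jsonlite)

# Standard chat-completions request fields. "local-model" is a hypothetical
# id; the /v1/models endpoint would normally list the actual one.
body <- list(
  model = "local-model",
  messages = list(
    list(role = "system", content = "You are a concise assistant."),
    list(role = "user", content = "Summarize GGUF in one sentence.")
  ),
  stream = FALSE  # TRUE requests a streamed response instead of a blocking one
)

# auto_unbox = TRUE encodes length-1 vectors as JSON scalars, not arrays.
payload <- toJSON(body, auto_unbox = TRUE)
cat(as.character(payload), "\n")
```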

chat_llamar() — returns an ellmer::Chat backed by a local model, connecting to a running server (base_url=) or spawning one (model_path=); chat_llamar_stop() stops a spawned server.

Long prompts no longer abort: prefill is now split into llama_n_batch()-sized chunks. Previously, a prompt longer than the batch size tripped GGML_ASSERT(n_tokens_all <= cparams.n_batch).
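
The chunking idea can be sketched in base R as follows. This only illustrates splitting a token vector into batch-sized pieces; the actual prefill happens in the package's C++ code, and split_into_batches() is a hypothetical helper, not a llamaR function.

```r
# Split a vector of token IDs into consecutive chunks of at most n_batch.
split_into_batches <- function(tokens, n_batch) {
  stopifnot(n_batch >= 1L)
  if (length(tokens) == 0L) return(list())
  starts <- seq.int(1L, length(tokens), by = n_batch)
  lapply(starts, function(s) {
    tokens[s:min(s + n_batch - 1L, length(tokens))]
  })
}

# Ten made-up token IDs with a batch size of 4 give chunks of 4, 4 and 2.
tokens  <- 101:110
batches <- split_into_batches(tokens, n_batch = 4L)
lengths(batches)
```

Each chunk would then be decoded in order, so no single decode call sees more tokens than the batch size allows.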

llamaR 0.2.3

llama_n_ctx_seq() — per-sequence context window size.
llama_n_batch() — logical batch size (max tokens per llama_decode call).
llama_n_ubatch() — physical micro-batch size.
llama_n_seq_max() — maximum number of concurrent sequences.
llama_n_threads() / llama_n_threads_batch() — read back thread counts set via llama_set_threads().
llama_pooling_type() — pooling type of the context as a string ("none", "mean", "cls", "last", "rank").

Fixed macOS compilation error: removed fflush macro from r_llama_compat.h that broke std::fflush in <fstream> (Apple clang / libc++).

llama_get_logits_ith() — logit vector for a specific token position in the last decoded batch. Supports negative indexing (-1 = last token).

llamaR 0.2.2

embed_llamar() — high-level embedding provider compatible with ragnar_store_create(embed = ...). Supports partial application (lazy model loading), direct call returning a matrix, and data.frame input. L2 normalization on by default.
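
As an illustration of what L2 normalization does to an embedding matrix, here is a base R sketch with one embedding per row. It is not the package's code, and l2_normalize() is a hypothetical helper.

```r
# Scale each row of a numeric matrix to unit Euclidean length.
l2_normalize <- function(m) {
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1  # leave all-zero rows unchanged instead of dividing by 0
  m / norms               # recycling divides row i by norms[i]
}

emb <- rbind(c(3, 4), c(0, 0), c(1, 0))
unit <- l2_normalize(emb)

# With unit-length rows, cosine similarity reduces to a plain dot product.
sims <- tcrossprod(unit)
```

Normalizing up front is a common choice for retrieval, since it lets a store compare vectors with a single matrix product.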

llama_embed_batch() — embed multiple texts in one call. Uses true pooled batch decode (llama_get_embeddings_seq) for embedding models, with automatic fallback to sequential last-token decode for generative models.
llama_get_embeddings_ith() — get embedding vector for the i-th token (supports negative indexing).
llama_get_embeddings_seq() — get pooled embedding for a sequence ID.

llama_new_context() gains an embedding parameter. When TRUE, it sets cparams.embeddings = true and disables causal attention at creation time. llama_embed_batch() uses this flag to choose the optimal code path.

llama_load_model() gains a devices parameter for explicit backend selection. It accepts device names from llama_backend_devices(), type keywords ("cpu", "gpu"), or numeric indices. Passing multiple devices enables a multi-GPU split.
llama_backend_devices() — list all available compute devices (CPU, GPU, iGPU, accelerator) as a data.frame.

llama_numa_init() — NUMA optimization with strategies: disabled, distribute, isolate, numactl, mirror.
llama_time_us() — current time in microseconds.

40+ new test blocks covering all new functions.
Total: 143 passing, 4 expected skips.

llamaR 0.2.1

llama_token_to_piece() — convert a single token ID to its text piece.
llama_encode() — run the encoder pass for encoder-decoder models (e.g. T5, BART).
llama_batch_init() / llama_batch_free() — low-level batch allocation and release with automatic GC finalizer.

Fixed compilation failure on macOS with Apple clang 17 / Xcode 16.4: removed extern "C" block wrapping #include <R.h> in r_llama_compat.h (C++ templates cannot appear inside extern "C" linkage).
Fixed macro conflict between Rinternals.h #define length(x) and std::codecvt::length() in r_llama_interface.cpp: C++ standard headers are now included before R headers, followed by #undef length.

Added 9 new test blocks covering llama_token_to_piece, llama_batch_init, llama_batch_free, and llama_encode, including GPU context variants.
Total: 103 passing, 4 expected skips.

llamaR 0.2.0

llama_hf_list() — list GGUF files in a Hugging Face repository.
llama_hf_download() — download a GGUF model with local caching. Supports exact filename, glob pattern, or Ollama-style tag selection.
llama_load_model_hf() — download and load a model in one step.
llama_hf_cache_dir() — get the cache directory path.
llama_hf_cache_info() — inspect cached models.
llama_hf_cache_clear() — clear the model cache.

Added jsonlite and utils to Imports.

llamaR 0.1.3

Added Vulkan linking support to configure.win and Makevars.win.in.
Windows builds now link with Vulkan when ggmlR is built with GPU support.

Added exit() / _Exit() overrides to r_llama_compat.h to prevent process termination (redirects to Rf_error()).

Requires ggmlR >= 0.5.4.
Bumped minimum R version to 4.1.0 (matches ggmlR).

Updated description to mention Vulkan GPU support via ggmlR.

llamaR 0.1.2

Expanded all acronyms in DESCRIPTION (LLMs, GPU).
Added detailed \value tags to all exported functions describing return class, structure, and meaning.
Replaced \dontrun{} with \donttest{} in all examples.

Added Georgi Gerganov as copyright holder (cph) for bundled 'llama.cpp' code.

Included NEWS.md in the package tarball (removed from .Rbuildignore).
Created cran-comments.md.
Cleaned up duplicate entries in .Rbuildignore.

llamaR 0.1.1

Full LLM inference cycle is now available from R:

llama_load_model() / llama_free_model() — load and free GGUF models
llama_new_context() / llama_free_context() — context management
llama_tokenize() / llama_detokenize() — tokenization and detokenization
llama_generate() — text generation with temperature, top_k, top_p, greedy support
llama_embeddings() — embedding extraction
llama_model_info() — model metadata

Model and context are wrapped as ExternalPtr with automatic GC finalizers. The context holds a reference to the model ExternalPtr, preventing premature collection.

llama_generate() runs the full pipeline in a single C++ call: prompt tokenization → encode → autoregressive decode loop with a sampler chain → detokenization of generated tokens.
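
To make the sampling step concrete, the following base R sketch picks one token from a logit vector using top_k, top_p, temperature, and a greedy fallback. It is an illustration under assumptions, not the package's sampler: the order of the filters, the default values, and the name sample_token() are all hypothetical.

```r
# Pick one token index (1-based position in `logits`).
sample_token <- function(logits, temperature = 0.8, top_k = 40L, top_p = 0.9) {
  # Greedy decoding: always take the highest-scoring token.
  if (temperature <= 0) return(which.max(logits))

  # top_k: keep the k highest logits.
  ord  <- order(logits, decreasing = TRUE)
  keep <- ord[seq_len(min(top_k, length(ord)))]

  # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
  probs <- exp(logits[keep] - max(logits[keep]))
  probs <- probs / sum(probs)
  n_keep <- which(cumsum(probs) >= top_p)[1L]
  if (is.na(n_keep)) n_keep <- length(keep)  # guard against rounding error
  keep <- keep[seq_len(n_keep)]

  # Temperature: rescale the surviving logits, then sample from the softmax.
  scaled <- logits[keep] / temperature
  p <- exp(scaled - max(scaled))
  p <- p / sum(p)
  keep[sample.int(length(keep), 1L, prob = p)]
}

logits <- c(1.0, 5.0, 2.0, 0.5)  # made-up scores for a 4-token vocabulary
greedy <- sample_token(logits, temperature = 0)  # index 2, the largest logit

set.seed(1)
drawn <- sample_token(logits)
```

In an autoregressive loop, the chosen token would be appended to the sequence and decoded to produce the next logit vector, until an end-of-sequence token or a length limit is reached.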

19 assertions across 7 test blocks, all passing.

llamaR 0.1.0

Basic package structure with llama.cpp integration
Links against libggml.a from ggmlR package
Includes all llama.cpp model implementations (~100 architectures)
Vulkan GPU support (optional)

Requires ggmlR >= 0.5.1 for static library export

ggml_build_forward_select replaced with simplified branch selection

llamaR documentation built on May 28, 2026, 1:06 a.m.
