- llama_gen_begin() / llama_gen_next() / llama_gen_end() — token-by-token generation matching llama_generate() output, with valid-UTF-8 chunks.
- llama_serve_openai() — serve a local GGUF model over an OpenAI-compatible HTTP API (/v1/models, /v1/chat/completions, streaming and blocking) via the optional drogonR package.
- chat_llamar() — returns an ellmer::Chat backed by a local model, connecting to a running server (base_url=) or spawning one (model_path=); chat_llamar_stop() stops a spawned server.
- llama_n_batch()-sized chunks (was GGML_ASSERT(n_tokens_all <= cparams.n_batch)).
- llama_n_ctx_seq() — per-sequence context window size.
- llama_n_batch() — logical batch size (max tokens per llama_decode call).
- llama_n_ubatch() — physical micro-batch size.
- llama_n_seq_max() — maximum number of concurrent sequences.
- llama_n_threads() / llama_n_threads_batch() — read back thread counts set via llama_set_threads().
- llama_pooling_type() — pooling type of the context as a string ("none", "mean", "cls", "last", "rank").
- fflush macro from r_llama_compat.h that broke std::fflush in <fstream> (Apple clang / libc++).
- llama_get_logits_ith() — logit vector for a specific token position in the last decoded batch. Supports negative indexing (-1 = last token).
- embed_llamar() — high-level embedding provider compatible with ragnar_store_create(embed = ...). Supports partial application (lazy model loading), direct call returning a matrix, and data.frame input. L2 normalization on by default.
- llama_embed_batch() — embed multiple texts in one call. Uses true pooled batch decode (llama_get_embeddings_seq) for embedding models, with automatic fallback to sequential last-token decode for generative models.
- llama_get_embeddings_ith() — get embedding vector for the i-th token (supports negative indexing).
- llama_get_embeddings_seq() — get pooled embedding for a sequence ID.
- llama_new_context() gains embedding parameter. When TRUE, sets cparams.embeddings = true and disables causal attention at creation time.
  llama_embed_batch() uses this flag to choose the optimal code path.
- llama_load_model() gains devices parameter for explicit backend selection. Accepts device names from llama_backend_devices(), type keywords ("cpu", "gpu"), or numeric indices. Multiple devices enable multi-GPU split.
- llama_backend_devices() — list all available compute devices (CPU, GPU, iGPU, accelerator) as a data.frame.
- llama_numa_init() — NUMA optimization with strategies: disabled, distribute, isolate, numactl, mirror.
- llama_time_us() — current time in microseconds.
- llama_token_to_piece() — convert a single token ID to its text piece.
- llama_encode() — run the encoder pass for encoder-decoder models (e.g. T5, BART).
- llama_batch_init() / llama_batch_free() — low-level batch allocation and release with automatic GC finalizer.
- extern "C" block wrapping #include <R.h> in r_llama_compat.h (C++ templates cannot appear inside extern "C" linkage).
- Rinternals.h #define length(x) and std::codecvt::length() in r_llama_interface.cpp: C++ standard headers are now included before R headers, followed by #undef length.
- llama_token_to_piece, llama_batch_init, llama_batch_free, and llama_encode, including GPU context variants.
- llama_hf_list() — list GGUF files in a Hugging Face repository.
- llama_hf_download() — download a GGUF model with local caching.
  Supports exact filename, glob pattern, or Ollama-style tag selection.
- llama_load_model_hf() — download and load a model in one step.
- llama_hf_cache_dir() — get the cache directory path.
- llama_hf_cache_info() — inspect cached models.
- llama_hf_cache_clear() — clear the model cache.
- jsonlite and utils to Imports.
- configure.win and Makevars.win.in.
- ggmlR is built with GPU support.
- exit() / _Exit() overrides to r_llama_compat.h to prevent process termination (redirects to Rf_error()).
- ggmlR >= 0.5.4.
- ggmlR).
- ggmlR.
- \value tags to all exported functions describing return class, structure, and meaning.
- \dontrun{} with \donttest{} in all examples.
- cph) for bundled 'llama.cpp' code.
- NEWS.md in the package tarball (removed from .Rbuildignore).
- cran-comments.md.
- .Rbuildignore.

Full LLM inference cycle is now available from R:
- llama_load_model() / llama_free_model() — load and free GGUF models
- llama_new_context() / llama_free_context() — context management
- llama_tokenize() / llama_detokenize() — tokenization and detokenization
- llama_generate() — text generation with temperature, top_k, top_p, greedy support
- llama_embeddings() — embedding extraction
- llama_model_info() — model metadata

Model and context are wrapped as ExternalPtr with automatic GC finalizers. The context holds a reference to the model ExternalPtr, preventing premature collection.
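
The ownership rule described above (a context keeps its model alive) can be illustrated without any C++ code. The following is a minimal sketch in plain R that uses environments with reg.finalizer() as stand-ins for ExternalPtr objects. The helpers make_finalizer(), new_model_handle() and new_context_handle() are hypothetical names written for this illustration, not llamaR functions, and the package's real handles are created in C++.

```r
# Record of which stand-in handles have been finalized.
events <- new.env()
events$freed <- character()

# Build a finalizer that only captures a label and the event record,
# so the finalizer itself never references the handle it belongs to.
make_finalizer <- function(label, events) {
  function(e) events$freed <- c(events$freed, label)
}

# Stand-in for a model handle (an ExternalPtr in the real package).
new_model_handle <- function(path, events) {
  h <- new.env()
  h$path <- path
  reg.finalizer(h, make_finalizer("model", events))
  h
}

# Stand-in for a context handle. Storing the model inside the context
# keeps the model reachable for as long as the context is reachable.
new_context_handle <- function(model, events) {
  h <- new.env()
  h$model <- model
  reg.finalizer(h, make_finalizer("context", events))
  h
}

model <- new_model_handle("model.gguf", events)
ctx   <- new_context_handle(model, events)

# Dropping the user's own reference does not orphan the model,
# because the context still points at it.
rm(model)
invisible(gc())
model_still_alive <- is.environment(ctx$model) && !("model" %in% events$freed)
```

In this sketch the model's finalizer cannot run while ctx exists, which mirrors the reason given above for storing the model reference inside the context.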
llama_generate() runs the full pipeline in a single C++ call: prompt
tokenization → encode → autoregressive decode loop with a sampler chain →
detokenization of generated tokens.
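
The sampler chain step in the pipeline above can be sketched in plain R. This is an illustration under assumptions, not llamaR's internal C++ code: sample_next_token() is a hypothetical helper, and the order used here (top-k, then temperature, then top-p, then a random draw, with greedy selection when the temperature is 0) is one common arrangement that the package may or may not follow exactly.

```r
# Pick the next token from a vector of logits.
# Returns a 1-based position in `logits` (llama.cpp token IDs are 0-based).
sample_next_token <- function(logits, temp = 0.8, top_k = 40L, top_p = 0.9) {
  # Greedy decoding: temperature 0 means "always take the arg-max".
  if (temp <= 0) return(which.max(logits))

  # Top-k: keep only the k highest-scoring candidates (at least one).
  k <- max(1L, min(as.integer(top_k), length(logits)))
  keep <- order(logits, decreasing = TRUE)[seq_len(k)]

  # Temperature-scaled softmax over the survivors (shifted for stability).
  scaled <- logits[keep] / temp
  probs <- exp(scaled - max(scaled))
  probs <- probs / sum(probs)

  # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
  n_keep <- which(cumsum(probs) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(probs)
  keep <- keep[seq_len(n_keep)]
  probs <- probs[seq_len(n_keep)] / sum(probs[seq_len(n_keep)])

  # Draw one candidate from the filtered distribution.
  keep[sample.int(length(keep), size = 1L, prob = probs)]
}

logits <- c(0.1, 2.5, -1.0, 2.4, 0.3)
greedy_pick <- sample_next_token(logits, temp = 0)
set.seed(1)
sampled_pick <- sample_next_token(logits, temp = 0.7, top_k = 2L)
```

With these toy logits, the greedy call selects position 2 (the largest logit), while the sampled call can only return position 2 or 4, because top_k = 2 discards every other candidate before the draw.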
19 assertions across 7 test blocks, all passing.
- libggml.a from ggmlR package
- ggml_build_forward_select replaced with simplified branch selection
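
The embed_llamar() entry earlier in this changelog notes that L2 normalization is on by default. As a rough illustration of what that means for an embedding matrix with one row per text, here is a minimal sketch in plain R. The helper l2_normalize() and its eps guard are assumptions made for this example; the package's own implementation may differ, for instance in how it treats zero vectors.

```r
# Scale each row of an embedding matrix to unit Euclidean length.
l2_normalize <- function(x, eps = 1e-12) {
  x <- as.matrix(x)
  norms <- sqrt(rowSums(x^2))
  # Dividing a matrix by a vector of length nrow(x) recycles down the
  # columns, so every element of row i is divided by norms[i].
  x / pmax(norms, eps)
}

emb <- rbind(c(3, 4), c(1, 0), c(0, 0))
unit <- l2_normalize(emb)

# With unit-length rows, cosine similarity is a plain matrix product.
cosine <- unit %*% t(unit)
```

Normalizing once at embedding time is convenient for retrieval, since similarity search then reduces to dot products between stored rows and the query vector.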