llm-join is an R package designed to use Large Language Models (LLMs, such as GPT-5, Claude, DeepSeek, etc.) for fuzzy joining of dataframes. When the key columns of two dataframes have spelling differences, are in different languages, or cannot be matched exactly, llm-join can automatically generate prompts and utilize LLMs to assist in high-quality joining.
You can install the development version of llmjoin from GitHub with:
devtools::install_github("evanliu3594/llmjoin")
Please note that all information is stored strictly locally in your system configuration (run
tools::R_user_dir("llmjoin", "config")to see the full path), and is never uploaded or shared.
library(llmjoin)
# OpenAI
set_llm(provider = "openai", key = "your-api-key")
# Claude (Anthropic)
set_llm(provider = "claude", key = "your-api-key")
# Gemini (via OpenAI-compatible endpoint)
set_llm(provider = "gemini", key = "your-api-key")
# Custom endpoint (Ollama, DeepSeek, Kimi, Grok, Qwen, etc.)
set_llm(provider = "openai",
url = "https://your-custom-endpoint/v1/chat/completions",
key = "your-api-key",
model = "your-model-name")
Below examples used Deepseek-V4-Flash.
Match numeric month codes to month names — the LLM understands that "01" means January.
x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40))
y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400))
llm_join(x, y, key1 = "id", key2 = "month")
# month id value amount
# 1 Feb 02 20 200
# 2 January 01 10 100
# 3 <NA> 04 40 NA
Match approximate or differently-formatted numeric identifiers — the LLM handles rounding, unit conversion, and format differences.
left <- data.frame(
weight_kg = c(1.0, 2.5, 5.0),
product = c("Widget", "Gadget", "Thing")
)
right <- data.frame(
weight_lb = c("2.2 lb", "5.5 lb", "11 lb"),
price = c(4.99, 9.99, 19.99)
)
llm_join(left, right, key1 = "weight_kg", key2 = "weight_lb")
# weight_lb weight_kg product price
# 1 11 lb 5.0 Thing 19.99
# 2 2.2 lb 1.0 Widget 4.99
# 3 5.5 lb 2.5 Gadget 9.99
Match country names to ISO codes — the LLM bridges different naming conventions, languages, and abbreviations.
left <- data.frame(
country = c("China", "United States", "Germany", "日本"),
sales = c(1500, 3200, 2100, 800)
)
right <- data.frame(
code = c("CN", "US", "DE", "JP"),
region = c("Asia", "Americas", "Europe", "Asia")
)
llm_join(left, right, key1 = "country", key2 = "code")
# code country sales region
# 1 CN China 1500 Asia
# 2 DE Germany 2100 Europe
# 3 JP 日本 800 Asia
# 4 US United States 3200 Americas
x <- data.frame(id = c("01", "02", "04"), value = c(10, 20, 40))
y <- data.frame(month = c("January", "Feb", "May"), amount = c(100, 200, 400))
joint_prompt(unique(x["id"]), unique(y["month"])) |> writeClipboard()
Paste the prompts to ask your LLM model, and copy the answer, going back to R and continue run:
joint <- parse_joint(readr::clipboard(), key1 = "id", key2 = "month")
Reduce(\(x, y) merge(x, y, all.x = TRUE), list(x, joint, y))
# month id value amount
# 1 Feb 02 20 200
# 2 January 01 10 100
# 3 <NA> 04 40 NA
MIT License
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.