JSON output vs. schema-validated output in LLMR

knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true")
)

TL;DR


What the major providers actually support


Why prefer schema output?

Why JSON-only still matters


Quirks you will hit in practice

LLMR helpers to blunt those edges


Minimal patterns (guarded code)

All chunks use a tiny helper so your document knits even without API keys.

safe <- function(expr) tryCatch(expr, error = function(e) {message("ERROR: ", e$message); NULL})

1) JSON mode, no schema (works across OpenAI-compatible providers)

safe({
  library(LLMR)
  cfg <- llm_config(
    provider = "openai",                # try "groq" or "together" too
    model    = "gpt-4o-mini",
    temperature = 0
  )

  # Flip JSON mode on (OpenAI-compat shape)
  cfg_json <- enable_structured_output(cfg, schema = NULL)

  res    <- call_llm(cfg_json, 'Give me a JSON object {"ok": true, "n": 3}.')
  parsed <- llm_parse_structured(res)

  cat("Raw text:\n", as.character(res), "\n\n")
  str(parsed)
})

What could still fail? Proxies labeled “OpenAI-compatible” sometimes accept the response_format field but don’t strictly enforce it, so the model may still wrap its JSON in fences or surrounding prose; LLMR’s parser recovers from fences and pre/post text.


2) Schema mode that actually works (Groq + Qwen, open-weights / non-commercial friendly)

Groq serves Qwen 2.5 Instruct models with OpenAI-compatible APIs. Their Structured Outputs feature enforces JSON Schema and (notably) expects all properties to be listed under required.

safe({
  library(LLMR); library(dplyr)

  # Schema: make every property required to satisfy Groq's stricter check
  schema <- list(
    type = "object",
    additionalProperties = FALSE,
    properties = list(
      title = list(type = "string"),
      year  = list(type = "integer"),
      tags  = list(type = "array", items = list(type = "string"))
    ),
    required = list("title","year","tags")
  )

  cfg <- llm_config(
    provider = "groq",
    model    = "qwen-2.5-72b-instruct",   # a Qwen Instruct model on Groq
    temperature = 0
  )
  cfg_strict <- enable_structured_output(cfg, schema = schema, strict = TRUE)

  df  <- tibble(x = c("BERT paper", "Vision Transformers"))
  out <- llm_fn_structured(
    df,
    prompt   = "Return JSON about '{x}' with fields title, year, tags.",
    .config  = cfg_strict,
    .schema  = schema,          # send schema to provider
    .fields  = c("title","year","tags"),
    .validate_local = TRUE
  )

  out %>% select(structured_ok, structured_valid, title, year, tags) %>% print(n = Inf)
})

If your key is set, you should see structured_ok = TRUE, structured_valid = TRUE, plus parsed columns. (Tip: if you see a 400 complaining about required, add all properties to required, as above.)


3) Anthropic: force a schema via a tool (may require max_tokens)

safe({
  library(LLMR)
  schema <- list(
    type="object",
    properties=list(answer=list(type="string"), confidence=list(type="number")),
    required=list("answer","confidence"),
    additionalProperties=FALSE
  )

  cfg <- llm_config("anthropic","claude-3-7", temperature = 0)
  cfg <- enable_structured_output(cfg, schema = schema, name = "llmr_schema")

  res <- call_llm(cfg, c(
    system = "Return only the tool result that matches the schema.",
    user   = "Answer: capital of Japan; include confidence in [0,1]."
  ))

  parsed <- llm_parse_structured(res)
  str(parsed)
})

Anthropic requires max_tokens; LLMR warns and defaults if you omit it.
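
If you prefer an explicit token budget over the warn-and-default path, set max_tokens yourself. A minimal sketch, assuming max_tokens is passed through llm_config to the Anthropic request (check your LLMR version's llm_config arguments):

safe({
  library(LLMR)
  # Explicit max_tokens avoids LLMR's warn-and-default behavior on Anthropic
  cfg <- llm_config("anthropic", "claude-3-7",
                    temperature = 0, max_tokens = 1024)
  res <- call_llm(cfg, "One-sentence definition of JSON Schema.")
  cat(as.character(res), "\n")
})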


4) Gemini: JSON response (plus optional response schema on supported models)

safe({
  library(LLMR)

  cfg <- llm_config(
    "gemini", "gemini-2.0-flash",
    response_mime_type = "application/json"  # ask for JSON back
    # Optionally: gemini_enable_response_schema = TRUE, response_schema = <your JSON Schema>
  )

  res <- call_llm(cfg, c(
    system = "Reply as JSON only.",
    user   = "Produce fields name and score about 'MNIST'."
  ))
  str(llm_parse_structured(res))
})
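
On models that support it, you can also send a response schema so Gemini constrains the output shape, not just the MIME type. A sketch using the optional parameters named in the chunk comment above (gemini_enable_response_schema and response_schema; treat both names as assumptions for your LLMR version):

safe({
  library(LLMR)

  # JSON Schema describing the fields we asked for
  schema <- list(
    type = "object",
    properties = list(
      name  = list(type = "string"),
      score = list(type = "number")
    ),
    required = list("name", "score")
  )

  cfg <- llm_config(
    "gemini", "gemini-2.0-flash",
    response_mime_type = "application/json",
    gemini_enable_response_schema = TRUE,
    response_schema = schema
  )

  res <- call_llm(cfg, "Produce fields name and score about 'MNIST'.")
  str(llm_parse_structured(res))
})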

Defensive patterns (no API calls)

safe({
  library(LLMR); library(tibble)

  messy <- c(
    '```json\n{"x": 1, "y": [1,2,3]}\n```',
    'Sure! Here is JSON: {"x":"1","y":"oops"} trailing words',
    '{"x":1, "y":[2,3,4]}'
  )

  tibble(response_text = messy) |>
    llm_parse_structured_col(
      fields = c(x = "x", y = "/y/0")   # dot/bracket or JSON Pointer
    ) |>
    print(n = Inf)
})

Why this helps: it works when outputs arrive fenced, with pre/post text, or when arrays sneak in. Non-scalars become list-columns (set allow_list = FALSE to force scalars only).
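
To see the scalar-only behavior, the same parse can be rerun with allow_list = FALSE; a minimal sketch (the allow_list argument is the one mentioned above):

safe({
  library(LLMR); library(tibble)

  # allow_list = FALSE forces scalar columns only, instead of
  # storing non-scalar values such as arrays as list-columns
  tibble(response_text = '{"x": 1, "y": [1,2,3]}') |>
    llm_parse_structured_col(
      fields = c("x", "y"),
      allow_list = FALSE
    ) |>
    print()
})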


Choosing the mode


References



