Writing evals for your LLM product

```{r}
#| label: setup
#| eval: false
library(vitals)
```

Writing evaluations for LLM products is hard. It takes a lot of effort to design a system that effectively identifies value and surfaces issues. So why go to the trouble?

First, public evals will not cut it. The release posts for new language models often include tables displaying results from several of the most popular public evaluations. For example, this table from the Claude 4 release post:

[Image: table of benchmark results from the Claude 4 release post.]

These widely used, off-the-shelf evals measure general model capabilities like real-world GitHub issue resolution, tool use, and so on. While they serve as a good reference point for choosing which models to experiment with initially, they're not a good measure of how well your product specifically performs. As @yan2024evals writes, "If you've ran off-the-shelf evals for your tasks, you may have found that most don't work. They barely correlate with application-specific performance and aren't discriminative enough to use in production." When designing an LLM product, your task is to evaluate that product rather than the model underlying it.

Does an LLM product really need its own evaluation, though? Especially in early development stages, couldn't the product builder and/or domain expert just test it out themselves, interactively? While this works in the earliest stages of experimentation, you can't get by without evals if you want your product to make it to production. As applications develop from prototypes into production systems, keeping up with all of the potential points of failure while trying to iterate becomes unsustainable without sufficient automation: "Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole" [@husain2024evals]. Without automated evals, then, the pace at which you can iterate on your product slows to a halt just as you approach production-readiness. In contrast, "If you streamline your evaluation process, all other activities become easy. This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment" [@husain2024evals].

In short, custom evals are necessary to bring your LLM product from demo to production.

How to write good evals

Writing an eval with vitals requires defining a dataset, a solver, and a scorer. I'll speak to each of these elements individually.
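To make the pieces concrete, here's a minimal sketch of a complete eval. The Task interface, the are dataset (vitals' "An R Eval" dataset), and ellmer's chat_anthropic() are as I understand the current packages; check each package's reference for the exact arguments your installed versions support.

```r
library(vitals)
library(ellmer)

# Assemble a task from the three pieces: a dataset, a solver, and a scorer.
tsk <- Task$new(
  dataset = are,                                      # data frame with input and target columns
  solver = generate(solver_chat = chat_anthropic()),  # the Chat object powering your product
  scorer = model_graded_qa()                          # LLM-as-a-judge scorer
)

# Run the solver and the scorer over every sample, then browse the results.
tsk$eval()
```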

Datasets

In vitals, a dataset is a data frame with, at minimum, input and target columns. A "sample" is a row of the dataset. input holds a question or prompt as an end user might supply it, and target holds the target answer and/or grading guidance for that question. What sorts of input prompts should you include, though?

In short, inputs should be natural. Rather than "setting up" the model with exactly the right context and phrasing, "[i]t's important that the dataset... represents the types of interactions that your AI will have in production" [@husain2024judge].

If your system is going to answer a set of questions similar to some set that already exists—support tickets, for example—use the actual tickets themselves rather than writing your own from scratch. In this case, refrain from correcting spelling errors, removing unneeded context, or doing any "sanitizing" before providing the system with the input; you want the distribution of inputs to resemble what the system will encounter in the wild as closely as possible.

If there is no existing resource of input prompts to pull from, still try to avoid this sort of unrealistic set-up. I'll specifically call out multiple choice questions here—while multiple choice responses are easy to grade automatically, your inputs should only provide a system with multiple choices to select from if the production system will also have access to multiple choices [@press2024benchmarks]. If you're writing your own questions, I encourage you to read the "Dimensions for Structuring Your Dataset" section from @husain2024judge, which provides a few axes to keep in mind when thinking about how to generate data that resembles what your system will ultimately see:

> You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:
>
> - Features: Specific functionalities of your AI product.
> - Scenarios: Situations or problems the AI may encounter and needs to handle.
> - Personas: Representative user profiles with distinct characteristics and needs.
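To make these dimensions concrete, one option is to enumerate their combinations and then write (or generate) a realistic input for each. A small sketch, where the specific feature, scenario, and persona values are made up for illustration:

```r
library(tidyr)

# Hypothetical dimension values for a B2C support assistant; swap in ones
# that make sense for your own product.
dimensions <- crossing(
  feature  = c("order tracking", "returns"),
  scenario = c("item arrived damaged", "wrong item shipped"),
  persona  = c("first-time customer", "frustrated repeat customer")
)

# Each of the resulting 8 rows is a combination to draft a realistic input for.
dimensions
```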

The other part of the "how" is the mechanics. You probably don't want to paste a bunch of questions into a call to tibble(), escaping quotes as you go; keep the samples somewhere better suited to free-form text entry and read them into R as a data frame instead.
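For instance, a minimal sketch, assuming the samples live in a CSV file with input and target columns (the file name here is hypothetical):

```r
library(readr)

# One row per sample: an input prompt and the grading guidance for it.
dataset <- read_csv("eval-samples.csv", col_types = cols(.default = "c"))

# vitals expects, at minimum, input and target columns.
stopifnot(all(c("input", "target") %in% names(dataset)))
```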

Solvers

The solver is your AI product itself. If the product is a chat app or an app that integrates with a chat feature, you can supply the ellmer Chat object powering the chat directly as the solver_chat argument to generate() and you're ready to go. In that case, ensure that the same prompt and tools are all attached to the Chat object as they would be in your product.
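As a sketch of that setup (the provider, system prompt, and any tools here are placeholders for whatever your product actually uses):

```r
library(ellmer)
library(vitals)

# Configure the Chat object exactly as the product does: same model, same
# system prompt, and the same registered tools, if any.
product_chat <- chat_anthropic(
  system_prompt = "You are the support assistant for <your product>."
)

# Supplying that Chat object to generate() turns it into a solver.
solver <- generate(solver_chat = product_chat)
```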

Scorers

Scorers take the input, the target, and the solver's output, and judge whether the solver's output satisfies the grading guidance in target closely enough. LLMs are most useful precisely in situations where inputs and outputs are diverse and highly variable in structure; as a result, determining correctness in evals is a hard problem.

When implementing a scorer, you have a few options:

| Technique | Deterministic Scoring | LLM-as-a-judge / model grading | Human grading |
|------------------|------------------|------------------|------------------|
| Speed | Very fast | Pretty fast | Very slow |
| Cost | Very cheap / "free" | Pretty cheap | Very expensive |
| Applicability | Narrow | Broad | Broad |

Understandably, many people bristle at the thought of LLMs evaluating their own output; while these systems do take some careful refinement, "when implemented well, LLM-as-Judge achieves decent correlation with human judgments" [@yan2024building]. While vitals will safeguard you from many common LLM-as-a-judge missteps with model_graded_qa(), it's still worth writing clear grading guidance in target and periodically spot-checking the judge's decisions against your own.
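As a sketch of configuring such a judge (the scorer_chat argument shown here is an assumption; check ?model_graded_qa for the arguments your installed version supports):

```r
library(vitals)
library(ellmer)

# An LLM-as-a-judge scorer: the judge compares the solver's output against
# the grading guidance in the sample's target column.
scorer <- model_graded_qa(
  # Using a separate judge model is a common choice, not a requirement.
  scorer_chat = chat_anthropic()
)
```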


