```r
#| label: setup
#| eval: false
library(vitals)
```
Writing evaluations for LLM products is hard. It takes a lot of effort to design a system that effectively identifies value and surfaces issues. So why bother?
First, public evals will not cut it. The release posts for new language models often include tables displaying results from several of the most popular public evaluations. For example, this table from the Claude 4 release post:
[Image: benchmark comparison table from the Claude 4 release post.]
These widely used, off-the-shelf evals measure general model capabilities like real-world GitHub issue resolution, tool use, and so on. While such public evaluations serve as a good reference point for choosing which models to experiment with initially, they're not a good measure of how well your product specifically performs. As @yan2024evals writes, "If you've ran off-the-shelf evals for your tasks, you may have found that most don't work. They barely correlate with application-specific performance and aren't discriminative enough to use in production." When building an LLM product, your task is to evaluate that product rather than the model underlying it.
Does an LLM product really need its own evaluation, though? Especially in the early stages of development, couldn't the product builder and/or a domain expert just test it out themselves, interactively? While this works in the earliest stages of experimentation, you can't get by without evals if you want your product to make it to production. As applications develop from prototypes into production systems, keeping up with all of the potential points of failure while trying to iterate becomes unsustainable without sufficient automation: "Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole" [@husain2024evals]. Without automated evals, then, the pace at which you can iterate on your product slows to a crawl as you approach production-readiness. In contrast, "If you streamline your evaluation process, all other activities become easy. This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment" [@husain2024evals].
In short, custom evals are necessary to bring your LLM product from demo to production.
Writing an eval with vitals requires defining a dataset, a solver, and a scorer. I'll speak to each of these elements individually.
In vitals, a dataset is a data frame with, at minimum, the columns `input` and `target`. A "sample" is a row of a dataset. `input` defines a question as an end user would enter it, and `target` defines the target answer and/or grading guidance for it. What sorts of input prompts should you include, though?
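Before getting to that question, here's a minimal sketch of what such a dataset looks like structurally; the questions and grading guidance below are purely illustrative:

```r
library(tibble)

# One row per sample: `input` is the question as a user would ask it,
# `target` is the answer and/or grading guidance for that question.
dataset <- tibble(
  input = c(
    "How do I read a CSV file into a data frame in R?",
    "Why does my left join return more rows than I started with?"
  ),
  target = c(
    "Recommends readr::read_csv() (or read.csv()) and shows a short example call.",
    "Explains that duplicate keys in the right-hand table cause row duplication."
  )
)
```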
In short, inputs should be natural. Rather than "setting up" the model with exactly the right context and phrasing, "[i]t's important that the dataset... represents the types of interactions that your AI will have in production" [@husain2024judge].
If your system is going to answer a set of questions similar to some set that already exists—support tickets, for example—use the actual tickets themselves rather than writing your own from scratch. In this case, refrain from correcting spelling errors, removing unneeded context, or doing any "sanitizing" before providing the system with the input; you want the distribution of inputs to resemble what the system will encounter in the wild as closely as possible.
If there is no existing resource of input prompts to pull from, still try to avoid this sort of unrealistic set-up. I'll specifically call out multiple choice questions here—while multiple choice responses are easy to grade automatically, your inputs should only provide a system with multiple choices to select from if the production system will also have access to multiple choices [@press2024benchmarks]. If you're writing your own questions, I encourage you to read the "Dimensions for Structuring Your Dataset" section from @husain2024judge, which provides a few axes to keep in mind when thinking about how to generate data that resembles what your system will ultimately see:
> You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:
>
> - **Features**: Specific functionalities of your AI product.
> - **Scenarios**: Situations or problems the AI may encounter and needs to handle.
> - **Personas**: Representative user profiles with distinct characteristics and needs.
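If you do end up drafting questions yourself, one lightweight way to put dimensions like these to work is to cross them into a grid and write an input for each combination. The dimension values here are hypothetical:

```r
library(tidyr)

# Cross example dimension values into a grid of situations to write
# inputs for; each row becomes a prompt for you (or a domain expert) to draft.
sample_grid <- crossing(
  feature  = c("order tracking", "refund requests"),
  scenario = c("happy path", "missing account details"),
  persona  = c("first-time customer", "long-time customer")
)

sample_grid
```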
The other part of the "how" is the mechanics. You probably don't want to paste a bunch of questions into a call to `tibble()`, escape a bunch of quotes, etc. Instead, here's how I've written evals:

- Decide on the metadata you'd like to attach to each sample beyond `input` and `target`. Do you want to tag questions with categories? Include a source URL? This isn't stuff that will actually be integrated into the eval, but it will be carried along throughout the evaluation to aid your analysis down the line.
- Put together a small data-entry app (by hand or with an LLM's help) with a text field for the `input` and another for the `target`, fields for whatever metadata you'd like to carry along, and a "Submit" button. When you click "Submit", the app should write the inputted values into a `.yaml` file in some directory and "clear" the app so that you can then add a new sample. Here's a sample prompt that I used to help me write an eval dataset recently.
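Once you have a directory of these `.yaml` files, assembling the dataset itself is a few lines of R. This sketch assumes each file stores `input`, `target`, and any metadata as top-level fields:

```r
library(purrr)
library(tibble)

# Read each sample's .yaml file and row-bind them into the eval dataset.
# Assumes files like samples/0001.yaml with `input`, `target`, and any
# metadata fields stored at the top level (requires the yaml package).
sample_files <- list.files("samples", pattern = "\\.yaml$", full.names = TRUE)

dataset <-
  sample_files |>
  map(yaml::read_yaml) |>
  map(as_tibble) |>
  list_rbind()
```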
The solver is your AI product itself. If the product is a chat app, or an app that integrates a chat feature, you can supply the ellmer Chat object powering the chat directly as the `solver_chat` argument to `generate()` and you're ready to go. In that case, ensure that the same prompt and tools are attached to the Chat object as they would be in your product.
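As a rough sketch of how the pieces fit together, assuming vitals' `Task$new()`/`$eval()` interface and a placeholder system prompt in place of your product's real one:

```r
library(ellmer)
library(vitals)

# The same Chat that powers the product: same system prompt, same tools.
# (This system prompt is a placeholder; register your real tools, too.)
product_chat <- chat_anthropic(
  system_prompt = "You are the assistant embedded in our support portal."
)

# Hand that Chat to vitals as the solver.
task <- Task$new(
  dataset = dataset,
  solver  = generate(solver_chat = product_chat),
  scorer  = model_graded_qa()  # scorers are covered below
)

task$eval()
```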
Scorers take the `input`, the `target`, and the solver's output, and deliver some judgment about whether the solver's output satisfies the grading guidance from `target` closely enough. At the most fundamental level, LLMs are useful in situations where inputs and outputs are diverse and considerably variable in structure; as a result, determining correctness in evals is a hard problem.
When implementing a scorer, you have a few options:

- Deterministic scoring: Check the solver's output directly with string matching or regular expressions (via `detect_match()` and friends). Or, if your scorer outputs code, you could check if it compiles/parses and then run that code through a unit test.
- LLM-as-a-judge / model grading: Pass the solver's output to another model, along with the grading guidance from `target`, and ask the model to assign some score. (vitals natively supports factors with levels `I < P < C`, standing for Incorrect, (optional) Partially Correct, and Correct.)

| Technique | Deterministic Scoring | LLM-as-a-judge / model grading | Human grading |
|------------------|------------------|------------------|------------------|
| Speed | Very fast | Pretty fast | Very slow |
| Cost | Very cheap / "free" | Pretty cheap | Very expensive |
| Applicability | Narrow | Broad | Broad |
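When the target really is an exact string (a number, a keyword), the deterministic route is as simple as swapping the scorer. A sketch, reusing the product chat from the solver example above:

```r
library(vitals)

# Deterministic scoring: compare the solver's output against `target`
# with string matching instead of asking a judge model.
arithmetic_task <- Task$new(
  dataset = tibble::tibble(
    input  = c("What's 2 + 2?", "What's 11 * 12?"),
    target = c("4", "132")
  ),
  solver = generate(solver_chat = product_chat),
  scorer = detect_match()
)
```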
Understandably, many people bristle at the thought of LLMs evaluating their own output; while these systems do take some careful refinement, "when implemented well, LLM-as-Judge achieves decent correlation with human judgments" [@yan2024building]. While vitals will safeguard you from many common LLM-as-a-judge missteps via `model_graded_qa()`, here are some recommendations for implementing these systems for your own use cases:
You should still be looking at lots of data yourself: "After looking at the data, it's likely you will find errors in your AI system. Instead of plowing ahead and building an LLM judge, you want to fix any obvious errors. Remember, the whole point of the LLM as a judge is to help you find these errors, so it's totally fine if you find them earlier" [@husain2024judge]. Especially when you first implement the LLM-as-a-judge, look at its reasoning output and ensure that it aligns with your own intuitions as closely as possible. Sometimes this means fixing the judge, and sometimes it means going back and editing grading guidance in the samples themselves.
Start with binary judgments: It may be tempting to implement some sort of scoring rubric that assigns a numeric score to output. Instead, start off by trying to align a scorer that only returns Pass/Fail with your own intuitions about whether an output is satisfactory. With the right `instructions` argument, you might be able to get `model_graded_qa()` to output trusted judgments reliably.
Use strong models to judge: In your production system, you might be interested in driving down costs and reducing latency by choosing smaller models in your solver. In your LLM-as-a-judge system, though, stick with "flagship" models like Claude 4 Sonnet, GPT 4.1, Gemini 2.5 Pro, etc.; you're much more likely to get results you can begin to trust from larger models.
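Putting those recommendations together, a judge might look something like the sketch below. The `instructions` text is illustrative, and I'm assuming the judge's Chat object is supplied via a `scorer_chat` argument; check `?model_graded_qa` for the exact interface in your version of vitals.

```r
library(ellmer)
library(vitals)

# A flagship model as the judge. (`scorer_chat` is an assumed argument
# name; see ?model_graded_qa. The model id is just an example.)
judge_chat <- chat_anthropic(model = "claude-sonnet-4-20250514")

binary_scorer <- model_graded_qa(
  scorer_chat = judge_chat,
  # Illustrative grading guidance: binary judgments only, no partial credit.
  instructions = paste(
    "Grade the submission against the provided criterion.",
    "Mark it correct only if it fully satisfies the criterion;",
    "otherwise, mark it incorrect. Do not award partial credit."
  )
)
```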