scorer_model: Model-based scoring
In vitals: Large Language Model Evaluation

Description

Model-based scoring uses a model to score the output of a solver.

model_graded_qa() scores how well a solver answers a question/answer task.
model_graded_fact() determines whether a solver includes a given fact in its response.

The two scorers are implemented similarly but use different default templates to evaluate correctness.

Usage

model_graded_qa(
  template = NULL,
  instructions = NULL,
  grade_pattern = "(?i)GRADE\\s*:\\s*([CPI])(.*)$",
  partial_credit = FALSE,
  scorer_chat = NULL
)

model_graded_fact(
  template = NULL,
  instructions = NULL,
  grade_pattern = "(?i)GRADE\\s*:\\s*([CPI])(.*)$",
  partial_credit = FALSE,
  scorer_chat = NULL
)

Arguments

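`template`	Grading template to use: a `glue()` string that takes the substitutions `input`, `answer`, `criterion`, and `instructions`.
`instructions`	Grading instructions, substituted into the template as `instructions`.
`grade_pattern`	A regex pattern to extract the final grade from the judge model's response. The default matches text such as `GRADE: C` and captures one of the letters `C`, `P`, or `I`.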
`partial_credit`	Whether to allow partial credit.
`scorer_chat`	An ellmer chat used to grade the model output, e.g. `ellmer::chat_anthropic()`.
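
To illustrate what the default `grade_pattern` captures, the following minimal sketch applies it with base R to a made-up judge response. The `response` string is hypothetical, and calling `regexec()` with `perl = TRUE` is an assumption made for illustration; vitals may apply the pattern differently internally.

```r
# Default pattern, copied from the Usage section.
grade_pattern <- "(?i)GRADE\\s*:\\s*([CPI])(.*)$"

# Hypothetical judge response; real responses depend on the grading template.
response <- "The submission agrees with the criterion.\nGRADE: C"

# perl = TRUE so the inline (?i) case-insensitivity flag is handled by PCRE.
m <- regmatches(response, regexec(grade_pattern, response, perl = TRUE))[[1]]

# m[1] is the full match; m[2] is the first capture group (the grade letter).
grade <- toupper(m[2])
grade
```

Because the pattern is case-insensitive, a response ending in `grade: c` would be captured as well; the `toupper()` call here only normalizes the letter for display.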

Value

A function that will grade model responses according to the given instructions. See Task's scorer argument for a description of the returned function. The functions returned by model_graded_qa() and model_graded_fact() can be passed directly to $eval().
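
As a sketch of how the arguments combine, the scorer below allows partial credit and grades with an explicitly chosen judge chat. It uses only the arguments listed above and the model name from the package examples; the choice of judge model is an illustrative assumption, not a recommendation.

```r
# Guarded like the package examples: chat_anthropic() needs an API key.
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  library(vitals)
  library(ellmer)

  # Allow partial credit and grade with a dedicated judge chat.
  scorer <- model_graded_qa(
    partial_credit = TRUE,
    scorer_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")
  )

  # The result is a function that can be supplied as Task's scorer argument.
  is.function(scorer)
}
```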

Examples

# Question answering ----------------------------
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  # set the log directory to a temporary directory
  withr::local_envvar(VITALS_LOG_DIR = withr::local_tempdir())

  library(vitals)
  library(ellmer)
  library(tibble)

  simple_addition <- tibble(
    input = c("What's 2+2?", "What's 2+3?"),
    target = c("4", "5")
  )

  tsk <- Task$new(
    dataset = simple_addition,
    solver = generate(solver_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_qa()
  )

  tsk$eval()
}

# Factual response -------------------------------
if (!identical(Sys.getenv("ANTHROPIC_API_KEY"), "")) {
  # set the log directory to a temporary directory
  withr::local_envvar(VITALS_LOG_DIR = withr::local_tempdir())

  library(vitals)
  library(ellmer)
  library(tibble)

  r_history <- tibble(
    input = c(
      "Who created the R programming language?",
      "In what year was version 1.0 of R released?"
    ),
    target = c("Ross Ihaka and Robert Gentleman.", "2000.")
  )

  tsk <- Task$new(
    dataset = r_history,
    solver = generate(solver_chat = chat_anthropic(model = "claude-3-7-sonnet-latest")),
    scorer = model_graded_fact()
  )

  tsk$eval()
}

vitals documentation built on June 24, 2025, 9:08 a.m.