Quality Evaluations

Define how to evaluate whether a variant performed correctly.

What are Quality Evaluations?

Quality evaluations measure how well your variants performed.

There are two types of quality evaluations:

  • those that require a source of truth (e.g., LLM Judge)
  • those that do not require a source of truth (e.g., Structured Output Schema)
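
As a rough illustration of the difference, the sketch below contrasts the two kinds of check. All names here (matches_schema, matches_reference) are hypothetical and are not part of any actual API. The schema check is a simplified stand-in for a Structured Output Schema evaluation, and the exact-match comparison is a simplified stand-in for an evaluation that needs a source of truth.

```python
import json


def matches_schema(response: str, required_fields: dict[str, type]) -> bool:
    """No source of truth needed: only the shape of the response is checked."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # Every required field must be present and have the expected type.
    return all(
        name in data and isinstance(data[name], expected_type)
        for name, expected_type in required_fields.items()
    )


def matches_reference(response: str, source_of_truth: str) -> bool:
    """Source of truth needed: the response is compared with a reference answer."""
    # Assumption: a case-insensitive exact match; a real judge would be more lenient.
    return response.strip().lower() == source_of_truth.strip().lower()


# Hypothetical variant response, checked without any reference answer.
response = '{"city": "Paris", "population": 2100000}'
schema_ok = matches_schema(response, {"city": str, "population": int})

# Hypothetical variant response, checked against a reference answer.
reference_ok = matches_reference("Paris ", "paris")
```

The first check can run on any response, while the second cannot run at all until a reference answer is available.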

The source of truth can be defined in one of two ways:

  • Expected output - defined when the benchmark is created; best for benchmarks where the correct response is obvious
  • State of the art model - taken from the response produced by the State of the Art model
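
One way to picture the two options is as a lookup with a fallback. The sketch below is an assumption about how such a choice could be made, not a description of the actual implementation: BenchmarkItem, resolve_source_of_truth, and fake_sota_model are hypothetical names, and preferring the expected output over the model's response is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BenchmarkItem:
    """Hypothetical benchmark entry; expected_output is optional."""
    prompt: str
    expected_output: Optional[str] = None


def resolve_source_of_truth(
    item: BenchmarkItem, sota_model: Callable[[str], str]
) -> str:
    # Assumption: an expected output set at benchmark creation takes priority.
    if item.expected_output is not None:
        return item.expected_output
    # Otherwise, fall back to the response of a state-of-the-art model.
    return sota_model(item.prompt)


def fake_sota_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to keep the sketch runnable."""
    return f"reference answer for: {prompt}"


with_expected = resolve_source_of_truth(
    BenchmarkItem(prompt="2 + 2 = ?", expected_output="4"), fake_sota_model
)
without_expected = resolve_source_of_truth(
    BenchmarkItem(prompt="Summarize the article."), fake_sota_model
)
```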

When do you use Quality Evaluations?

Use quality evaluations to assess how well a variant performs on a benchmark.

How do Quality Evaluations work?

Under the hood, quality evaluations take the model's response as input and evaluate its correctness.