Metrics

How you evaluate A/B tests: quality, latency, and cost.

What are metrics?

Metrics measure how well your variants perform across three dimensions: quality, latency, and cost.

When do you use metrics?

Use metrics to evaluate A/B test results and determine which variant performs best for your use case. Different metrics matter for different scenarios; prioritize the ones that match your requirements.

How do metrics work?

Quality Metrics

Quality metrics evaluate the correctness and usefulness of responses. Some require an expected_output (the answer the prompt should produce), while others don't (a short sketch follows each list below):

Requires expected output:

  • LLM Judge (Binary) - AI evaluator gives pass/fail scores
  • Exact Match - Checks if output exactly matches expected answer
  • Fuzzy Match - Flexible matching with substring and keyword overlap
  • Structured Output Values - Compares JSON values with expected output
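A rough sketch of the two string-based checks above, assuming output and expected_output are plain strings; the 0.6 threshold and whitespace keyword split are illustrative choices, not the exact algorithm:

```python
def exact_match(output: str, expected_output: str) -> bool:
    """Pass only if the output matches the expected answer exactly
    (after trimming surrounding whitespace)."""
    return output.strip() == expected_output.strip()


def fuzzy_match(output: str, expected_output: str, threshold: float = 0.6) -> bool:
    """Pass if the expected answer appears as a substring of the output,
    or if enough of its keywords show up in the output."""
    out, exp = output.lower(), expected_output.lower()
    if exp in out:
        return True
    keywords = set(exp.split())
    if not keywords:
        return False
    overlap = len(keywords & set(out.split())) / len(keywords)
    return overlap >= threshold
```

For example, fuzzy_match("The capital of France is Paris.", "Paris") passes via the substring check, while exact_match fails for the same pair.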

No expected output needed:

  • LLM Judge (Ranking) - AI evaluator ranks responses from all variants
  • Structured Output Schema - Validates JSON/structured format compliance
  • Manual Evaluation - Human reviewers compare and score variants
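Of the metrics that need no expected output, Structured Output Schema is the most mechanical: the response only has to parse as JSON and conform to a schema. A minimal sketch using the third-party jsonschema package (the schema and its field names are placeholders, not a prescribed format):

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Placeholder schema: the response must be a JSON object with a string
# "answer" and a numeric "confidence".
SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}


def schema_compliant(output: str) -> bool:
    """Pass if the raw model output parses as JSON and matches SCHEMA."""
    try:
        validate(instance=json.loads(output), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```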

Latency Metrics

  • Total Execution Time - Complete request duration
  • Time to First Token (TTFT) - Time from sending the request until the first token arrives
  • Tokens Per Second - Streaming throughput
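All three latency metrics can be read off a single streamed response. A sketch, assuming a token iterator from whatever streaming client you use (the iterator itself is not shown here):

```python
import time
from typing import Iterable


def latency_metrics(token_stream: Iterable[str]) -> dict:
    """Measure total execution time, time to first token, and tokens per
    second for one streamed response."""
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for _ in token_stream:  # consume tokens as they arrive
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

    end = time.perf_counter()
    total = end - start
    ttft = (first_token_at - start) if first_token_at is not None else total
    streaming_time = (end - first_token_at) if first_token_at is not None else 0.0
    return {
        "total_execution_time_s": total,
        "time_to_first_token_s": ttft,
        "tokens_per_second": token_count / streaming_time if streaming_time > 0 else 0.0,
    }
```

This sketch computes tokens per second over the window after the first token arrives; dividing by total time instead is another common convention.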

Cost Metrics

  • Cost Per Prompt - Individual request cost
  • Cost Per Variant - Total cost for all prompts on a variant
  • Cost Per A/B Test - Aggregate cost across all variants
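Cost rolls up from the bottom: each prompt's cost comes from its token usage and the model's per-token rates, and those costs sum to variant and test totals. A sketch with placeholder prices (not real pricing):

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M input/output tokens; substitute the
# real rates for whichever models your variants use.
PRICE_PER_MTOK = {"model-a": (3.00, 15.00), "model-b": (0.25, 1.25)}


@dataclass
class PromptRun:
    model: str
    input_tokens: int
    output_tokens: int


def cost_per_prompt(run: PromptRun) -> float:
    """Cost of a single request: tokens in and out times the model's rates."""
    in_rate, out_rate = PRICE_PER_MTOK[run.model]
    return (run.input_tokens * in_rate + run.output_tokens * out_rate) / 1_000_000


def cost_per_variant(runs: list[PromptRun]) -> float:
    """Total cost of all prompts executed against one variant."""
    return sum(cost_per_prompt(r) for r in runs)


def cost_per_ab_test(variants: dict[str, list[PromptRun]]) -> float:
    """Aggregate cost across every variant in the test."""
    return sum(cost_per_variant(runs) for runs in variants.values())
```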