Metrics

How you evaluate A/B tests: quality, latency, and cost.

What are metrics?

Metrics measure how well your variants perform across three dimensions: quality, latency, and cost.

When do you use metrics?

Use metrics to evaluate A/B test results and determine which variant performs best for your use case. Different metrics matter for different scenarios; prioritize the ones that match your requirements.

How do metrics work?

Quality Metrics

Quality metrics evaluate the correctness and usefulness of responses. Some require an expected_output (the answer the prompt should produce), while others don't (a short sketch follows each list below):

Requires expected output:

  • LLM Judge (Binary) - AI evaluator gives pass/fail scores
  • Exact Match - Checks if output exactly matches expected answer
  • Fuzzy Match - Flexible matching with substring and keyword overlap
  • Structured Output Values - Compares JSON values with expected output
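A rough sketch of the two string-based checks above, assuming output and expected_output are plain strings; the 0.6 threshold and whitespace keyword split are illustrative choices, not the exact algorithm:

```python
def exact_match(output: str, expected_output: str) -> bool:
    """Pass only if the output matches the expected answer exactly
    (after trimming surrounding whitespace)."""
    return output.strip() == expected_output.strip()


def fuzzy_match(output: str, expected_output: str, threshold: float = 0.6) -> bool:
    """Pass if the expected answer appears as a substring of the output,
    or if enough of its keywords show up in the output."""
    out, exp = output.lower(), expected_output.lower()
    if exp in out:
        return True
    keywords = set(exp.split())
    if not keywords:
        return False
    overlap = len(keywords & set(out.split())) / len(keywords)
    return overlap >= threshold
```

For example, fuzzy_match("The capital of France is Paris.", "Paris") passes via the substring check, while exact_match fails for the same pair.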

No expected output needed:

  • LLM Judge (Ranking) - AI evaluator ranks responses from all variants
  • Structured Output Schema - Validates JSON/structured format compliance
  • Manual Evaluation - Human reviewers compare and score variants
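Of the metrics that need no expected output, Structured Output Schema is the most mechanical: the response only has to parse as JSON and conform to a schema. A minimal sketch using the third-party jsonschema package (the schema and its field names are placeholders, not a prescribed format):

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

# Placeholder schema: the response must be a JSON object with a string
# "answer" and a numeric "confidence".
SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}


def schema_compliant(output: str) -> bool:
    """Pass if the raw model output parses as JSON and matches SCHEMA."""
    try:
        validate(instance=json.loads(output), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False
```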

Latency Metrics

  • Total Execution Time - Complete request duration
  • Time to First Token (TTFT) - Time from sending the request until the first token arrives
  • Tokens Per Second - Streaming throughput
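All three latency metrics can be read off a single streamed response. A sketch, assuming a token iterator from whatever streaming client you use (the iterator itself is not shown here):

```python
import time
from typing import Iterable


def latency_metrics(token_stream: Iterable[str]) -> dict:
    """Measure total execution time, time to first token, and tokens per
    second for one streamed response."""
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for _ in token_stream:  # consume tokens as they arrive
        if first_token_at is None:
            first_token_at = time.perf_counter()
        token_count += 1

    end = time.perf_counter()
    total = end - start
    ttft = (first_token_at - start) if first_token_at is not None else total
    streaming_time = (end - first_token_at) if first_token_at is not None else 0.0
    return {
        "total_execution_time_s": total,
        "time_to_first_token_s": ttft,
        "tokens_per_second": token_count / streaming_time if streaming_time > 0 else 0.0,
    }
```

This sketch computes tokens per second over the window after the first token arrives; dividing by total time instead is another common convention.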

Cost Metrics

  • Cost Per Prompt - Individual request cost
  • Cost Per Variant - Total cost for all prompts on a variant
  • Cost Per A/B Test - Aggregate cost across all variants
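Cost rolls up from the bottom: each prompt's cost comes from its token usage and the model's per-token rates, and those costs sum to variant and test totals. A sketch with placeholder prices (not real pricing):

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M input/output tokens; substitute the
# real rates for whichever models your variants use.
PRICE_PER_MTOK = {"model-a": (3.00, 15.00), "model-b": (0.25, 1.25)}


@dataclass
class PromptRun:
    model: str
    input_tokens: int
    output_tokens: int


def cost_per_prompt(run: PromptRun) -> float:
    """Cost of a single request: tokens in and out times the model's rates."""
    in_rate, out_rate = PRICE_PER_MTOK[run.model]
    return (run.input_tokens * in_rate + run.output_tokens * out_rate) / 1_000_000


def cost_per_variant(runs: list[PromptRun]) -> float:
    """Total cost of all prompts executed against one variant."""
    return sum(cost_per_prompt(r) for r in runs)


def cost_per_ab_test(variants: dict[str, list[PromptRun]]) -> float:
    """Aggregate cost across every variant in the test."""
    return sum(cost_per_variant(runs) for runs in variants.values())
```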