Why A/B tests?

Test models on your actual product data to make routing decisions

A/B tests and experiments are controlled comparisons that reveal which model works best for your specific use case.

Limitations of academic benchmarks

Academic benchmarks like MMLU, HumanEval, or MT-Bench don't tell you which model is best for your product.

Academic benchmarks are not generalizable

Benchmarks evaluate models on tasks that rarely resemble what your users actually do. A model that excels at multiple-choice questions is not guaranteed to excel at customer support queries. A model that tops coding benchmarks may not work well for structured text extraction.

Real example: GPT-3.5 can outperform GPT-4 on real product tasks despite scoring lower on the MMLU benchmark.

Providers are gaming the benchmarks

Even the most respected benchmarks are being systematically gamed by major AI providers. Recent research has exposed how companies manipulate benchmark scores to make their models look better than they actually are:

  • Chatbot Arena manipulation: A major study by Singh et al. (2025) revealed that companies like Meta, OpenAI, and Google privately test dozens of model variants on Chatbot Arena, then only publish scores from the best-performing version. Meta alone tested 27 private variants before the Llama 4 release.
  • Data contamination: Deng et al. (2024) found that GPT-4 could guess 57% of missing answers in MMLU benchmark questions, suggesting the model had seen the test data during training. Yang et al. (2023) demonstrated that a 13B model could achieve GPT-4-level performance on benchmarks through contamination.

Benchmarks measure what's easy, not what matters

Academic benchmarks focus on what's convenient to measure, not what users actually care about. Very few people use LLMs to solve complex math problems (MATH, GSM8K) or answer scientific questions (MMLU, GPQA), yet these dominate benchmark suites because they're easy to score automatically.

Your business doesn't run on what's easy to measure. It runs on:

  • Conversion rates - Do responses lead to sales?
  • User satisfaction - Does the NPS score improve?
  • Code acceptance - What portion of generated code gets merged?
  • Support resolution - How many tickets get resolved on first contact?
  • Engagement - Do users continue the conversation?

These metrics are harder to measure than multiple-choice accuracy, but they're what actually matter for your business.

A/B tests are a better approach to testing than evals

Evals are hard and limited

Traditional evals are essentially unit tests for LLMs: the engineer writes expected outputs and checks whether the model's responses match (see the sketch after this list). But this approach has fundamental issues:

  • Evals must be representative and in sync with the product - test cases have to reflect real production usage, which is incredibly hard, especially as the product evolves and user behavior shifts
  • Evals are time-consuming - writing good evals can take as long as building the feature itself
  • Evals don't measure what matters - passing an eval doesn't mean users will be happy
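
To make the comparison concrete, here is a minimal sketch of what a traditional eval typically looks like; the cases and the call_model() hook are illustrative placeholders, not a real test suite.

```python
# Minimal sketch of a traditional eval: hand-written cases with expected outputs.
# call_model() stands in for however you invoke your LLM.

EVAL_CASES = [
    {"prompt": "Extract the order ID from: 'Where is order #4821?'", "expected": "4821"},
    {"prompt": "Classify the sentiment: 'The app keeps crashing.'", "expected": "negative"},
]

def run_evals(call_model) -> float:
    """Return the fraction of hand-written cases the model answers exactly right."""
    passed = 0
    for case in EVAL_CASES:
        output = call_model(case["prompt"])
        # Exact-match scoring: brittle, and only as good as the cases you wrote.
        if output.strip().lower() == case["expected"].lower():
            passed += 1
    return passed / len(EVAL_CASES)
```

Every one of those cases has to be written by hand, kept in sync with production traffic, and scored against a proxy for quality rather than a real user outcome.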

A/B tests are easy and representative

Instead of writing evals, use your actual product data:

  1. Take real queries from your production logs
  2. Send them to multiple models
  3. Measure actual product metrics (not synthetic "correctness")
  4. Let the data tell you which model performs best

This is exactly how the industry optimizes ads, pricing, and email marketing.
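
Here is a minimal sketch of that loop, assuming an OpenAI-compatible gateway (such as OpenRouter or LiteLLM) and a JSONL export of production queries; the gateway URL, file name, model IDs, and product_metric() hook are illustrative placeholders, not a Narev API.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://your-gateway.example/v1", api_key="YOUR_KEY")
MODELS = ["openai/gpt-4o-mini", "anthropic/claude-3.5-haiku"]

def product_metric(query: str, answer: str) -> float:
    # Placeholder: wire this to a real product signal
    # (code acceptance, ticket resolution, user rating, ...).
    return float(bool(answer and answer.strip()))

# 1. Take real queries from your production logs.
with open("production_queries.jsonl") as f:
    queries = [json.loads(line)["query"] for line in f]

scores = {model: [] for model in MODELS}
for query in queries:
    # 2. Send each query to every candidate model.
    for model in MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": query}],
        )
        answer = response.choices[0].message.content
        # 3. Score with your product metric, not synthetic correctness.
        scores[model].append(product_metric(query, answer))

# 4. Let the data pick the winner.
for model, values in scores.items():
    print(model, sum(values) / len(values))
```

In practice you would sample production traffic, log the raw responses, and collect the metric asynchronously from real user behavior rather than computing it inline.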

Why A/B test with Narev

Effortless testing

We handle the complexity of running parallel requests, collecting results, and analyzing data. You just point us at your queries and tell us which models to test.

Integrate with everything

Keep your existing stack. Use your current gateway (OpenRouter, LiteLLM, etc.), your tracing tools (Langfuse, LangSmith), and your model providers. We integrate with all of them.

Import test data from anywhere:

  • Production traces
  • File imports
  • Narev gateway

Measure your metrics

We don't force you to use our metrics. Track whatever matters to your product:

  • Your product KPIs
  • User feedback signals
  • Domain-specific quality measures
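
As an illustration, a custom metric can be as simple as aggregating a feedback signal per model; the event list and field names below are hypothetical, not a Narev schema.

```python
from collections import defaultdict

# Hypothetical feedback log: one event per A/B-tested response.
feedback_events = [
    {"model": "model-a", "thumbs_up": True},
    {"model": "model-a", "thumbs_up": False},
    {"model": "model-b", "thumbs_up": True},
]

totals = defaultdict(lambda: {"up": 0, "n": 0})
for event in feedback_events:
    stats = totals[event["model"]]
    stats["n"] += 1
    stats["up"] += int(event["thumbs_up"])

for model, stats in totals.items():
    print(f"{model}: {stats['up'] / stats['n']:.1%} thumbs-up across {stats['n']} responses")
```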