Why benchmark yourself?

Test models on your actual product data to make routing decisions

Limitations of academic benchmarks

Academic benchmarks like MMLU, HumanEval, or MT-Bench don't tell you which model is best for your product. This happens for three reasons:

  • Benchmark winners are often overkill for real-world use cases
  • Model creators game the academic benchmarks
  • Academic benchmarks don't generalize well

Academic benchmarks overindex on expensive models

State-of-the-art models that top academic benchmarks are expensive. Smaller models can be equally effective at real-world use cases, despite scoring lower on academic benchmarks.

Academic benchmarks evaluate models on tasks that approximate what users actually do, and the approximation is imperfect. A model that excels at PhD-level multiple-choice questions is not guaranteed to excel at customer support tickets or structured text extraction.

Model creators are gaming academic benchmarks

Academic benchmarks are public and can therefore be gamed. Recent research has exposed how model creators manipulate benchmark scores:

  • Chatbot Arena manipulation: A major study by Singh et al. (2025) revealed that companies like Meta, OpenAI, and Google privately test dozens of model variants on Chatbot Arena, then only publish scores from the best-performing version. Meta alone tested 27 private variants before the Llama 4 release.
  • Data contamination: Deng et al. (2024) found that GPT-4 could guess 57% of missing answers in MMLU benchmark questions, suggesting the model had seen the benchmark data during training. Yang et al. (2023) demonstrated that a 13B model could achieve GPT-4-level performance on benchmarks through contamination.

Benchmarks measure what's easy, not what matters

Academic benchmarks focus on what's convenient to measure. Very few people use LLMs to solve complex math problems (MATH, GSM8K) or answer scientific questions (MMLU, GPQA), yet these dominate benchmark suites because they're easy to score automatically.

Your business doesn't run on what's easy to measure. It runs on:

  • Conversion rates - Do responses lead to sales?
  • User satisfaction - Does the NPS score improve?
  • Code acceptance - What portion of generated code gets merged?
  • Support resolution - How many tickets get resolved on first contact?
  • Engagement - Do users continue the conversation?

These metrics are harder to measure than multiple-choice accuracy, but they're what actually matter for your business.
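Metrics like these can still be computed directly from request logs. Here is a minimal sketch, assuming a hypothetical log schema where each record names the model that served the request and whether the user converted:

```python
from collections import defaultdict

# Hypothetical log records: each request was routed to a model,
# and the product tracked whether the user converted afterward.
logs = [
    {"model": "model-a", "converted": True},
    {"model": "model-a", "converted": False},
    {"model": "model-b", "converted": True},
    {"model": "model-b", "converted": True},
]

def conversion_rate_by_model(records):
    """Group logged requests by model and compute each model's conversion rate."""
    totals = defaultdict(int)
    conversions = defaultdict(int)
    for r in records:
        totals[r["model"]] += 1
        if r["converted"]:
            conversions[r["model"]] += 1
    return {m: conversions[m] / totals[m] for m in totals}

print(conversion_rate_by_model(logs))
# → {'model-a': 0.5, 'model-b': 1.0}
```

The same grouping works for any of the metrics above: swap `converted` for a merged-code flag, an NPS delta, or a first-contact-resolution flag.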

Custom benchmarks versus evals

Evals are time-consuming and limited

Traditional evals are unit tests for LLMs. An engineer writes expected outputs and checks whether the model's responses match. But this approach has fundamental issues:

  • Evals must be representative and in sync with the product - test cases have to reflect actual production usage, which is hard because the product evolves and user behavior shifts
  • Evals are time-consuming - writing good evals takes time
  • Evals don't measure what matters - passing an eval does not mean that users convert
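To make the "unit tests for LLMs" point concrete, a traditional eval looks roughly like this. `call_model` is a hypothetical stand-in for a real model client, stubbed here for illustration:

```python
# A traditional eval: a fixed input, a hand-written expected
# output, and an exact-match check, just like a unit test.
def call_model(prompt: str) -> str:
    # Hypothetical model client, stubbed for illustration.
    return "Paris"

def test_capital_question():
    response = call_model("What is the capital of France?")
    assert response == "Paris", f"unexpected output: {response}"

test_capital_question()
```

Each such case must be written by hand and kept in sync with production traffic, which is exactly the maintenance burden described above.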

Custom benchmarks are easy and representative

Instead of writing evals, use your actual product data:

  1. Take real queries from your production logs
  2. Send them to multiple models
  3. Measure actual product metrics (not synthetic "correctness")
  4. Let the data tell you which model performs best

This is exactly how the industry optimizes ads, pricing, or email marketing.
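The four steps above can be sketched as a small benchmarking loop. This is a minimal illustration, not a real implementation: `call_model` is a hypothetical client, and the queries would come from your production logs rather than a hard-coded list:

```python
import concurrent.futures

def call_model(model: str, query: str) -> str:
    # Hypothetical model client, stubbed for illustration.
    return f"{model} answer to: {query}"

# Step 1: real queries from production logs (hard-coded here).
queries = ["How do I reset my password?", "Cancel my subscription"]
models = ["model-a", "model-b"]

def run_benchmark(queries, models):
    """Step 2: send every query to every candidate model in parallel."""
    results = {m: [] for m in models}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {
            pool.submit(call_model, m, q): (m, q)
            for m in models for q in queries
        }
        for fut in concurrent.futures.as_completed(futures):
            m, q = futures[fut]
            results[m].append({"query": q, "response": fut.result()})
    return results

results = run_benchmark(queries, models)
# Steps 3-4: join each response back to product metrics (conversion,
# resolution, engagement, ...) and let those numbers pick the model.
```

The responses themselves are not scored against a "correct" answer; the comparison happens on downstream product metrics.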

Ready to benchmark on your product data?

Narev handles the complexity of running parallel requests, collecting data, and summarizing results. You point it at your queries and tell it which models to test, and it works with your existing stack.