Limitations of academic benchmarks
Academic benchmarks like MMLU (Massive Multitask Language Understanding), HumanEval, or MT-Bench don’t tell you which model is best for your product. This happens for three reasons:
- Winners of academic benchmarks are often overkill for real-world use cases
- Model creators game the academic benchmarks
- Academic benchmarks don’t generalize well
Academic benchmarks favor expensive models
State-of-the-art models that top academic benchmarks are expensive. Smaller models can be just as effective at real-world use cases despite scoring lower on academic benchmarks. Academic benchmarks evaluate models on tasks that only approximate what your users actually do, and the approximation isn’t perfect: a model that excels at PhD-level multiple-choice questions isn’t guaranteed to excel at customer support tickets or structured text extraction.
- Real example: GPT-3.5 can outperform GPT-4 on real product tasks despite scoring lower on the MMLU benchmark.
Model creators are gaming academic benchmarks
Academic benchmarks are public, and model creators can game them. Recent research has exposed how benchmark scores get manipulated:
- Arena manipulation: A major study by Singh et al., 2025 revealed that companies like Meta, OpenAI, and Google privately test dozens of model variants on the LMSYS Chatbot Arena and then publish scores only for the best-performing version. Meta alone tested 27 private variants before the Llama 4 release.
- Data contamination: Deng et al., 2024 found that GPT-4 could guess 57% of missing answer options in MMLU questions, suggesting the model had seen benchmark data during training. Yang et al., 2023 demonstrated that a 13-billion-parameter model could reach GPT-4-level benchmark performance through contamination.
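As a rough illustration of the answer-reconstruction idea behind such contamination probes (a sketch, not the exact protocol from Deng et al., 2024): hide one answer option from a benchmark question and ask the model to reproduce it. A model that fills in hidden options verbatim far more often than chance has likely seen the benchmark during training. The `complete_prompt` client below is a hypothetical placeholder for whatever model API you use.

```python
# Sketch of an answer-reconstruction contamination probe.
# `complete_prompt(prompt: str) -> str` is a hypothetical client;
# swap in your own API call (OpenAI, Anthropic, a local model, etc.).

def build_probe(question: str, options: dict[str, str], hidden: str) -> str:
    """Show the question with one option removed and ask the model to fill it in."""
    shown = "\n".join(f"{k}. {v}" for k, v in options.items() if k != hidden)
    return (
        f"The following multiple-choice question is missing option {hidden}. "
        "Reply with the exact text of the missing option only.\n\n"
        f"{question}\n{shown}\n{hidden}."
    )

def reconstruction_rate(items: list[dict], complete_prompt) -> float:
    """Fraction of hidden options the model reproduces verbatim."""
    hits = 0
    for item in items:
        probe = build_probe(item["question"], item["options"], item["hidden"])
        guess = complete_prompt(probe).strip().lower()
        if guess == item["options"][item["hidden"]].strip().lower():
            hits += 1
    return hits / len(items)
```

A high reconstruction rate does not prove contamination on its own, but it is a strong signal that the benchmark no longer measures generalization for that model.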
Benchmarks measure what’s easy, not what matters
Academic benchmarks focus on what’s convenient to measure. Very few people use large language models to solve complex math problems (benchmarks like MATH and GSM8K) or answer graduate-level science questions (benchmarks like MMLU and GPQA). Yet these tasks dominate benchmark suites because they’re easy to score automatically. Your business doesn’t run on what’s easy to measure. It runs on:
- Conversion rates: Do responses lead to sales?
- User satisfaction: Does the Net Promoter Score improve?
- Code acceptance: What portion of generated code gets merged?
- Support resolution: How many tickets get resolved on first contact?
- Engagement: Do users continue the conversation?
Custom benchmarks versus evaluations
Evaluations are time-consuming and limited
Traditional evaluations are unit tests for large language models: the engineer writes expected outputs and checks whether the model’s responses match them (see the sketch after this list). But this approach has fundamental issues:
- Evaluations must be representative and in sync with the product: Test cases have to reflect production usage, which is hard to keep up as your product evolves and user behavior shifts.
- Evaluations are time-consuming: Writing good test cases, and keeping them current, takes real engineering effort.
- Evaluations don’t measure what matters: Passing an evaluation doesn’t guarantee users convert.
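For concreteness, a traditional evaluation is usually a fixed list of prompts with hand-written expected outputs, checked like a unit test. The sketch below assumes a hypothetical `generate(model, prompt)` client function; the point is the structure, not any particular API.

```python
# A traditional LLM evaluation written like a unit test.
# `generate(model, prompt)` is a hypothetical client; replace with your own.

EVAL_CASES = [
    {
        "prompt": "Extract the order ID from: 'My order #48213 never arrived.'",
        "expected": "48213",
    },
    {
        "prompt": "Classify the sentiment of: 'The checkout flow is painless.'",
        "expected": "positive",
    },
]

def run_evals(model: str, generate) -> float:
    """Return the pass rate of hand-written test cases for one model."""
    passed = 0
    for case in EVAL_CASES:
        output = generate(model, case["prompt"])
        if case["expected"].lower() in output.lower():  # simple containment check
            passed += 1
    return passed / len(EVAL_CASES)
```

Every case here has to be written and maintained by hand, which is why this style of testing drifts out of sync with real production traffic.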
Custom benchmarks are easy and representative
Instead of writing evaluations, use your actual product data:
- Take real queries from your production logs.
- Send them to multiple models.
- Measure actual product metrics, not synthetic “correctness.”
- Let the data tell you which model performs best.
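A minimal sketch of that loop, assuming production queries are stored as JSON lines with a "query" field, and using hypothetical `generate(model, query)` and `product_metric(query, response)` placeholders for your own model client and whatever signal your product already tracks (conversion, resolution, acceptance):

```python
import json
import random
import statistics

# Sketch of a custom benchmark built from production data.
# Assumptions: one JSON object per log line with a "query" field;
# `generate` and `product_metric` are placeholders you supply.

def sample_production_queries(log_path: str, n: int = 200) -> list[str]:
    """Draw a random sample of real user queries from production logs."""
    with open(log_path) as f:
        queries = [json.loads(line)["query"] for line in f]
    return random.sample(queries, min(n, len(queries)))

def benchmark(models: list[str], queries: list[str], generate, product_metric) -> dict[str, float]:
    """Score each candidate model by its mean product metric over real queries."""
    results = {}
    for model in models:
        scores = [product_metric(q, generate(model, q)) for q in queries]
        results[model] = statistics.mean(scores)
    return results
```

The ranking that comes out of this reflects your traffic and your metric rather than a public leaderboard’s proxy tasks.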