Limitations of academic benchmarks
Academic benchmarks like MMLU (Massive Multitask Language Understanding), HumanEval, or MT-Bench don’t tell you which model is best for your product. This happens for three reasons:
- Winners of academic benchmarks are often overkill for real-world use cases
- Model creators game the academic benchmarks
- Academic benchmarks don’t generalize well
Academic benchmarks favor expensive models
State-of-the-art models that top academic benchmarks are expensive. Smaller models can be just as effective at real-world use cases despite scoring lower on academic benchmarks. Academic benchmarks evaluate models on tasks that only approximate what your users actually do, and the approximation isn’t perfect: a model that excels at PhD-level multiple-choice questions isn’t guaranteed to excel at customer support tickets or structured text extraction.
- Real example: GPT-3.5 can outperform GPT-4 on real product tasks despite scoring lower on the MMLU benchmark.
Model creators are gaming academic benchmarks
Academic benchmarks are public, and model creators can game them. Recent research has exposed how benchmark scores get manipulated:
- Arena manipulation: A major study by Singh et al., 2025 revealed that companies like Meta, OpenAI, and Google privately test dozens of model variants on the LMSYS Chatbot Arena and then publish scores only for the best-performing version. Meta alone tested 27 private variants before the Llama 4 release.
- Data contamination: Deng et al., 2024 found that GPT-4 could guess 57% of missing answer options in MMLU questions, suggesting the model had seen benchmark data during training. Yang et al., 2023 demonstrated that a 13-billion-parameter model could reach GPT-4-level benchmark performance through contamination.
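As a rough illustration of the answer-reconstruction idea behind such contamination probes (a sketch, not the exact protocol from Deng et al., 2024): hide one answer option from a benchmark question and ask the model to reproduce it. A model that fills in hidden options verbatim far more often than chance has likely seen the benchmark during training. The `complete_prompt` client below is a hypothetical placeholder for whatever model API you use.

```python
# Sketch of an answer-reconstruction contamination probe.
# `complete_prompt(prompt: str) -> str` is a hypothetical client;
# swap in your own API call (OpenAI, Anthropic, a local model, etc.).

def build_probe(question: str, options: dict[str, str], hidden: str) -> str:
    """Show the question with one option removed and ask the model to fill it in."""
    shown = "\n".join(f"{k}. {v}" for k, v in options.items() if k != hidden)
    return (
        f"The following multiple-choice question is missing option {hidden}. "
        "Reply with the exact text of the missing option only.\n\n"
        f"{question}\n{shown}\n{hidden}."
    )

def reconstruction_rate(items: list[dict], complete_prompt) -> float:
    """Fraction of hidden options the model reproduces verbatim."""
    hits = 0
    for item in items:
        probe = build_probe(item["question"], item["options"], item["hidden"])
        guess = complete_prompt(probe).strip().lower()
        if guess == item["options"][item["hidden"]].strip().lower():
            hits += 1
    return hits / len(items)
```

A high reconstruction rate does not prove contamination on its own, but it is a strong signal that the benchmark no longer measures generalization for that model.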
Benchmarks measure what’s easy, not what matters
Academic benchmarks focus on what’s convenient to measure. Very few people use large language models to solve complex math problems (benchmarks like MATH and GSM8K) or answer graduate-level science questions (benchmarks like MMLU and GPQA). Yet these tasks dominate benchmark suites because they’re easy to score automatically. Your business doesn’t run on what’s easy to measure. It runs on:
- Conversion rates: Do responses lead to sales?
- User satisfaction: Does the Net Promoter Score improve?
- Code acceptance: What portion of generated code gets merged?
- Support resolution: How many tickets get resolved on first contact?
- Engagement: Do users continue the conversation?
Custom benchmarks versus evaluations
Evaluations are time-consuming and limited
Traditional evaluations are unit tests for large language models: the engineer writes expected outputs and checks whether the model’s responses match them (see the sketch after this list). But this approach has fundamental issues:
- Evaluations must be representative and in sync with the product: Test cases have to reflect production usage, which is hard to keep up as your product evolves and user behavior shifts.
- Evaluations are time-consuming: Writing good test cases, and keeping them current, takes real engineering effort.
- Evaluations don’t measure what matters: Passing an evaluation doesn’t guarantee users convert.
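For concreteness, a traditional evaluation is usually a fixed list of prompts with hand-written expected outputs, checked like a unit test. The sketch below assumes a hypothetical `generate(model, prompt)` client function; the point is the structure, not any particular API.

```python
# A traditional LLM evaluation written like a unit test.
# `generate(model, prompt)` is a hypothetical client; replace with your own.

EVAL_CASES = [
    {
        "prompt": "Extract the order ID from: 'My order #48213 never arrived.'",
        "expected": "48213",
    },
    {
        "prompt": "Classify the sentiment of: 'The checkout flow is painless.'",
        "expected": "positive",
    },
]

def run_evals(model: str, generate) -> float:
    """Return the pass rate of hand-written test cases for one model."""
    passed = 0
    for case in EVAL_CASES:
        output = generate(model, case["prompt"])
        if case["expected"].lower() in output.lower():  # simple containment check
            passed += 1
    return passed / len(EVAL_CASES)
```

Every case here has to be written and maintained by hand, which is why this style of testing drifts out of sync with real production traffic.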
Custom benchmarks are easy and representative
Instead of writing evaluations, use your actual product data:
- Take real queries from your production logs.
- Send them to multiple models.
- Measure actual product metrics, not synthetic “correctness.”
- Let the data tell you which model performs best.
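A minimal sketch of that loop, assuming production queries are stored as JSON lines with a "query" field, and using hypothetical `generate(model, query)` and `product_metric(query, response)` placeholders for your own model client and whatever signal your product already tracks (conversion, resolution, acceptance):

```python
import json
import random
import statistics

# Sketch of a custom benchmark built from production data.
# Assumptions: one JSON object per log line with a "query" field;
# `generate` and `product_metric` are placeholders you supply.

def sample_production_queries(log_path: str, n: int = 200) -> list[str]:
    """Draw a random sample of real user queries from production logs."""
    with open(log_path) as f:
        queries = [json.loads(line)["query"] for line in f]
    return random.sample(queries, min(n, len(queries)))

def benchmark(models: list[str], queries: list[str], generate, product_metric) -> dict[str, float]:
    """Score each candidate model by its mean product metric over real queries."""
    results = {}
    for model in models:
        scores = [product_metric(q, generate(model, q)) for q in queries]
        results[model] = statistics.mean(scores)
    return results
```

The ranking that comes out of this reflects your traffic and your metric rather than a public leaderboard’s proxy tasks.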