The MMLU Benchmark Reproducibility Problem
MMLU-Pro scores vary by 13 points for the same model depending on who's measuring. Yet the "top" models differ by just 1%.
TL;DR: GPT-4o's MMLU-Pro score varies by 13 percentage points depending on who measures it. Meanwhile, the "top" three models differ by just 1%. The numbers you're using to pick your LLM are essentially meaningless.
MMLU-Pro is one of the hottest benchmarks in AI. It's the harder 2024 successor to MMLU (Massive Multitask Language Understanding), a benchmark built to measure "a text model's multitask accuracy."
As the authors state: "To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability."
But when the same model's score shifts by 13 percentage points depending on the source, the number loses its meaning.
The Results Are Not Reproducible
Here's a fun experiment: Look up GPT-4o's MMLU-Pro score across different sources.
- According to the original paper? 72.6%
- According to Kaggle Open Benchmarks? 73.0% ±0.8%
- According to LLM Stats? 85.7%
- According to Artificial Analysis? 74.0% (GPT-4o May), 74.8% (GPT-4o Nov), or 77.3% (GPT-4o ChatGPT)
So... which one is real?
The answer: None of them. All of them. It doesn't matter.
Are any of them reproducible? Hard to tell. What we know for sure is that every source reports a different number.
The top models are separated by rounding errors
This gets especially absurd when you look at the top performers. On Kaggle's Open Benchmarks leaderboard, the top three models sit within a single percentage point of each other:
- Claude Opus 4.1 (2025-08-05): 87.9%
- GPT-5 (2025-08-07): 87.1%
- Claude Opus 4 (2025-05-14): 86.9%
Given that GPT-4o's score swings by 13 percentage points depending on who's measuring, a one-point gap is well within the noise. Measured that loosely, GPT-4o itself could land almost anywhere in this ranking.
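To put that in perspective, here's a quick back-of-the-envelope check. It uses only the numbers quoted in this post (the GPT-4o scores listed above and the Kaggle top-three gap); nothing else is assumed:

```python
# Back-of-the-envelope check using only the numbers quoted in this post.
# How does the spread of GPT-4o's reported MMLU-Pro scores compare with the
# gap that separates the "top three" models on Kaggle's leaderboard?
reported_gpt4o = [72.6, 73.0, 85.7, 74.0, 74.8, 77.3]

spread = max(reported_gpt4o) - min(reported_gpt4o)  # ~13.1 points
top_three_gap = 87.9 - 86.9                         # 1.0 point

print(f"GPT-4o spread across sources: {spread:.1f} points")
print(f"Gap across the 'top three':   {top_three_gap:.1f} points")
print(f"Measurement spread is {spread / top_three_gap:.0f}x the ranking gap")
```

A ranking gap that's an order of magnitude smaller than the measurement spread tells you nothing about which model is actually better.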
Why Does This Happen?
Different leaderboards use:
- Different prompting strategies
- Different evaluation protocols (how they parse answers)
- Different testing conditions
The "standardized" benchmark isn't standardized at all.
What this means for you
If you're choosing an LLM based on MMLU scores, you're essentially picking based on vibes. The model that's "#1" on one leaderboard is #4 on another.
The solution? Test models on your actual use case. With your data. Under your conditions.
Running your own benchmark shifts everything
Want to see how much the rankings change when you test on real tasks instead of MMLU?
I ran a simple benchmark using 100 articles from NewsAPI about Australia, Sydney, and Melbourne, asking models to extract author names from the HTML. A straightforward task, but one that requires understanding real-world data structure.
The results completely flipped the leaderboard. See the full analysis here. But here's the kicker: GPT-3.5 Turbo (from 2023) was comparable to every flagship 2025 model, and beat the MMLU-Pro champion Claude Opus 4.1.
That's the only number that matters—the one that reflects your actual use case.
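If you want to run something similar yourself, here's a rough sketch of that kind of harness. It assumes the articles have already been fetched (e.g. from NewsAPI) into a JSON file with the raw HTML and a hand-labelled author for each one, and call_model() is a placeholder for whichever LLM client you actually use:

```python
# A minimal sketch of an "extract the author from HTML" benchmark.
# Assumptions: articles.json already exists with hand-labelled authors,
# and call_model() is wired up to your own LLM provider.
import json

def call_model(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text response."""
    raise NotImplementedError("wire this up to your LLM provider's client")

def run_benchmark(model: str, dataset_path: str = "articles.json") -> float:
    with open(dataset_path) as f:
        articles = json.load(f)  # [{"html": "...", "author": "Jane Doe"}, ...]

    correct = 0
    for article in articles:
        prompt = (
            "Extract the author's name from this article HTML. "
            "Reply with the name only.\n\n" + article["html"]
        )
        answer = call_model(model, prompt).strip()
        # Case-insensitive exact match keeps scoring simple; fuzzier matching
        # (partial names, honorifics) is a call you get to make for your data.
        correct += answer.lower() == article["author"].lower()

    return correct / len(articles)

# Example: compare whichever models you care about on your own data.
# for m in ["model-a", "model-b"]:
#     print(m, run_benchmark(m))
```

The point isn't this particular task; it's that the prompt, the parsing rule, and the scoring rule are all yours, so the resulting number actually means something for your workload.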
Build your own benchmark in 5 minutes → Start testing for free