
Tested the 'top of 2025' LLMs on a real task. GPT-3.5 won.

MMLU Pro winners versus MMLU Pro losers: how big is the gap on a single, repeatable task? The result is surprising.


TL;DR: We tested the top 3 and bottom 2 models from MMLU Pro rankings on a real task. The "worst" model (GPT-3.5 Turbo from 2023) beat every "top" model—while costing 97% less than Claude Opus 4.1.

MMLU Pro is a comprehensive benchmark, introduced in 2024, that tests the knowledge and reasoning of LLMs. Claude Opus 4.1 holds the top spot with 87.9% (according to Open Benchmarks).

But how would it compare on a repeatable task instead of an academic questionnaire?

The Task: Extract the author's name from news articles

Simple, right? Not quite. The author's name is buried in the HTML—usually in a photo credit or caption. The model needs to actually read and understand the page structure.

Setup

  • Pull 100 articles from NewsAPI, published on October 13, 2025, using the keywords 'Sydney', 'Australia', and 'Melbourne'. Generic enough. Each record included the author name and a link to the original article.
  • Scrape the raw HTML of the 100 articles
  • Clean the HTML with trafilatura, which extracts just the main article content while stripping navigation, ads, and other clutter (see the sketch after this list)
  • Sanity-check the HTML and filter out articles where no author appears on the page (even though NewsAPI reported one)
  • Ask each model to identify the article's author, using an identical prompt
  • Evaluate the exact response, so rambling gets no points
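
To make the collection and cleaning steps concrete, here is a minimal sketch of pulling the metadata from NewsAPI and cleaning each page with trafilatura. The endpoint parameters, environment-variable handling, and trafilatura options are assumptions about the setup, not the exact pipeline used.

```python
# Minimal sketch of the collection + cleaning steps (not the exact pipeline used).
# Assumes a NEWSAPI_KEY env var and the `requests` and `trafilatura` packages.
import os
import requests
import trafilatura

NEWSAPI_URL = "https://newsapi.org/v2/everything"

def fetch_articles(api_key: str, page_size: int = 100) -> list[dict]:
    """Pull article metadata (title, author, url) for the target day and keywords."""
    params = {
        "q": "Sydney OR Australia OR Melbourne",
        "from": "2025-10-13",
        "to": "2025-10-13",
        "pageSize": page_size,
        "apiKey": api_key,
    }
    resp = requests.get(NEWSAPI_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["articles"]

def clean_article(url: str) -> str | None:
    """Download the raw page and let trafilatura strip nav, ads, and other clutter."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)

if __name__ == "__main__":
    articles = fetch_articles(os.environ["NEWSAPI_KEY"])
    cleaned = [(a["author"], clean_article(a["url"])) for a in articles if a.get("author")]
```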

The system prompt was dead simple:

```
Extract the author(s) from this HTML of a news article.

Return format:
Single author: first_name last_name
Multiple authors: first_name last_name and first_name last_name

Examples:
Jane Smith
John Doe and Mary Johnson
```

Models tested:

  • Top 3 from MMLU Pro: Claude Opus 4.1, Gemini 2.5 Pro, GPT-5
  • Bottom 2 from MMLU Pro: Ministral 8B, GPT-3.5 Turbo
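
Every model received the same system prompt plus one article's cleaned HTML per request. Here is a minimal sketch of what a single call might look like, shown with the openai Python client for GPT-3.5 Turbo; the model settings (such as temperature) and client wiring are assumptions, and the other providers would go through their own SDKs.

```python
# Minimal sketch of a single evaluation call (OpenAI client shown; the other
# providers were queried through their own SDKs). Settings are assumptions,
# not the exact configuration used in the test.
from openai import OpenAI

SYSTEM_PROMPT = """Extract the author(s) from this HTML of a news article.

Return format:
Single author: first_name last_name
Multiple authors: first_name last_name and first_name last_name

Examples:
Jane Smith
John Doe and Mary Johnson"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_author(article_html: str, model: str = "gpt-3.5-turbo") -> str:
    """Send the identical system prompt plus one article's cleaned HTML."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": article_html},
        ],
    )
    return response.choices[0].message.content.strip()
```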

The first result was... surprising

| Model | MMLU-Pro Score (Open Benchmarks) | Our Test Accuracy | Avg Response Time | Cost per 1M Requests |
|---|---|---|---|---|
| GPT-3.5 Turbo | 46.9% (bottom tier) | 61.5% | 489.769 ms | $823.87 |
| GPT-5 | 87.1% (2nd place) | 59.6% | 13,305.827 ms | $8,099.45 |
| Gemini 2.5 Pro Preview 06-05 | 86.9% (3rd place) | 55.8% | 15,612.25 ms | $15,799.09 |
| Ministral 8B | 42.8% (bottom tier) | 40.4% | 696.615 ms | $167.47 |
| Claude Opus 4.1 | 87.9% (1st place) | 34.6% | 3,972.904 ms | $30,390.87 |

The 2023 model beat the 2025 flagship. By a lot. And it did so at roughly 3% of the Claude cost ($823 vs $30,390 per 1M requests).
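
For context on where the "cost per 1M requests" figures come from, here is the shape of the calculation: average tokens per request times per-token pricing, scaled up to a million requests. The token counts and prices in the example are illustrative placeholders, not the actual numbers behind the table.

```python
# Back-of-the-envelope shape of the "cost per 1M requests" figures.
# Per-token prices and token counts below are illustrative placeholders,
# not the actual numbers behind the table.
def cost_per_million_requests(
    avg_input_tokens: float,
    avg_output_tokens: float,
    price_in_per_1m_tokens: float,   # USD per 1M input tokens
    price_out_per_1m_tokens: float,  # USD per 1M output tokens
) -> float:
    cost_per_request = (
        avg_input_tokens / 1_000_000 * price_in_per_1m_tokens
        + avg_output_tokens / 1_000_000 * price_out_per_1m_tokens
    )
    return cost_per_request * 1_000_000

# Example with made-up values: ~1,500 input tokens of cleaned HTML,
# ~10 output tokens for a name, at $0.50 / $1.50 per 1M tokens.
print(cost_per_million_requests(1500, 10, 0.50, 1.50))  # -> 765.0
```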

Here's what happened

Take a single example: this ABC News article on Tasmanian saltmarsh restoration.

Claude Opus 4.1 and Gemini 2.5 Pro responded:

"Based on the HTML content, I cannot identify a specific author..."

Meanwhile, GPT-3.5 Turbo just said:

"Madeleine Rojahn"

(Which was correct.)

The irony: Claude often did identify the correct author—it just buried the answer in verbose explanations about HTML structure, metadata analysis, and confidence levels. The system prompt asked for a simple format. Claude gave a dissertation. GPT-3.5 just followed instructions.

The "smarter" models overthought it. The older model just did the job.

That can't be true

I didn't believe it, so I reran the test three times: same setup, same models, just repeated, to see how much the result actually varied.

| Model | Our Test Accuracy | Avg Response Time | Cost per 1M Requests |
|---|---|---|---|
| GPT-3.5 Turbo | 61.5% | 489.769 ms | $823.87 |
| GPT-5 | 59.6% | 13,305.827 ms | $8,099.45 |
| Gemini 2.5 Pro Preview 06-05 | 55.8% | 15,612.25 ms | $15,799.09 |
| Ministral 8B | 40.4% | 696.615 ms | $167.47 |
| Claude Opus 4.1 | 34.6% | 3,972.904 ms | $30,390.87 |

The initial 61.5% was on the higher end of the variance.

The next three tests put GPT-3.5 at 50.0%, 51.9%, and 53.8%, averaging around 54% across all four runs.

Here's what stayed consistent:

  • GPT-3.5 remained comparable to GPT-5 and Gemini 2.5 Pro. Despite being two years older and scoring roughly 40 points lower on MMLU Pro, it matched the flagship models on this real-world task.

  • GPT-3.5 consistently beat Claude Opus 4.1. Every single time. The MMLU Pro champion couldn't compete with the "bottom tier" model.

  • Ministral 8B performed comparably to Claude Opus 4.1. Despite a roughly 45-point gap on MMLU Pro (42.8% vs 87.9%), the two models scored within 6 percentage points of each other on this task, suggesting the benchmark massively overestimated the performance difference.

[Chart: GPT-3.5 Turbo outperforming flagship models (GPT-5, Gemini 2.5 Pro, Claude Opus 4.1) on a real-world task, despite ranking lower on MMLU Pro]

Takeaway: leaderboards like MMLU Pro don't give the full picture.

MMLU Pro tests academic knowledge and reasoning. Our task tested practical instruction-following on messy, real-world data.

Completely different skill sets. And the benchmark didn't predict which model would win.

Your use case isn't MMLU. Your use case needs its own leaderboard.


Build your own benchmark in 5 minutes → Start testing for free