Reduce LLM spend by prompt engineering

How a shorter, simpler prompt cut costs by 23% with the same accuracy and faster responses

The Scenario

You've picked the right model for your email routing task. Everything works perfectly—emails get classified correctly, routed to the right teams, and your customers are happy. But then you notice something interesting in your logs: you're spending more on input tokens than output tokens. The question worth asking is: does your prompt need to be that long?

This is a very simple example illustrating the mechanism of A/B testing for LLMs. If you're after something more complex, check our other guides.

The Hypothesis

Many developers write verbose, detailed prompts thinking more context equals better results. But for straightforward tasks like email classification, a shorter prompt might work just as well. Every word in your system prompt costs money—it's sent with every single request. So there's a good chance that a concise, direct prompt could deliver the same accuracy while cutting costs by reducing input tokens.
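
To see why prompt length matters for the bill, here's a minimal sketch of the arithmetic. It assumes the tiktoken library as a stand-in tokenizer and an illustrative input price; your model's tokenizer and your provider's rate will differ.

# Estimate the monthly input-token cost attributable to a system prompt.
# Assumptions: tiktoken's cl100k_base encoding as a rough proxy for your
# model's tokenizer, and an illustrative price; plug in your real rate.
import tiktoken

INPUT_PRICE_PER_1M_TOKENS = 0.05  # USD, assumed for illustration
enc = tiktoken.get_encoding("cl100k_base")

def monthly_prompt_cost(system_prompt: str, requests_per_month: int) -> float:
    n_tokens = len(enc.encode(system_prompt))
    return n_tokens * requests_per_month * INPUT_PRICE_PER_1M_TOKENS / 1_000_000

# At 1M requests/month, a 443-token prompt vs a 225-token one means
# 218M extra input tokens per month before the model reads a single email.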

But hunches aren't enough. Let's test this hypothesis.

Test Configuration

This example shows a side-by-side comparison run on the Narev Platform, testing two different prompts with the same model (gpt-oss-20b) on identical customer emails.

The only variable? The prompt style.

This creates a clean way to see how prompt length affects cost, speed, and accuracy.

[Image: Prompt experiment setup showing side-by-side comparison]

Prompts Tested

Concise Prompt (225 tokens average):

Classify this customer email into one category:

TECHNICAL_SUPPORT: bugs, crashes, sync issues, app problems
BILLING_INQUIRY: payments, invoices, refunds, subscription questions
SALES: demos, pricing, partnerships, new customer inquiries

Route to: mobile_dev_team, billing_team, sales_team, or escalation_team

Respond in JSON:
{"category": "", "department": ""}

Verbose Prompt (443 tokens average):

You are a professional email classification system for customer support operations.
Your task is to accurately categorize incoming emails and route them to appropriate
departments for optimal response times and customer satisfaction.

CLASSIFICATION CATEGORIES:

1. TECHNICAL_SUPPORT:
   - Application bugs, crashes, freezing, performance issues
   - Installation difficulties, setup problems, configuration errors
   - Device compatibility, sync failures, API integration issues
   - Feature malfunctions, UI problems, connectivity errors

2. BILLING_INQUIRY:
   - Invoice questions, payment disputes, refund requests
   - Subscription changes, billing cycles, pricing inquiries
   - Account management, payment method updates
   - Tax questions, international billing concerns

3. SALES:
   - Product demonstrations, pricing information, quote requests
   - Partnership inquiries, competitive evaluations
   - Implementation planning, volume licensing discussions
   - New customer onboarding, trial requests

ROUTING GUIDELINES:
Technical issues should be directed to mobile_dev_team (for mobile app problems),
or escalation_team (for critical outages). Billing matters go to billing_team.
Sales inquiries route to sales_team.

ANALYSIS PROCESS:
Carefully examine the email subject and content for key indicators. Consider the
sender's business context, urgency level, and specific terminology used. Look for
technical keywords, financial references, or sales-related language to determine
proper classification.

Please provide your classification as a JSON response with "category" and
"department" fields only.

What Gets Measured

The test tracks three critical metrics for each prompt:

  • the cost per request (shorter prompts = fewer input tokens = lower costs)
  • the speed of classification (measured by time to first token)
  • the accuracy of the routing decisions (cheaper is only worth it if it still works)
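
If you're instrumenting these metrics yourself, streaming the response makes time to first token easy to capture, and the usage block gives you exact input-token counts. A minimal sketch, reusing the client from the sketch above; the price constant is an assumption, not a real gpt-oss-20b rate.

# Measure time to first token and input-token cost for one request.
import time

INPUT_PRICE_PER_1M_TOKENS = 0.05  # USD, assumed for illustration

def measure(system_prompt: str, email: str) -> None:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": email},
        ],
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries usage
    )
    ttft = None
    usage = None
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # time to first token
        if chunk.usage is not None:
            usage = chunk.usage
    if ttft is not None and usage is not None:
        cost = usage.prompt_tokens * INPUT_PRICE_PER_1M_TOKENS / 1_000_000
        print(f"TTFT {ttft * 1000:.0f}ms, "
              f"{usage.prompt_tokens} input tokens, ${cost:.8f}")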

Results

The results reveal something surprising: the verbose, "professional" prompt doesn't improve accuracy at all, and it loses on both cost and speed. Both prompts achieve 100% accuracy on the test emails, but the concise prompt uses nearly half the input tokens AND responds faster. The verbose prompt's extra context and formatting don't add value; they just burn tokens and slow things down.

[Image: Prompt experiment results comparison]

Prompt Style | Cost (avg)       | Speed (avg)   | Accuracy
Verbose      | $26.57/1M        | 2,108ms       | 100%
Concise      | $20.39/1M (-23%) | 1,941ms (-8%) | 100%

Winner: Concise Prompt

The concise prompt wins decisively. It saves $6.18 per million requests (a 23% reduction) AND responds 167ms faster per request (8% faster) while maintaining 100% accuracy. There's no tradeoff here: the simpler prompt is better in every way. The verbose prompt's detailed categorization guidelines and analysis instructions don't just waste money; they also slow down inference.

Example in Action

Let's look at how the prompts handle a typical sales inquiry. This is the kind of email that should be an easy win—clear enterprise sales intent, multiple specific questions, and an explicit request for a demo.

Subject: Enterprise plan pricing
From: cto@fastgrowth.io

Hi there,

We're a 200-person company looking to upgrade from our current solution. Could you send me information about:

  • Enterprise plan features
  • Volume discounts
  • Implementation timeline
  • API rate limits

We'd also like to schedule a demo for next week.

Thanks,
Mike Chen, CTO

Should route to: Sales team
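
Given the categories and routing rules defined in both prompts, the expected response for this email would look like:

{"category": "SALES", "department": "sales_team"}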

How Each Prompt Did

Prompt Style | Got it right?
Verbose      | ✅ Yes
Concise      | ✅ Yes

Good news: both prompts correctly route this email to sales. This example demonstrates that for straightforward cases (which make up the majority of customer emails), the concise prompt performs just as accurately as the verbose one.

[Image: Individual prompt result comparison]

What This Shows

Prompt verbosity doesn't equal better results. For straightforward tasks like email classification, a concise prompt is 23% cheaper AND 8% faster while maintaining 100% accuracy.

You don't have to compromise on anything. The simpler prompt wins on cost, speed, AND quality: perfect classification that costs less and responds faster on every single request.

The Takeaway

Using a concise prompt saves $6.18 per million emails (23% cost reduction) and responds 167ms faster (8% speed improvement) while maintaining 100% accuracy. Pay $20.39 instead of $26.57 per million requests.
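
What that adds up to depends on your volume. A quick back-of-the-envelope using the rates measured above; the monthly volumes are assumptions for illustration.

# Project monthly savings at a few assumed volumes, using the per-1M costs
# measured in this experiment.
VERBOSE_COST_PER_1M = 26.57   # USD per million requests
CONCISE_COST_PER_1M = 20.39

for emails_per_month in (100_000, 1_000_000, 10_000_000):
    saved = (VERBOSE_COST_PER_1M - CONCISE_COST_PER_1M) * emails_per_month / 1_000_000
    print(f"{emails_per_month:>10,} emails/month -> ${saved:,.2f}/month saved")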

[Image: Overall impact of prompt optimization]

There's no reason to use the verbose prompt. The concise prompt wins on every metric: lower costs, faster responses, and the same perfect accuracy. The savings and speed improvements compound with every single API call.


Want to test which prompt works best for your use case? Start testing for free →