Optimize
Test systematically. Measure ruthlessly. Deploy winners confidently. Here's how to cut costs without sacrificing quality.
At the end of this step, you should have answers to the following questions:
- Which use case are we optimizing first, and why?
- What are our success criteria? (Minimum acceptable performance + primary metric)
- What did our tests reveal? (Which configuration won, and by how much?)
0. Intro
You've identified where your money goes (Step 2) and what you're optimizing for (Step 1). Now comes the fun part: finding cheaper, faster, or better configurations.
But here's the trap: Most teams see a 90% cost reduction in a blog post, swap their model, and immediately regret it. Quality tanks. Users complain. They roll back and assume "cheaper models don't work."
That's not true. What doesn't work is blind changes without testing.
The right approach is systematic: test one variable at a time, measure the impact, keep what works.
Always optimize in this order:
- Prompt engineering (free, fast, often dramatic impact)
- Model selection (where the biggest cost savings live)
- Parameter tuning (the final 10-20% of improvement)
Let's break down each step.
1. Optimize your prompt first
Before you touch the model or parameters, squeeze every drop of performance from your prompt. Small changes can have massive impact—and they cost nothing.
1.1 Be specific about format
❌ "Classify this email"
✅ "Classify this email. Return only: spam, urgent, or normal"
1.2 Constrain output length
❌ "Summarize this article"
✅ "Summarize this article in exactly 3 bullet points, max 15 words each"
1.3 Use structured output
❌ "Extract the customer's name, email, and issue"
✅ "Return JSON:
{"name": string, "email": string, "issue": string}"
1.4 Remove unnecessary instructions
- Don't say "You are a helpful assistant" if it doesn't affect output
- Don't ask for explanations if you only need the answer
- Don't request markdown formatting if plain text works
Don't over-optimize blindly. Test every prompt change. Sometimes verbosity helps. Sometimes "explain your reasoning" actually improves accuracy. Let data decide.
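When you do test a prompt change, compare it against your current prompt on the same samples instead of eyeballing a few outputs. Here's a minimal sketch, where `call_model` and `score_output` are hypothetical stand-ins for your API client and your quality rubric:

```python
# Minimal sketch: compare two prompt variants on the same sample set.
# `call_model` and `score_output` are hypothetical stand-ins for your
# API client and your quality rubric.
from statistics import mean

def compare_prompts(prompt_a: str, prompt_b: str, samples: list[str]) -> None:
    results = {"A": [], "B": []}
    for text in samples:
        for label, template in (("A", prompt_a), ("B", prompt_b)):
            output, usage = call_model(template.format(input=text))
            results[label].append({
                "quality": score_output(text, output),  # your rubric, e.g. 0-1
                "tokens": usage.total_tokens,           # what drives cost
            })
    for label, rows in results.items():
        print(
            f"Prompt {label}: "
            f"quality={mean(r['quality'] for r in rows):.2f}, "
            f"avg tokens={mean(r['tokens'] for r in rows):.0f}"
        )
```

If the cheaper variant holds quality and cuts tokens, ship it; if not, you've lost nothing but a test run.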
2. Test different models
Prompt optimization alone might have saved you enough to call it a win - and that's perfectly valid. But if you're ready to push further, model testing is where the really dramatic savings hide. Just know: this is also where quality can slip if you're not careful.
The secret: There are over 300 models available. Most teams only try 3 to 5.
2.1 Ignore the leaderboards
Before you start testing, ignore everything you've read about model benchmarks.
Here's the uncomfortable truth: Benchmarks don't predict real-world performance on your use case.
There are hundreds of benchmarks measuring model intelligence:
- MMLU (general knowledge)
- HumanEval (code generation)
- GPQA (graduate-level reasoning)
- HellaSwag (commonsense reasoning)
- TruthfulQA (factual accuracy)
- BBH (reasoning challenges)
- MT-Bench (multi-turn conversations)
- ...and 200+ more
What do these benchmarks actually measure? How well models respond to benchmarks.
That's it. They don't measure:
- Performance on your prompts
- Performance on your data distribution
- Performance on your edge cases
- Cost efficiency for your use case
- Latency in your infrastructure
A model that scores 94% on MMLU might be terrible at classifying your support tickets. A model that ranks #47 on the leaderboard might be perfect for generating your product descriptions.
The correlation between benchmark scores and real-world performance on specific tasks is weak at best.
Why benchmarks mislead
1. They're not your data
Benchmarks test on curated datasets:
- Academic questions with clear right answers
- Sanitized inputs with no typos or edge cases
- Usually English-only, while your users might write in Spanglish
Your real data is messy. Users make typos. They write run-on sentences. They reference context you need to infer.
2. They're not your prompts
Benchmarks use standardized prompts optimized for the test. Your production prompts are different—you've tuned them for your specific use case, added company context, constrained output format.
3. Gaming is rampant
Models are increasingly trained on benchmark data. A model that scores 96% on HumanEval isn't necessarily a better coder—it might just have seen those exact problems during training.
4. They ignore what you care about
Does the benchmark measure:
- Token efficiency? (No. Verbosity is fine in benchmarks)
- Output consistency? (No. One correct answer is enough)
- Latency? (No. Time doesn't matter)
- Cost per successful outcome? (No. Accuracy alone wins)
But these metrics determine whether a model actually works for your business.
The only benchmark that matters is your benchmark. Test on your data, with your prompts, measuring your metrics. Everything else is marketing.
2.2 Models to consider
Don't just test the obvious choices (GPT-4, Claude, Gemini). Explore:
Lightweight models from major providers:
- GPT-4o-mini, GPT-4.1-nano
- Claude Haiku, Claude Sonnet
- Gemini Flash, Gemini Flash 8B
Open-source models via API:
- Llama 3.1 (8B, 70B, 405B)
- Mixtral 8x7B, Mixtral 8x22B
- Qwen 2.5 (various sizes)
- Command R, Command R+
Specialized models:
- Anthropic's models for analysis and reasoning
- Cohere for classification and embeddings
- OpenAI o1 for complex reasoning tasks
2.3 Testing approach
For the use case you prioritized in Step 2, run parallel tests on your actual data:
Example: Product description generator
Run 1,000 real product titles through:
- GPT-4o (baseline: $15/1M output tokens)
- Claude Sonnet ($15/1M output tokens)
- GPT-4o-mini ($0.60/1M output tokens)
- Claude Haiku ($1.25/1M output tokens)
- Llama 3.1 70B ($0.88/1M output tokens)
Measure what actually matters for your business:
- Quality: Manual review of 100 samples, or automated eval against your rubric
- Cost: Actual tokens consumed × model pricing
- Latency: P50, P95, P99 response times
- Consistency: Do outputs vary wildly, or are they stable?
- Edge case handling: How does it perform on your weird/broken/unusual inputs?
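Here's a minimal sketch of such a head-to-head run, assuming an OpenAI-compatible endpoint that can serve each candidate model; the model names and prices are illustrative placeholders for your own shortlist and your provider's current price sheet:

```python
# Minimal sketch: run the same inputs through several candidate models and
# collect cost and latency. Assumes an OpenAI-compatible endpoint; the model
# names and prices below are illustrative placeholders.
import time
from statistics import quantiles
from openai import OpenAI

client = OpenAI()  # point base_url at your gateway/router if you use one

PRICE_PER_1M_OUTPUT = {  # fill in from your provider's current pricing
    "gpt-4o": 15.00,
    "gpt-4o-mini": 0.60,
}

def benchmark(model: str, prompts: list[str]) -> dict:
    latencies, cost, outputs = [], 0.0, []
    for prompt in prompts:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        latencies.append(time.perf_counter() - start)
        cost += resp.usage.completion_tokens / 1e6 * PRICE_PER_1M_OUTPUT[model]
        outputs.append(resp.choices[0].message.content)
    cuts = quantiles(latencies, n=100)  # percentile cut points
    return {
        "model": model,
        "cost_usd": cost,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "outputs": outputs,
    }
```

Run it over the same set of real inputs for every candidate, then review a sample of the collected outputs against your quality rubric before you compare the cost and latency numbers.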
Typical outcome:
- 2-3 models meet your quality bar
- The cheapest acceptable model is 60-95% cheaper than your current choice
- One model is surprisingly good (often one you've never heard of)
- The leaderboard winner might not even crack your top 3
Cast a wide net. That obscure model ranked #47 on the leaderboard might be perfect for your use case. The #1 model might be overkill. You won't know until you test on your data.
3. Tune parameters for the final edge
After you've picked the right prompt and model, squeeze out the last 10-20% with parameter tuning.
Key parameters and their impact
| Parameter | Range | When to increase | When to decrease | Impact on cost | Impact on quality |
|---|---|---|---|---|---|
| Temperature | 0.0 to 2.0 | Need creativity, variety, brainstorming | Need consistency, factual accuracy | Indirect - can change output length | High - controls randomness |
| Max tokens | 1 to model limit | Outputs often truncated | Getting unnecessarily long outputs | Direct - fewer tokens = lower cost | Medium - truncation can hurt quality |
| Top P | 0.0 to 1.0 | Want more diverse vocabulary | Want more predictable outputs | None | Medium - controls word choice diversity |
| Frequency penalty | -2.0 to 2.0 | Model repeats itself too much | Outputs feel unnatural or disjointed | None | Low - reduces repetition |
| Presence penalty | -2.0 to 2.0 | Want model to explore new topics | Want model to stay focused | None | Low - encourages topic diversity |
| Stop sequences | Custom strings | Want to truncate at specific markers | Model stops too early | Direct - early stopping = fewer tokens | Low - mostly for formatting |
Parameters interact. Changing temperature affects output length, which affects cost. Test configurations as a whole, not in isolation.
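For reference, here's where these parameters live in an OpenAI-style chat completion call. The values shown are illustrative starting points for a low-variance task, not recommendations:

```python
# Minimal sketch: where these parameters live in an OpenAI-style chat call.
# Values are illustrative starting points, not recommendations.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # whichever model won your tests
    messages=[{"role": "user", "content": "Summarize this article in 3 bullet points: ..."}],
    temperature=0.2,       # low randomness for consistent, factual output
    max_tokens=150,        # hard cap on output length (and therefore cost)
    top_p=1.0,             # leave at default unless you're tuning it deliberately
    frequency_penalty=0.0, # raise slightly if outputs repeat themselves
    presence_penalty=0.0,  # raise to nudge the model toward new topics
    stop=["\n\n\n"],       # cut generation at a formatting marker
)
print(response.choices[0].message.content)
```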
4. Common pitfalls to avoid
4.1 Optimizing without measurement
You can't know if you've improved without baselines. Set up tracking (Step 2) before you optimize.
4.2 Changing too many variables at once
If you change prompt + model + parameters simultaneously, you won't know what worked. Test one variable at a time.
4.3 Testing on toy datasets
10 examples won't tell you how the model behaves at scale. Use at least 100-500 real samples, ideally production traffic.
4.4 Ignoring edge cases
Your model might work great on average but fail catastrophically on 1% of inputs. Test the weird stuff, not just the happy path.
4.5 Deploying winners too fast
Models behave differently under load. Always do gradual rollouts with monitoring.
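A minimal sketch of a percentage-based rollout, using a stable hash so each user consistently sees the same variant; `pick_model` is a hypothetical helper you'd wire into whatever routing layer you already have:

```python
# Minimal sketch: percentage-based rollout of a cheaper model. A stable hash
# keeps each user on the same variant across requests. Hypothetical helper;
# plug it into your existing routing layer.
import hashlib

def pick_model(user_id: str, rollout_percent: int,
               new_model: str = "gpt-4o-mini",
               old_model: str = "gpt-4o") -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_model if bucket < rollout_percent else old_model

# Start at 5%, watch quality and latency dashboards, then ramp up.
model = pick_model(user_id="user-123", rollout_percent=5)
```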
4.6 Stopping after one optimization
You've got 5-10 use cases burning money. Build a rhythm: optimize one use case per month.
5. You've completed the framework
If you've followed all three steps, you now have:
- Clear objectives (Step 1) - You know what you're optimizing for and who makes decisions
- Cost visibility (Step 2) - You know where every dollar goes and who owns it
- Optimization wins (Step 3) - You've proven you can cut costs 40-90% without sacrificing quality
This is a competitive advantage. While other teams burn through budgets on unoptimized infrastructure, you're delivering better experiences for a fraction of the cost. The efficiency gap compounds. Keep optimizing.
Want to move faster? Narev eliminates the tedious parts—routing, testing, monitoring, deployment. Sign up for the free tier and optimize your first use case today.