Narev

Narev is a platform for optimizing Gen AI apps

Faster, cheaper, better.

Academic benchmarks are gameable

Models pass tests without reading them.
Test answers leak into training.
Change one number in the question, and accuracy drops 65%.

So why not run one yourself?

And stop writing evals forever

Benchmarks test:

Can it answer question 47?
Does it match this regex?
What's the BLEU score?

You should test:

Conversion rate
Time to resolution
ESLint errors
User satisfaction
Retry rate
Net Promoter Score
Revenue impact

These only exist in production. A/B test to measure them.
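
A rough sketch of the mechanics in Python: assign each user to a variant deterministically, then attribute the production metric to that variant. The function names and the 50/50 split are illustrative, not Narev APIs.

import hashlib

VARIANTS = ["control", "candidate"]

def assign_variant(user_id: str) -> str:
    # Deterministic bucketing: the same user always sees the same variant.
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return VARIANTS[digest % len(VARIANTS)]

def log_outcome(user_id: str, converted: bool, seconds_to_resolution: float) -> None:
    # Illustrative sink: record the metric next to the variant label so
    # conversion rate and time to resolution can be compared per variant.
    print(assign_variant(user_id), converted, seconds_to_resolution)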

Step 1: Connect your stack.

Yes, we integrate. Enter your credentials; we handle the rest.

Works with your stack
OpenAI, Anthropic, AWS Bedrock, LangSmith, OpenRouter - if you use it, we support it.
No setup required
We pull data from where it lives. Your team does nothing.

Direct Providers

OpenAI
ElevenLabs
Anthropic
Midjourney
AWS
Azure
GCP
Cohere
Mistral

Gateways

LiteLLM
OpenRouter
Portkey
Helicone Gateway
AWS Bedrock
Vertex AI

Traces

Helicone
Langfuse
LangSmith
Weights and Biases

Imports

JSON
JSONL
CSV

Or call our gateway directly.
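
If you call the gateway directly, the request shape stays OpenAI-compatible. A sketch using the official openai Python client; the base URL and key are placeholders, not a documented Narev endpoint.

from openai import OpenAI

# Placeholder base URL and key: substitute the gateway details from your account.
client = OpenAI(base_url="https://gateway.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
)
print(response.choices[0].message.content)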

Step 2: Define a variant.
Or pick from our library.

Run different LLM configurations and see the impact instantly.

Define a Variant

Clone a configuration to get started quickly

Model: gpt-4-turbo-preview
System prompt: You are a helpful assistant...
Temperature: 0.7
Max tokens: 2048
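
A variant is just that bundle of settings expressed as data. A sketch of what one could look like; the field names are illustrative, not a fixed Narev schema.

variant = {
    "name": "baseline",
    "provider": "openai",
    "model": "gpt-4-turbo-preview",
    "system_prompt": "You are a helpful assistant...",
    "temperature": 0.7,
    "max_tokens": 2048,
}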

Variant Library

Define model, provider, system prompt, parameters.

GitHub Copilot: GPT-5
Base44: Claude 3.5 Sonnet
v0: Claude 3.7 Sonnet
Lovable: Claude 3.7 Sonnet

Clone any of them to get started.

Step 3: Hit run.
Skip the evals.

A/B test the variant. The only true benchmark is your production data.

Each test reports price impact, quality impact, latency impact, and a recommendation.

Example tests:

System Prompt Optimization
GPT-4 vs Claude-3
Max Tokens 1000 vs 2000
Temperature 0.1 vs 0.7
Prompt Engineering Test
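
Under the hood this is ordinary A/B analysis. A standard-library sketch comparing conversion rates between two variants with a two-proportion z-test; the counts are made up, and this is not Narev's analysis code.

from math import erf, sqrt

def conversion_lift(conversions_a, n_a, conversions_b, n_b):
    # Two-proportion z-test: is variant B's conversion rate different from A's?
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return p_b - p_a, p_value

# Made-up counts: variant B converts 220/1000 users vs A's 180/1000.
lift, p = conversion_lift(180, 1000, 220, 1000)
print(f"lift={lift:+.1%}, p={p:.3f}")  # lift=+4.0%, p~0.025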

Finally, optimize

Visualize the cost-quality tradeoffs across different model variants. Find the sweet spot for your use case.

Cost vs Quality

The chart plots cost against quality (measured here by GSM8K accuracy) for each variant.

Ways to Optimize

Click to highlight different optimization strategies. One example: best output quality regardless of cost. Choose it when quality is the top priority and budget is flexible. Selected here: Claude Haiku 4.5 at 100.0% quality for $68,250.00.
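
Reading the chart programmatically comes down to one rule: among variants that clear your quality bar, take the cheapest. The numbers below are invented for illustration.

variants = [
    # (name, cost in USD per 1K queries, quality score) - invented numbers
    ("haiku", 0.8, 0.86),
    ("sonnet", 4.0, 0.93),
    ("opus", 20.0, 0.97),
]

def sweet_spot(variants, min_quality):
    # Cheapest variant that still meets the quality bar.
    eligible = [v for v in variants if v[2] >= min_quality]
    return min(eligible, key=lambda v: v[1]) if eligible else None

print(sweet_spot(variants, min_quality=0.90))  # -> ('sonnet', 4.0, 0.93)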

Support for every modality

Text, audio, image, and video - we handle it all

Text

Build faster code agents by routing between GPT-4 for complex logic and Claude for quick refactors.

Audio

Reduce latency for real-time transcription. Route to Deepgram for speed, Whisper for accuracy.

Image

Get the best image output immediately. Test DALL-E, Midjourney, and Stable Diffusion in parallel.

Video

Compare all video providers side-by-side. Find which model delivers the quality you need, faster.

Why not take this further and build a router?

v0 uses routing to stay fast.
Cursor needs it for reliability.
OpenAI built one for cost control.

The idea is simple.

IF simple query THEN fast model
ELSE complex model.
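
A minimal sketch of that rule in Python, assuming a crude length-and-keyword heuristic for "simple"; real routers use classifiers or learned policies, and the model names are just examples.

FAST_MODEL, STRONG_MODEL = "claude-haiku", "claude-opus"  # example names

def pick_model(query: str) -> str:
    # IF simple query THEN fast model ELSE complex model.
    looks_complex = len(query) > 200 or any(
        word in query.lower() for word in ("refactor", "debug", "prove")
    )
    return STRONG_MODEL if looks_complex else FAST_MODEL

print(pick_model("What's your refund policy?"))           # -> claude-haiku
print(pick_model("Refactor this module to use asyncio"))  # -> claude-opus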

Single Endpoint

Baseline: 100% of queries go to Opus. Track average latency and cost per 1K queries.

With Router

Queries are split across Haiku, Sonnet, and Opus. Compare average latency and cost per 1K queries against the baseline.
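
The comparison is just a weighted average over the traffic split. A sketch with invented prices and latencies:

# Invented per-model prices (USD per 1K queries) and latencies (seconds).
models = {"haiku": (1.0, 0.6), "sonnet": (6.0, 1.2), "opus": (30.0, 2.5)}
split = {"haiku": 0.60, "sonnet": 0.30, "opus": 0.10}  # router's traffic share

blended_cost = sum(share * models[m][0] for m, share in split.items())
blended_latency = sum(share * models[m][1] for m, share in split.items())
baseline_cost, baseline_latency = models["opus"]  # single endpoint: 100% Opus

print(f"router:   ${blended_cost:.2f} per 1K queries, {blended_latency:.2f}s avg latency")
print(f"baseline: ${baseline_cost:.2f} per 1K queries, {baseline_latency:.2f}s avg latency")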

Do you want to see the numbers first?

Narev has an open-source observability tool for LLM and cloud costs.