A/B Testing with the API
Use the Applications API to run A/B tests programmatically.
The Applications API provides an OpenAI-compatible endpoint that enables automatic A/B testing through your production code. Make requests with different configurations, and Narev automatically tracks and compares performance.
Quick Start
Replace your OpenAI base URL with your Narev A/B test endpoint:
from openai import OpenAI
client = OpenAI(
    api_key="YOUR_NAREV_API_KEY",
    base_url="https://narev.ai/api/applications/{application_id}/v1"
)
response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
That's it! Narev now tracks this request and automatically creates variants for comparison.
Understanding Variants
A variant is a unique combination of:
- Model - Including the gateway used to access it
- System Prompt - Any system message you provide
- Parameters - Temperature, top_p, max_tokens, etc.
Narev automatically creates variants based on your API requests. When you make a request:
- Narev checks if a variant exists with that exact configuration
- If found, it uses the existing variant
- If not found, it creates a new variant
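To make the matching rule concrete, here is a rough sketch of how "same configuration" can be pictured: a key built from the model, the system prompt, and the generation parameters. This is purely illustrative and is not Narev's actual implementation.
# Illustrative only - not Narev's internals. A variant's identity is the
# combination of model, system prompt, and generation parameters.
def variant_key(model, system_prompt, params):
    return (model, system_prompt, tuple(sorted(params.items())))
variants = {}
def get_or_create_variant(model, system_prompt, params):
    key = variant_key(model, system_prompt, params)
    if key not in variants:
        variants[key] = f"variant_{len(variants) + 1}"  # new configuration -> new variant
    return variants[key]  # same configuration -> existing variant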
Example:
# First request - creates Variant A
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
# Second request - creates Variant B (different model)
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
# Third request - reuses Variant A (same configuration)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Different prompt"}],  # the user prompt is not part of the variant
    temperature=0.7
)
This automatic variant creation enables powerful A/B testing without changing your application logic. Just make API calls with different configurations, then compare performance in the Narev dashboard.
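For example, one common pattern is to split live traffic across a few configurations directly in your application code. The helper below is a minimal sketch; the configuration list and the answer function are made up for illustration and are not part of the API.
import random
# Hypothetical configurations to compare; each combination becomes its own variant in Narev.
CONFIGS = [
    {"model": "openai:gpt-4", "temperature": 0.7},
    {"model": "anthropic:claude-3-opus-20240229", "temperature": 0.7},
]
def answer(client, user_message):
    config = random.choice(CONFIGS)  # split traffic evenly across configurations
    return client.chat.completions.create(
        model=config["model"],
        temperature=config["temperature"],
        messages=[{"role": "user", "content": user_message}],
    )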
Gateway Prefixes
Narev supports multiple AI providers through gateway prefixes:
{gateway}:{model_name}
Available Gateways:
- openai - OpenAI direct
- anthropic - Anthropic direct
- openrouter - OpenRouter aggregator
Examples:
- openai:gpt-4 - OpenAI's GPT-4
- openrouter:anthropic/claude-3-opus - Claude via OpenRouter
- anthropic:claude-3-sonnet-20240229 - Direct Anthropic
- openrouter:meta-llama/llama-3.1-70b-instruct - Llama via OpenRouter
Why Gateway Prefixes Matter
The same model accessed through different gateways creates separate variants. This lets you compare:
- Latency - Which gateway is faster?
- Cost - Which is more economical?
- Reliability - Which has better uptime?
# Test GPT-4 via OpenAI
response1 = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
# Test GPT-4 via OpenRouter
response2 = client.chat.completions.create(
    model="openrouter:openai/gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
Both requests create separate variants, enabling side-by-side comparison in your dashboard.
Default behavior: If you omit the gateway prefix (e.g., just gpt-4), Narev defaults to the native provider.
However, we recommend always using explicit gateway prefixes for clarity.
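For example, these two requests target the same underlying model, but only the second names the gateway explicitly. A minimal sketch, reusing the client from the Quick Start:
# Implicit: no gateway prefix, so Narev falls back to the native provider
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
# Explicit (recommended): the gateway is unambiguous
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)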
Comparing Models
Test different models by making requests with different model parameters:
# Variant 1: GPT-4
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# Variant 2: Claude 3 Opus
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
# Variant 3: Llama 3.1
client.chat.completions.create(
    model="openrouter:meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
View comparative metrics (latency, cost, quality) in your Narev dashboard.
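If you are comparing several models from a test script, the same calls can be written as a loop. A small sketch, using the models from the examples above:
models_to_test = [
    "openai:gpt-4",
    "anthropic:claude-3-opus-20240229",
    "openrouter:meta-llama/llama-3.1-70b-instruct",
]
for model in models_to_test:
    client.chat.completions.create(
        model=model,  # each model becomes its own variant
        messages=[{"role": "user", "content": "Explain quantum computing"}]
    )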
Testing System Prompts
System prompts are part of variant configuration, so different system messages create different variants:
# Variant 1: Concise assistant
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Keep responses under 50 words."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
# Variant 2: Detailed assistant
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "system", "content": "You are a detailed assistant. Provide comprehensive explanations."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
Testing Parameters
Different generation parameters create different variants:
# Variant 1: Conservative (low temperature)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Write a product description"}],
    temperature=0.3,
    max_tokens=100
)
# Variant 2: Creative (high temperature)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Write a product description"}],
    temperature=0.9,
    max_tokens=100
)
Quality Evaluation
Include expected_output in metadata to enable automatic quality evaluation:
response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "metadata": {
            "expected_output": "Paris is the capital of France."
        }
    }
)
When expected_output is provided, Narev will:
- Automatically evaluate quality by comparing the model's response to the expected output
- Track quality metrics across variants in your dashboard
- Enable quality-based comparisons to find the best-performing configuration
Include expected outputs for a subset of your production requests to continuously monitor quality and ensure model changes don't harm accuracy.
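One way to do this is to attach expected_output to only a sampled slice of traffic. The helper below is a minimal sketch; the ask function, the 10% sample rate, and the known_answer argument are illustrative assumptions, not part of the Narev API.
import random
SAMPLE_RATE = 0.10  # evaluate roughly 10% of traffic (illustrative choice)
def ask(client, question, known_answer=None):
    extra_body = {}
    if known_answer is not None and random.random() < SAMPLE_RATE:
        extra_body = {"metadata": {"expected_output": known_answer}}
    return client.chat.completions.create(
        model="openai:gpt-4",
        messages=[{"role": "user", "content": question}],
        extra_body=extra_body,
    )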
You can also include custom metadata for filtering and analysis:
response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "metadata": {
            "expected_output": "Expected answer here",
            "user_id": "user_123",
            "session_id": "session_456",
            "category": "customer_support"
        }
    }
)
Production Variants
You can set a production variant in the Narev UI. When set:
- Requests without a model use the production variant's configuration
- Requests with a model create/use a new variant with your specified model
This allows you to:
- Use a stable production configuration by default
- Override it when testing new models or parameters
# Uses production variant (no model specified)
client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}]
)
# Creates new variant for testing (overrides production)
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}]
)
Next Steps
- See the Applications API Reference for complete endpoint documentation
- Learn about Data Sources for other ways to collect test data
- View your variants and compare performance in the Narev dashboard