A/B Testing with the API

Use the Applications API to run A/B tests programmatically.

The Applications API provides an OpenAI-compatible endpoint, so you can run A/B tests directly from your production code. Make requests with different configurations, and Narev automatically tracks and compares their performance.

Quick Start

Replace your OpenAI base URL with your Narev A/B test endpoint:

from openai import OpenAI
 
client = OpenAI(
    api_key="YOUR_NAREV_API_KEY",
    base_url="https://narev.ai/api/applications/{application_id}/v1"
)
 
response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

That's it! Narev now tracks this request and automatically creates variants for comparison.

Understanding Variants

A variant is a unique combination of:

  1. Model - Including the gateway used to access it
  2. System Prompt - Any system message you provide
  3. Parameters - Temperature, top_p, max_tokens, etc.
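
Taken together, these three pieces identify a variant. As a purely illustrative sketch (the field names below are hypothetical, not Narev's internal schema), you can picture a variant key like this:

# Hypothetical illustration of what uniquely identifies a variant.
variant_key = {
    "model": "openai:gpt-4",                          # model, including its gateway prefix
    "system_prompt": "You are a concise assistant.",  # system message, if any
    "parameters": {"temperature": 0.7, "max_tokens": 100},  # generation parameters
}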

Narev automatically creates variants based on your API requests. When you make a request:

  1. Narev checks if a variant exists with that exact configuration
  2. If found, it uses the existing variant
  3. If not found, it creates a new variant

Example:

# First request - creates Variant A
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
 
# Second request - creates Variant B (different model)
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
 
# Third request - reuses Variant A (same configuration)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Different prompt"}],  # User prompt doesn't matter
    temperature=0.7
)

This automatic variant creation enables powerful A/B testing without changing your application logic. Just make API calls with different configurations, then compare performance in the Narev dashboard.
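
For example, you could split live traffic between two candidate configurations and let Narev attribute each request to the matching variant. The sketch below is illustrative only: the 50/50 split and the answer() helper are assumptions, not part of the Narev API.

import random

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_NAREV_API_KEY",
    base_url="https://narev.ai/api/applications/{application_id}/v1",
)

def answer(user_message: str) -> str:
    # Hypothetical 50/50 split between two candidate models; Narev records
    # each request under the variant that matches its configuration.
    model = random.choice(["openai:gpt-4", "anthropic:claude-3-opus-20240229"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        temperature=0.7,
    )
    return response.choices[0].message.content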

Gateway Prefixes

Narev supports multiple AI providers through gateway prefixes:

{gateway}:{model_name}

Available Gateways:

  • openai - OpenAI direct
  • anthropic - Anthropic direct
  • openrouter - OpenRouter aggregator

Examples:

  • openai:gpt-4 - OpenAI's GPT-4
  • openrouter:anthropic/claude-3-opus - Claude via OpenRouter
  • anthropic:claude-3-sonnet-20240229 - Direct Anthropic
  • openrouter:meta-llama/llama-3.1-70b-instruct - Llama via OpenRouter

Why Gateway Prefixes Matter

The same model accessed through different gateways creates separate variants. This lets you compare:

  • Latency - Which gateway is faster?
  • Cost - Which is more economical?
  • Reliability - Which has better uptime?

# Test GPT-4 via OpenAI
response1 = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
 
# Test GPT-4 via OpenRouter
response2 = client.chat.completions.create(
    model="openrouter:openai/gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

Both requests create separate variants, enabling side-by-side comparison in your dashboard.

Default behavior: If you omit the gateway prefix (e.g., just gpt-4), Narev defaults to the native provider. However, we recommend always using explicit gateway prefixes for clarity.
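
For example (assuming gpt-4's native provider is OpenAI), both requests below reach the same model, but only the second makes the routing explicit:

# Without a prefix, Narev routes to the model's native provider.
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Same model, but the gateway is explicit -- recommended.
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)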

Comparing Models

Test different models by making requests with different model parameters:

# Variant 1: GPT-4
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
 
# Variant 2: Claude 3 Opus
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
 
# Variant 3: Llama 3.1
client.chat.completions.create(
    model="openrouter:meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

View comparative metrics (latency, cost, quality) in your Narev dashboard.

Testing System Prompts

System prompts are part of variant configuration, so different system messages create different variants:

# Variant 1: Concise assistant
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Keep responses under 50 words."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
 
# Variant 2: Detailed assistant
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "system", "content": "You are a detailed assistant. Provide comprehensive explanations."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)

Testing Parameters

Different generation parameters create different variants:

# Variant 1: Conservative (low temperature)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Write a product description"}],
    temperature=0.3,
    max_tokens=100
)
 
# Variant 2: Creative (high temperature)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Write a product description"}],
    temperature=0.9,
    max_tokens=100
)

Quality Evaluation

Include expected_output in metadata to enable automatic quality evaluation:

response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "metadata": {
            "expected_output": "Paris is the capital of France."
        }
    }
)

When expected_output is provided, Narev will:

  1. Automatically evaluate quality by comparing the model's response to the expected output
  2. Track quality metrics across variants in your dashboard
  3. Enable quality-based comparisons to find the best-performing configuration

Include expected outputs for a subset of your production requests to continuously monitor quality and ensure model changes don't harm accuracy.
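
One way to do this is to attach expected_output only when you have a reference answer and the request falls into a small evaluation sample. This is a sketch under those assumptions; the 5% sampling rate and the known_answer argument are hypothetical:

import random

SAMPLE_RATE = 0.05  # hypothetical: evaluate roughly 5% of production traffic

def ask(question: str, known_answer: str | None = None):
    # `client` is the OpenAI client configured with your Narev base URL (see Quick Start).
    extra_body = {}
    if known_answer is not None and random.random() < SAMPLE_RATE:
        extra_body = {"metadata": {"expected_output": known_answer}}

    return client.chat.completions.create(
        model="openai:gpt-4",
        messages=[{"role": "user", "content": question}],
        extra_body=extra_body,
    )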

You can also include custom metadata for filtering and analysis:

response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "metadata": {
            "expected_output": "Expected answer here",
            "user_id": "user_123",
            "session_id": "session_456",
            "category": "customer_support"
        }
    }
)

Production Variants

You can set a production variant in the Narev UI. When set:

  • Requests without a model use the production variant's configuration
  • Requests with a model create (or reuse) a variant with the configuration you specify

This allows you to:

  1. Use a stable production configuration by default
  2. Override it when testing new models or parameters

# Uses production variant (no model specified)
client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}]
)
 
# Creates new variant for testing (overrides production)
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}]
)

Next Steps

  • See the Applications API Reference for complete endpoint documentation
  • Learn about Data Sources for other ways to collect test data
  • View your variants and compare performance in the Narev dashboard