A/B Testing with the API

Use the Applications API to run A/B tests programmatically.

The Applications API provides an OpenAI-compatible endpoint, so you can run A/B tests directly from your production code. Make requests with different configurations, and Narev automatically tracks and compares their performance.

Quick Start

Replace your OpenAI base URL with your Narev A/B test endpoint:

from openai import OpenAI
 
client = OpenAI(
    api_key="YOUR_NAREV_API_KEY",
    base_url="https://narev.ai/api/applications/{application_id}/v1"
)
 
response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

That's it! Narev now tracks this request and automatically creates variants for comparison.

Understanding Variants

A variant is a unique combination of:

  1. Model - Including the gateway used to access it
  2. System Prompt - Any system message you provide
  3. Parameters - Temperature, top_p, max_tokens, etc.
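
Taken together, these three pieces identify a variant. As a purely illustrative sketch (the field names below are hypothetical, not Narev's internal schema), you can picture a variant key like this:

# Hypothetical illustration of what uniquely identifies a variant.
variant_key = {
    "model": "openai:gpt-4",                          # model, including its gateway prefix
    "system_prompt": "You are a concise assistant.",  # system message, if any
    "parameters": {"temperature": 0.7, "max_tokens": 100},  # generation parameters
}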

Narev automatically creates variants based on your API requests. When you make a request:

  1. Narev checks if a variant exists with that exact configuration
  2. If found, it uses the existing variant
  3. If not found, it creates a new variant

Example:

# First request - creates Variant A
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
 
# Second request - creates Variant B (different model)
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7
)
 
# Third request - reuses Variant A (same configuration)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Different prompt"}],  # User prompt doesn't matter
    temperature=0.7
)

This automatic variant creation enables powerful A/B testing without changing your application logic. Just make API calls with different configurations, then compare performance in the Narev dashboard.
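
For example, you could split live traffic between two candidate configurations and let Narev attribute each request to the matching variant. The sketch below is illustrative only: the 50/50 split and the answer() helper are assumptions, not part of the Narev API.

import random

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_NAREV_API_KEY",
    base_url="https://narev.ai/api/applications/{application_id}/v1",
)

def answer(user_message: str) -> str:
    # Hypothetical 50/50 split between two candidate models; Narev records
    # each request under the variant that matches its configuration.
    model = random.choice(["openai:gpt-4", "anthropic:claude-3-opus-20240229"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
        temperature=0.7,
    )
    return response.choices[0].message.content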

Gateway Prefixes

Narev supports multiple AI providers through gateway prefixes:

{gateway}:{model_name}

Available Gateways:

  • openai - OpenAI direct
  • anthropic - Anthropic direct
  • openrouter - OpenRouter aggregator

Examples:

  • openai:gpt-4 - OpenAI's GPT-4
  • openrouter:anthropic/claude-3-opus - Claude via OpenRouter
  • anthropic:claude-3-sonnet-20240229 - Direct Anthropic
  • openrouter:meta-llama/llama-3.1-70b-instruct - Llama via OpenRouter

Why Gateway Prefixes Matter

The same model accessed through different gateways creates separate variants. This lets you compare:

  • Latency - Which gateway is faster?
  • Cost - Which is more economical?
  • Reliability - Which has better uptime?

# Test GPT-4 via OpenAI
response1 = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
 
# Test GPT-4 via OpenRouter
response2 = client.chat.completions.create(
    model="openrouter:openai/gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

Both requests create separate variants, enabling side-by-side comparison in your dashboard.

Default behavior: If you omit the gateway prefix (e.g., just gpt-4), Narev defaults to the native provider. However, we recommend always using explicit gateway prefixes for clarity.
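
For example (assuming gpt-4's native provider is OpenAI), both requests below reach the same model, but only the second makes the routing explicit:

# Without a prefix, Narev routes to the model's native provider.
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Same model, but the gateway is explicit -- recommended.
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)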

Comparing Models

Test different models by making requests with different model parameters:

# Variant 1: GPT-4
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
 
# Variant 2: Claude 3 Opus
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
 
# Variant 3: Llama 3.1
client.chat.completions.create(
    model="openrouter:meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

View comparative metrics (latency, cost, quality) in your Narev dashboard.

Testing System Prompts

System prompts are part of variant configuration, so different system messages create different variants:

# Variant 1: Concise assistant
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Keep responses under 50 words."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
 
# Variant 2: Detailed assistant
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "system", "content": "You are a detailed assistant. Provide comprehensive explanations."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)

Testing Parameters

Different generation parameters create different variants:

# Variant 1: Conservative (low temperature)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Write a product description"}],
    temperature=0.3,
    max_tokens=100
)
 
# Variant 2: Creative (high temperature)
client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": "Write a product description"}],
    temperature=0.9,
    max_tokens=100
)

Quality Evaluation

Include expected_output in metadata to enable automatic quality evaluation:

response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    extra_body={
        "metadata": {
            "expected_output": "Paris is the capital of France."
        }
    }
)

When expected_output is provided, Narev will:

  1. Automatically evaluate quality by comparing the model's response to the expected output
  2. Track quality metrics across variants in your dashboard
  3. Enable quality-based comparisons to find the best-performing configuration

Include expected outputs for a subset of your production requests to continuously monitor quality and ensure model changes don't harm accuracy.
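
One way to do this is to attach expected_output only when you have a reference answer and the request falls into a small evaluation sample. This is a sketch under those assumptions; the 5% sampling rate and the known_answer argument are hypothetical:

import random

SAMPLE_RATE = 0.05  # hypothetical: evaluate roughly 5% of production traffic

def ask(question: str, known_answer: str | None = None):
    # `client` is the OpenAI client configured with your Narev base URL (see Quick Start).
    extra_body = {}
    if known_answer is not None and random.random() < SAMPLE_RATE:
        extra_body = {"metadata": {"expected_output": known_answer}}

    return client.chat.completions.create(
        model="openai:gpt-4",
        messages=[{"role": "user", "content": question}],
        extra_body=extra_body,
    )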

You can also include custom metadata for filtering and analysis:

response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "metadata": {
            "expected_output": "Expected answer here",
            "user_id": "user_123",
            "session_id": "session_456",
            "category": "customer_support"
        }
    }
)

Production Variants

You can set a production variant in the Narev UI. When set:

  • Requests without a model use the production variant's configuration
  • Requests with a model create (or reuse) a variant with the configuration you specify

This allows you to:

  1. Use a stable production configuration by default
  2. Override it when testing new models or parameters

# Uses production variant (no model specified)
client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}]
)
 
# Creates new variant for testing (overrides production)
client.chat.completions.create(
    model="anthropic:claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}]
)

Next Steps

  • See the Applications API Reference for complete endpoint documentation
  • Learn about Data Sources for other ways to collect test data
  • View your variants and compare performance in the Narev dashboard