Benchmarking with the API

Use the Applications API to run A/B tests programmatically.

The Applications API provides an OpenAI-compatible endpoint that enables automatic benchmarking through your production code. Make requests with different configurations, and Narev automatically tracks and compares performance.

Quick Start

Replace your OpenAI base URL with your Narev A/B test endpoint:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_NAREV_API_KEY",
    base_url="https://narev.ai/api/applications/{benchmark_id}/v1"
)

response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

That's it! Narev will now add every prompt sent to this endpoint to your benchmark.

Selecting a model provider or gateway

Narev supports multiple AI providers/gateways through gateway prefixes:

{gateway}:{model_name}

Available Gateways:

  • openai - OpenAI direct
  • openrouter - OpenRouter aggregator
  • nvidia - NVIDIA NIM models
  • kilo - Kilo Code models
  • github - GitHub models

Examples:

  • openai:gpt-4 - OpenAI's GPT-4
  • openrouter:anthropic/claude-3-opus - Claude via OpenRouter
  • anthropic:claude-3-sonnet-20240229 - Direct Anthropic
  • openrouter:meta-llama/llama-3.1-70b-instruct - Llama via OpenRouter

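To make the `{gateway}:{model_name}` convention concrete, here is a small client-side sketch that splits a prefixed model string into its parts. This helper is hypothetical, not part of any Narev SDK; it simply mirrors the naming scheme described above, including the fallback to a default gateway when no prefix is given.

```python
def parse_model(model: str, default_gateway: str = "openai") -> tuple[str, str]:
    """Split a '{gateway}:{model_name}' string into (gateway, model_name).

    Hypothetical helper illustrating the prefix convention; if no prefix
    is present, fall back to default_gateway (assumed here to be 'openai').
    """
    gateway, sep, name = model.partition(":")
    if not sep:
        # No ':' found, so the whole string is the model name.
        return default_gateway, model
    return gateway, name
```

Note that `partition` splits on the first `:` only, so model names containing slashes (e.g. `openrouter:anthropic/claude-3-opus`) pass through intact.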
The same model accessed through different gateways is treated as a separate variant. This lets you compare:

  • Latency - Which gateway is faster?
  • Cost - Which is more economical?
  • Reliability - Which has better uptime?

# Test GPT-4 via OpenAI
response1 = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

# Test GPT-4 via OpenRouter
response2 = client.chat.completions.create(
    model="openrouter:openai/gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

Default behavior: If you omit the gateway prefix (e.g., just gpt-4), Narev defaults to the native provider. However, we recommend always using explicit gateway prefixes for clarity.

Tracking quality with metadata

Include expected outputs for a subset of your production requests to continuously monitor quality and ensure model changes don't harm accuracy.

You can also include custom metadata for filtering and analysis:

response = client.chat.completions.create(
    model="openai:gpt-4",
    messages=[{"role": "user", "content": prompt}],
    extra_body={
        "metadata": {
            "expected_output": "Expected answer here",
            "user_id": "user_123",
            "session_id": "session_456",
            "category": "customer_support"
        }
    }
)
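If you attach metadata to many requests, a small builder can keep the `extra_body` payload consistent. This is a hypothetical convenience function, not part of the API; the `build_metadata` name and its parameters are illustrative, assuming the payload shape shown above.

```python
def build_metadata(expected_output=None, **tags):
    """Assemble an extra_body payload in the shape shown above.

    Hypothetical helper: keyword arguments become metadata fields, and
    expected_output is included only when provided.
    """
    metadata = dict(tags)
    if expected_output is not None:
        metadata["expected_output"] = expected_output
    return {"metadata": metadata}
```

You could then pass `extra_body=build_metadata(expected_output="Paris", user_id="user_123")` to `client.chat.completions.create`, keeping the payload structure in one place.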

Still have questions? Ask on Discord