Configuring Metrics

Set up quality, latency, and cost metrics to evaluate your A/B tests.

Configure metrics to measure the performance of your A/B tests. The platform provides built-in quality evaluation methods and supports custom metrics for specialized use cases.

Metric Types

Your A/B tests automatically track three types of metrics:

  • Quality Metrics: Evaluate output correctness and task performance
  • Latency Metrics: Measure response time
  • Cost Metrics: Track token usage and API costs

While latency and cost metrics are tracked automatically, you need to configure quality metrics to evaluate how well your variants perform.

Built-in Quality Metrics

The platform provides seven built-in evaluation methods:

LLM Judge (Binary)

Requires: Expected outputs

Pass/fail quality evaluation using an LLM to compare actual outputs against expected results. Best for evaluating open-ended responses where exact matching isn't appropriate.
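
Conceptually, a binary judge assembles the task input, the variant's actual output, and the expected output into a grading prompt and asks an LLM for a pass/fail verdict. The platform's actual judge prompt and model are internal; the sketch below is purely illustrative.

# Illustrative only: not the platform's actual judge prompt or model.
def build_judge_prompt(task_input: str, actual: str, expected: str) -> str:
    """Assemble a pass/fail grading prompt for an LLM evaluator."""
    return (
        "You are grading a model response.\n"
        f"Task input: {task_input}\n"
        f"Expected output: {expected}\n"
        f"Actual output: {actual}\n"
        "Answer PASS if the actual output is equivalent to the expected output, "
        "otherwise answer FAIL."
    )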

LLM Judge (Ranking)

Requires: Nothing (works with any prompts)

Ranks variants by quality using LLM evaluation without needing expected outputs. Useful for comparing relative quality when you don't have ground truth data.

Exact Match

Requires: Expected outputs

Compares outputs character-by-character against expected results. Use this when you need strict matching, such as validating specific codes, IDs, or formatted text.
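
Exact match treats any character difference, including case and whitespace, as a failure. A minimal illustration of that behavior (not the platform's implementation):

def exact_match(actual: str, expected: str) -> bool:
    # Any difference in case, punctuation, or whitespace fails.
    return actual == expected

exact_match("SKU-1042", "SKU-1042")   # True
exact_match("sku-1042", "SKU-1042")   # False: case differs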

Fuzzy Match

Requires: Expected outputs

Flexible matching with substring detection and keyword overlap scoring. Better than exact match for natural language outputs where phrasing may vary but meaning should align.
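
A rough sketch of the idea behind fuzzy matching, assuming substring detection plus keyword-overlap scoring; the platform's exact scoring rules are internal:

def fuzzy_score(actual: str, expected: str) -> float:
    """Illustrative fuzzy match: full credit for a substring hit,
    otherwise the fraction of expected keywords found in the actual output."""
    actual_l, expected_l = actual.lower(), expected.lower()
    if expected_l in actual_l:
        return 1.0
    expected_words = set(expected_l.split())
    actual_words = set(actual_l.split())
    if not expected_words:
        return 0.0
    return len(expected_words & actual_words) / len(expected_words)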

Schema Validation

Requires: JSON Schema configuration

Validates that JSON outputs conform to a specified schema structure. It doesn't check values; it only validates the structure, types, and required fields.

To configure:

  1. Select "Schema Validation" from the available metrics
  2. Define your JSON Schema in the configuration panel
  3. Click "Validate Schema" to ensure it's valid

Example schema:

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "number" },
    "email": { "type": "string", "format": "email" }
  },
  "required": ["name", "email"]
}
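
To sanity-check outputs against the same schema locally before configuring the metric, you can use the jsonschema Python package, which mirrors the structural check (values themselves are not compared):

from jsonschema import validate, ValidationError  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "email": {"type": "string", "format": "email"},
    },
    "required": ["name", "email"],
}

# Passes: required fields are present with the right types.
validate({"name": "Ada", "email": "ada@example.com"}, schema)

try:
    # Fails: "email" is required but missing.
    validate({"name": "Ada", "age": 36}, schema)
except ValidationError as err:
    print(err.message)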

Values Match

Requires: Schema Validation + Expected outputs

Compares JSON values against expected outputs. You must enable Schema Validation first, as Values Match uses the same schema to parse and compare structured data.
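
Conceptually, Values Match parses both the actual and expected outputs and compares the resulting values. An illustrative sketch of that comparison (not the platform's implementation):

import json

def values_match(actual_json: str, expected_json: str) -> bool:
    # Illustrative: parse both outputs and compare the resulting values.
    return json.loads(actual_json) == json.loads(expected_json)

values_match('{"name": "Ada", "email": "ada@example.com"}',
             '{"email": "ada@example.com", "name": "Ada"}')   # True: key order is ignored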

Manual Evaluation

Requires: Nothing (works with any prompts)

Enables human review and scoring through the platform interface. Best for subjective quality assessment or when automated metrics aren't sufficient.

Expected Outputs

Several quality metrics require prompts with expected outputs:

  • LLM Judge (Binary)
  • Exact Match
  • Fuzzy Match
  • Values Match

When you upload prompts via file or import from traces, include an expected_output field. For manual entry, add expected outputs when creating prompts.
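
For example, an uploaded prompts file might pair each prompt with its expected output like this. Only the expected_output field name comes from this guide; the surrounding structure is illustrative and may differ from your platform's upload format.

[
  {
    "prompt": "Summarize the customer ticket in one sentence.",
    "expected_output": "The customer is requesting a refund for a duplicate charge."
  }
]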

If no prompts have expected outputs, metrics that require them will be disabled until you add some.

Custom Metrics

Create custom metrics for domain-specific evaluations and populate their values via the API.

Creating Custom Metrics

  1. Navigate to the Custom Metrics section
  2. Click "Create Custom Metric"
  3. Enter a metric name following these rules:
    • Must start with custom_
    • Use lowercase letters, numbers, and underscores only
    • Examples: custom_eslint_errors, custom_sentiment_score
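
If you generate metric names programmatically, a quick local check of the naming rule might look like this. It is illustrative only; the platform performs its own validation when you create the metric.

import re

# Must start with "custom_" and use only lowercase letters, numbers, and underscores.
CUSTOM_METRIC_NAME = re.compile(r"^custom_[a-z0-9_]+$")

CUSTOM_METRIC_NAME.match("custom_eslint_errors")   # matches
CUSTOM_METRIC_NAME.match("Custom-Sentiment")       # None: uppercase letters and hyphens are not allowed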

Submitting Custom Metric Values

Submit custom metric values using the API:

POST /api/applications/{application_id}/quality
Content-Type: application/json
 
{
  "response_id": "gen-...",
  "metric_name": "custom_eslint_errors",
  "value": 4
}

Parameters:

  • response_id: The ID of the response to score (from generation logs)
  • metric_name: Your custom metric name
  • value: Numeric score (higher is typically better)
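
A minimal sketch of submitting a value with Python's requests library. The base URL and application ID are placeholders, and authentication headers depend on your deployment, so they are left as a comment.

import requests

BASE_URL = "https://your-platform.example.com"   # placeholder: your deployment's base URL
APPLICATION_ID = "your-application-id"           # placeholder: your application's ID

resp = requests.post(
    f"{BASE_URL}/api/applications/{APPLICATION_ID}/quality",
    json={
        "response_id": "gen-...",                # taken from your generation logs
        "metric_name": "custom_eslint_errors",
        "value": 4,
    },
    # Add the authentication headers required by your deployment here.
)
resp.raise_for_status()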

Custom metrics appear in your A/B test results alongside built-in metrics.

You can enable multiple metrics simultaneously to evaluate different aspects of your variants' performance.