Configuring Metrics
Set up quality, latency, and cost metrics to evaluate your A/B tests.
Configure metrics to measure the performance of your A/B tests. The platform provides built-in quality evaluation methods and supports custom metrics for specialized use cases.
Metric Types
Your A/B tests automatically track three types of metrics:
- Quality Metrics: Evaluate output correctness and performance
- Latency Metrics: Measure response time
- Cost Metrics: Track token usage and API costs
While latency and cost metrics are tracked automatically, you need to configure quality metrics to evaluate how well your variants perform.
Built-in Quality Metrics
The platform provides seven built-in evaluation methods:
LLM Judge (Binary)
Requires: Expected outputs
Pass/fail quality evaluation using an LLM to compare actual outputs against expected results. Best for evaluating open-ended responses where exact matching isn't appropriate.
LLM Judge (Ranking)
Requires: Nothing (works with any prompts)
Ranks variants by quality using LLM evaluation without needing expected outputs. Useful for comparing relative quality when you don't have ground truth data.
Exact Match
Requires: Expected outputs
Compares outputs character-by-character against expected results. Use this when you need strict matching, such as validating specific codes, IDs, or formatted text.
Fuzzy Match
Requires: Expected outputs
Flexible matching with substring detection and keyword overlap scoring. Better than exact match for natural language outputs where phrasing may vary but meaning should align.
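The exact scoring formula isn't documented here, but substring detection plus keyword-overlap scoring works roughly along these lines (an illustrative Python sketch, not the built-in implementation; the function name and weighting are assumptions):

```python
# Illustrative only: rough idea of fuzzy matching via substring detection
# plus keyword overlap. The platform's built-in scoring may differ.
import re

def fuzzy_score(output: str, expected: str) -> float:
    out, exp = output.lower().strip(), expected.lower().strip()
    if exp in out or out in exp:                  # substring detection
        return 1.0
    out_words = set(re.findall(r"\w+", out))
    exp_words = set(re.findall(r"\w+", exp))
    if not exp_words:
        return 0.0
    return len(out_words & exp_words) / len(exp_words)   # keyword overlap

# A reworded but equivalent answer scores 1.0 here, where Exact Match would score 0.
print(fuzzy_score("The capital of France is Paris.", "Paris is the capital of France"))
```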
Schema Validation
Requires: JSON Schema configuration
Validates that JSON outputs conform to a specified schema structure. Doesn't check values, only validates the structure, types, and required fields.
To configure:
- Select "Schema Validation" from the available metrics
- Define your JSON Schema in the configuration panel
- Click "Validate Schema" to ensure it's valid
Example schema:
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "number" },
    "email": { "type": "string", "format": "email" }
  },
  "required": ["name", "email"]
}
Values Match
Requires: Schema Validation + Expected outputs
Compares JSON values against expected outputs. You must enable Schema Validation first, since Values Match uses the same schema to parse and compare structured data.
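To make the relationship between the two metrics concrete, here is a rough sketch of structure checking followed by value comparison using the jsonschema library; the platform's internal logic isn't shown in these docs, and the simple field-by-field comparison below is an assumption:

```python
# Conceptual sketch, not the platform's implementation: Schema Validation
# checks structure against the configured JSON Schema; Values Match then
# compares the parsed fields with the expected output.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "email": {"type": "string", "format": "email"},
    },
    "required": ["name", "email"],
}

def evaluate(output: str, expected: str) -> dict:
    data = json.loads(output)
    try:
        validate(instance=data, schema=schema)    # Schema Validation: structure only
        schema_ok = True
    except ValidationError:
        schema_ok = False
    expected_data = json.loads(expected)
    values_ok = schema_ok and all(                # Values Match: compare field values
        data.get(key) == value for key, value in expected_data.items()
    )
    return {"schema_valid": schema_ok, "values_match": values_ok}

print(evaluate('{"name": "Ada", "email": "ada@example.com"}',
               '{"name": "Ada", "email": "ada@example.com"}'))
```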
Manual Evaluation
Requires: Nothing (works with any prompts)
Enables human review and scoring through the platform interface. Best for subjective quality assessment or when automated metrics aren't sufficient.
Expected Outputs
Several quality metrics require prompts with expected outputs:
- LLM Judge (Binary)
- Exact Match
- Fuzzy Match
- Values Match
When you upload prompts via file or import from traces, include an expected_output field. For manual entry, add expected outputs when creating prompts.
If no prompts have expected outputs, metrics that require them will be disabled until you add some.
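For example, a prompt file with expected outputs might be assembled like this; only the expected_output field is named in these docs, so the prompt field and the JSON Lines layout are assumptions about your upload format:

```python
# Sketch of building a prompt file that includes expected outputs.
# Only "expected_output" is the documented field name; "prompt" and the
# JSON Lines layout are assumptions for illustration.
import json

prompts = [
    {
        "prompt": "Extract the customer's name and email from the message below.",
        "expected_output": '{"name": "Ada Lovelace", "email": "ada@example.com"}',
    },
    {
        "prompt": "Classify the sentiment of: 'Great product, fast shipping.'",
        "expected_output": "positive",
    },
]

with open("prompts.jsonl", "w", encoding="utf-8") as f:
    for item in prompts:
        f.write(json.dumps(item) + "\n")
```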
Custom Metrics
Create custom metrics for domain-specific evaluations that you populate via API.
Creating Custom Metrics
- Navigate to the Custom Metrics section
- Click "Create Custom Metric"
- Enter a metric name following these rules (see the sketch after this list):
  - Must start with custom_
  - Use lowercase letters, numbers, and underscores only
  - Example: custom_eslint_errors, custom_sentiment_score
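If you create metric names programmatically, a quick local check against these rules might look like the following (a hypothetical helper, not part of the platform's API):

```python
# Hypothetical local check that mirrors the naming rules above; the platform
# performs its own validation when you create the metric.
import re

METRIC_NAME_PATTERN = re.compile(r"^custom_[a-z0-9_]+$")

def is_valid_metric_name(name: str) -> bool:
    return METRIC_NAME_PATTERN.fullmatch(name) is not None

print(is_valid_metric_name("custom_eslint_errors"))   # True
print(is_valid_metric_name("Custom-ESLint-Errors"))   # False
```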
Submitting Custom Metric Values
Submit custom metric values using the API:
POST /api/applications/{application_id}/quality
Content-Type: application/json
{
  "response_id": "gen-...",
  "metric_name": "custom_eslint_errors",
  "value": 4
}
Parameters:
- response_id: The ID of the response to score (from generation logs)
- metric_name: Your custom metric name
- value: Numeric score (higher is typically better)
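For example, a script that computes a score after each generation might submit it with Python's requests library; the base URL, authentication header, and IDs below are placeholders to replace with your own values:

```python
# Sketch of submitting a custom metric value to the quality endpoint.
# BASE_URL, the Authorization header, and the IDs are placeholders.
import requests

BASE_URL = "https://your-platform.example.com"
APPLICATION_ID = "your-application-id"

response = requests.post(
    f"{BASE_URL}/api/applications/{APPLICATION_ID}/quality",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme is an assumption
    json={
        "response_id": "gen-...",              # from your generation logs
        "metric_name": "custom_eslint_errors",
        "value": 4,
    },
)
response.raise_for_status()
```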
Custom metrics appear in your A/B test results alongside built-in metrics.
You can enable multiple metrics simultaneously to evaluate different aspects of your variants' performance.