Manual Evaluations

Set up human review to compare variant outputs and assess quality.

This guide covers how to set up manual evaluation workflows so human reviewers can compare and score variant outputs in your A/B tests.

How Manual Evaluations Work

Manual evaluation provides a simple arena-style interface where you review individual prompt-response pairs and rate each response as good or bad. This lets you assess quality through human judgment when automated metrics aren't sufficient.
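
Conceptually, each item under review is a prompt-response pair carrying an optional binary rating. The TypeScript sketch below is illustrative only; the type and field names are hypothetical and not part of the product's API.

```typescript
// Hypothetical shape of a single manual-evaluation item.
// Names are illustrative, not the product's actual data model.
type Rating = "good" | "bad";

interface EvaluationItem {
  promptId: string;   // identifies the prompt-response pair
  history: string[];  // prior messages, shown as expandable history
  prompt: string;     // the request under review
  response: string;   // the variant's output
  rating?: Rating;    // unset until a reviewer rates it
}
```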

Evaluation Interface

When you start a manual evaluation:

  1. Review the Request - See the complete prompt, including any conversation history (previous messages can be expanded/collapsed)
  2. Read the Response - Review the model's output for that prompt
  3. Rate the Quality - Click either:
    • Good Response - The output meets quality standards
    • Bad Response - The output is unsatisfactory
  4. Progress Automatically - After rating, the interface advances to the next prompt

The evaluation dialog shows your progress (e.g., "5 of 20") and allows you to navigate back to review or change previous ratings.
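
The dialog's flow (rate, advance, step back, revise) can be pictured as a small session object. The sketch below builds on the hypothetical EvaluationItem type above and only illustrates the flow, not the actual implementation.

```typescript
// Minimal sketch of the evaluation dialog's flow, assuming the
// hypothetical EvaluationItem and Rating types from the earlier sketch.
class EvaluationSession {
  private index = 0;

  constructor(private items: EvaluationItem[]) {}

  /** Progress label such as "5 of 20". */
  get progress(): string {
    return `${this.index + 1} of ${this.items.length}`;
  }

  /** Rate the current item, then advance to the next prompt. */
  rate(rating: Rating): void {
    this.items[this.index].rating = rating;
    if (this.index < this.items.length - 1) this.index += 1;
  }

  /** Navigate back to review or change a previous rating. */
  back(): void {
    if (this.index > 0) this.index -= 1;
  }
}
```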

Finding Experiments to Evaluate

The Evaluations page lists all experiments that have manual evaluation metrics enabled. For each experiment, you'll see:

  • Evaluation Progress - How many prompts have been evaluated out of the total (see the sketch after this list)
  • Progress Bar - Visual indicator of completion status
  • Action Button - "Evaluate" to start rating, or "Review" to revisit completed evaluations
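
The progress shown for each experiment reduces to a count of rated items over the total. A minimal sketch, again assuming the hypothetical EvaluationItem type from earlier:

```typescript
// Hypothetical helper mirroring the progress display on the Evaluations page.
function evaluationProgress(items: EvaluationItem[]) {
  const done = items.filter((item) => item.rating !== undefined).length;
  const total = items.length;
  const percent = total === 0 ? 0 : Math.round((done / total) * 100);
  return { done, total, percent }; // e.g. { done: 5, total: 20, percent: 25 }
}
```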

Once all prompts are evaluated, you can use the manual evaluation scores as a quality metric when comparing variants in your A/B test results.
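
A common way to turn the binary ratings into a comparable score is the fraction of responses rated good, computed per variant. The sketch below illustrates that aggregation under the same hypothetical types; it may not match how the product computes its metric.

```typescript
// Illustrative aggregation: share of rated responses marked "good".
function manualQualityScore(items: EvaluationItem[]): number {
  const rated = items.filter((item) => item.rating !== undefined);
  if (rated.length === 0) return 0; // no ratings yet
  const good = rated.filter((item) => item.rating === "good").length;
  return good / rated.length; // 0.0 (all bad) to 1.0 (all good)
}
```

Comparing this score across variants gives you a human-judged quality signal alongside any automated metrics.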