Manual Evaluations
Set up human review to compare variant outputs and assess quality.
This guide covers how to set up manual evaluation workflows so human reviewers can compare and score variant outputs in your A/B tests.
How Manual Evaluations Work
Manual evaluation provides a simple arena-style interface where you review individual prompt-response pairs and rate them as either good or bad responses. This helps you assess quality through human judgment when automated metrics aren't sufficient.
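Conceptually, every verdict you submit ties one prompt-response pair to a binary rating. The sketch below is only an illustration of that shape; the names `ManualRating`, `prompt_id`, `variant`, and `is_good` are assumptions, not the product's actual data model.

```python
from dataclasses import dataclass


@dataclass
class ManualRating:
    """One human verdict on a single prompt-response pair (illustrative shape only)."""
    prompt_id: str   # hypothetical identifier for the prompt being reviewed
    variant: str     # hypothetical label for the variant that produced the response
    is_good: bool    # True = "Good Response", False = "Bad Response"
```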
Evaluation Interface
When you start a manual evaluation:
- Review the Request - See the complete prompt, including any conversation history (previous messages can be expanded/collapsed)
- Read the Response - Review the model's output for that prompt
- Rate the Quality - Click either:
  - Good Response - The output meets quality standards
  - Bad Response - The output is unsatisfactory
- Progress Automatically - After rating, the interface advances to the next prompt
The evaluation dialog shows your progress (e.g., "5 of 20") and allows you to navigate back to review or change previous ratings.
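As a rough mental model of how the dialog behaves, the sketch below keeps one slot per prompt, advances a cursor after each rating, and lets you jump back so that re-rating simply overwrites the earlier verdict. It is a hedged sketch of the behavior described above, not the product's implementation; all names are hypothetical.

```python
class EvaluationSession:
    """Illustrative model of the evaluation dialog: rate, auto-advance, revise."""

    def __init__(self, total_prompts: int):
        self.ratings: list[bool | None] = [None] * total_prompts
        self.cursor = 0  # index of the prompt currently shown

    def rate_current(self, is_good: bool) -> None:
        """Record a Good/Bad verdict and advance to the next prompt."""
        self.ratings[self.cursor] = is_good
        if self.cursor < len(self.ratings) - 1:
            self.cursor += 1

    def go_back(self, index: int) -> None:
        """Navigate to an earlier prompt; rating it again overwrites the old verdict."""
        self.cursor = index

    def progress(self) -> str:
        """Progress string in the same spirit as the dialog, e.g. '5 of 20'."""
        rated = sum(r is not None for r in self.ratings)
        return f"{rated} of {len(self.ratings)}"
```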
Finding Experiments to Evaluate
The Evaluations page lists all experiments that have manual evaluation metrics enabled. For each experiment, you'll see:
- Evaluation Progress - How many prompts have been evaluated out of the total
- Progress Bar - Visual indicator of completion status
- Action Button - "Evaluate" to start rating, or "Review" to revisit completed evaluations
Once all prompts are evaluated, you can use the manual evaluation scores as a quality metric when comparing variants in your A/B test results.
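One natural way to turn those ratings into a single number is the fraction of "Good Response" verdicts each variant received. The helper below is a hedged sketch of that calculation, reusing the hypothetical `ManualRating` shape from earlier; it is not the platform's actual scoring code.

```python
from collections import defaultdict


def good_response_rate(ratings: list[ManualRating]) -> dict[str, float]:
    """Fraction of prompts rated 'Good Response', per variant (illustrative only)."""
    good: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in ratings:
        total[r.variant] += 1
        good[r.variant] += int(r.is_good)
    return {variant: good[variant] / total[variant] for variant in total}
```

For example, if variant A receives 16 good ratings out of 20 and variant B receives 12 out of 20, their scores are 0.8 and 0.6, so variant A wins on this metric.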