Manual Evaluations

Human-powered quality checks—like LLM Arena for your A/B tests.

What are manual evaluations?

Manual evaluation is a quality metric that relies on human judgment to assess response quality. Think of it as LLM Arena for your actual A/B tests.

When do you use manual evaluations?

Use manual evaluations when automated metrics aren't sufficient, or when you need subjective human judgment about quality, relevance, and usefulness.

How do manual evaluations work?

Reviewers compare variant responses side by side and score them against your criteria. Unlike automated metrics, manual evaluation doesn't require expected outputs: reviewers decide what "good" looks like for your use case.
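
To make the flow concrete, here is a minimal sketch in Python of collecting and aggregating pairwise reviewer verdicts. Everything in it (the ReviewerVote record, tally_win_rates, and the sample votes) is hypothetical and not tied to any particular SDK; it only illustrates human judgments being recorded and rolled up into per-variant win rates.

```python
# Hypothetical sketch of a pairwise manual-evaluation flow.
# ReviewerVote and tally_win_rates are illustrative names, not a real API.
from collections import Counter
from dataclasses import dataclass

@dataclass
class ReviewerVote:
    prompt: str          # the input both variants responded to
    variant_a: str       # response produced by variant A
    variant_b: str       # response produced by variant B
    winner: str          # "A", "B", or "tie", chosen by a human reviewer
    notes: str = ""      # optional reviewer comments on the criteria

def tally_win_rates(votes: list[ReviewerVote]) -> dict[str, float]:
    """Aggregate reviewer votes into per-variant win rates (ties excluded)."""
    counts = Counter(v.winner for v in votes if v.winner in ("A", "B"))
    total = sum(counts.values()) or 1
    return {variant: counts[variant] / total for variant in ("A", "B")}

votes = [
    ReviewerVote("Summarize this ticket", "Short, accurate summary", "Verbose summary", "A"),
    ReviewerVote("Draft a reply", "Polite but vague", "Polite and specific", "B"),
    ReviewerVote("Translate to French", "Correct", "Correct", "tie"),
]
print(tally_win_rates(votes))  # e.g. {'A': 0.5, 'B': 0.5}
```

In practice you would typically also record which reviewer cast each vote and randomize which variant appears first, so that presentation order doesn't bias the scores.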