Run your first experiment
Quick start guide to running your first LLM experiment with Narev.
Overview
This guide walks you through running your first experiment in Narev using the pre-configured HellaSwag Default Experiment. This experiment compares GPT-4o Mini against Claude-3.5 Haiku on the HellaSwag dataset, evaluating them across cost, latency, and quality metrics.
The HellaSwag dataset tests commonsense reasoning by presenting multiple-choice questions where models must predict the most likely continuation of a scenario.
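To give a feel for the data, here is a simplified sketch of a HellaSwag-style item in Python. The field names mirror the public dataset (ctx, endings, label), but the example text and the rendering step are invented purely for illustration.

```python
# A simplified HellaSwag-style item: a context plus four candidate endings,
# with the index of the correct continuation. Example text is invented.
item = {
    "ctx": "A man is standing on a ladder next to a house. He",
    "endings": [
        "begins painting the gutter along the roofline.",
        "jumps into a swimming pool below.",
        "starts playing a guitar solo.",
        "folds the ladder and walks away mid-air.",
    ],
    "label": 0,  # index of the correct ending
}

# Rendered as the multiple-choice question a model variant would see:
letters = "ABCD"
question = item["ctx"] + "\n" + "\n".join(
    f"{letters[i]}. {ending}" for i, ending in enumerate(item["endings"])
)
print(question)
```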
Starting Your First Experiment
Initial State
When you first log in to Narev, you'll see the experiment dashboard in its initial state with no results yet. Click the button to open the default experiment.
Understanding the Experiment Setup
Before running the experiment, you can review its configuration by clicking on the experiment details.
The experiment setup page shows:
Variants (2 configured)
GPT-4o Mini (openai/gpt-4o-mini)
- Temperature: 0.7
- System prompt configured for multiple-choice questions
Claude-3.5 Haiku (anthropic/claude-3-haiku)
- Temperature: 0.7
- Same system prompt for consistency
Evaluation Metrics (5 configured)
Cost Metrics
- Cost: Tracks API costs and token usage
Latency Metrics
- Total Latency: Measures total response time from request to completion
- Time to First Token: Measures time until the first token is received
- Tokens per Second: Measures token generation speed and throughput
Quality Metrics
- Rule-Based Quality: Evaluates quality using a predefined regex pattern
All variants use the same system prompt: "You are answering multiple choice questions. Always respond with ONLY the letter of the correct answer (A, B, C, or D). Do not include any explanation or additional text."
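As a rough sketch (plain Python data, not Narev's API), the setup above can be expressed as two variant records sharing one system prompt, and the Rule-Based Quality metric can be approximated by extracting the answer letter with a regex and comparing it to the expected answer. The exact pattern Narev uses may differ.

```python
import re

SYSTEM_PROMPT = (
    "You are answering multiple choice questions. Always respond with ONLY "
    "the letter of the correct answer (A, B, C, or D). Do not include any "
    "explanation or additional text."
)

# The two variants, expressed as plain data (illustrative, not Narev's API).
variants = [
    {"name": "GPT-4o Mini", "model": "openai/gpt-4o-mini",
     "temperature": 0.7, "system_prompt": SYSTEM_PROMPT},
    {"name": "Claude-3.5 Haiku", "model": "anthropic/claude-3-haiku",
     "temperature": 0.7, "system_prompt": SYSTEM_PROMPT},
]

# A rule-based quality check in the same spirit: pull out the answer letter
# with a regex and compare it against the expected answer.
ANSWER_PATTERN = re.compile(r"\b([ABCD])\b")

def rule_based_quality(response: str, expected: str) -> float:
    """Return 1.0 if the extracted letter matches the expected answer, else 0.0."""
    match = ANSWER_PATTERN.search(response.strip().upper())
    return 1.0 if match and match.group(1) == expected.upper() else 0.0

print(rule_based_quality("A", "a"))                # 1.0
print(rule_based_quality("The answer is B", "C"))  # 0.0
```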
Click "Run Experiment" in the top right to start the evaluation.
Experiment Execution
Once you start the experiment, it enters the execution queue.
The experiment status shows "Queued for execution" with a progress indicator at 0%. Narev will process each prompt across both model variants and collect performance data.
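To make the latency metrics concrete, here is a minimal sketch of how total latency, time to first token, and tokens per second can be derived from a streamed response. The stream_tokens generator is a hypothetical stand-in for a real streaming client; Narev's runner collects these timings for you during execution.

```python
import time

def stream_tokens():
    """Stand-in for a streaming LLM client: yields tokens with artificial delays."""
    for token in ["B", "<eos>"]:
        time.sleep(0.05)  # simulate network and generation delay
        yield token

start = time.monotonic()
first_token_at = None
token_count = 0

for token in stream_tokens():
    if first_token_at is None:
        first_token_at = time.monotonic()  # time to first token is measured here
    token_count += 1

total_latency = time.monotonic() - start         # request start to completion
time_to_first_token = first_token_at - start     # request start to first token
tokens_per_second = token_count / total_latency  # one common throughput convention

print(f"total latency: {total_latency:.3f}s, "
      f"TTFT: {time_to_first_token:.3f}s, "
      f"throughput: {tokens_per_second:.1f} tok/s")
```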
Calculating Results
After all prompts are processed, Narev calculates the evaluation metrics.
The status updates to "Calculating metrics..." at 100%. This phase aggregates results, computes statistics, and prepares the comparison analysis.
Experiment Completion
When processing is complete, the experiment status changes to "COMPLETE".
You now have access to comprehensive results and analysis. Several action buttons become available:
- Duplicate: Create a copy of this experiment configuration
- Edit & Rerun: Modify the variants or metrics, then run again
- Rerun: Execute the same experiment again
- Run Experiment: Start a new run
Analyzing Your Results
Scroll down the experiment page to see the experiment summary.
Experiment Impact Summary
The Experiment Impact section provides a high-level overview comparing the best and worst performing variants.
In this example, GPT-4o Mini shows significant improvements over Claude-3.5 Haiku:
- Cost: 49% improvement (lower cost per million requests)
- Latency: 13% improvement (faster response times)
- Quality: 33% improvement (better answer accuracy)
These metrics help you quickly identify which model variant performs better across different dimensions.
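The improvement figures are relative differences between the two variants. As an illustration with made-up numbers (the real values come from your experiment report), a 49% cost improvement means the winning variant costs roughly half as much per million requests:

```python
# Illustrative only: how a relative improvement percentage can be computed.
# The dollar figures below are invented, not taken from a real run.
cost_baseline = 64.0   # hypothetical cost per million requests (worse variant)
cost_winner = 32.6     # hypothetical cost per million requests (better variant)

improvement = (cost_baseline - cost_winner) / cost_baseline * 100
print(f"{improvement:.0f}% improvement")  # -> 49% improvement
```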
Detailed Variant Analysis
The Variant Analysis section provides deeper insights into performance tradeoffs.
Performance Comparison Charts:
- Blended Cost vs Latency: Shows the relationship between cost efficiency and response speed
- Blended Cost vs Quality: Illustrates the tradeoff between cost and output quality
- Quality vs Latency: Displays how quality correlates with response time
You can click "Set as Baseline" to make GPT-4o Mini the comparison baseline for future experiments.
Prompt-Level Performance
The Prompt Analysis view breaks down performance for individual prompts in your dataset.
This view shows detailed metrics for each prompt tested:
- Prompt: The system prompt and question asked
- Variants: Which models were tested (Claude-3.5 Haiku and GPT-4o Mini)
- Avg Cost: Average cost per million requests for this prompt
- Avg Quality: Quality score percentage for this prompt
- Avg Latency: Average response time in milliseconds
The analysis shows 5 prompts, all using the same system prompt for multiple-choice questions. Quality scores range from 0% to 100%, helping you identify which types of questions each model handles best.
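As a rough sketch of how a per-prompt view like this can be assembled (plain Python with invented numbers, not Narev's internals), per-request records are grouped by prompt and variant and then averaged:

```python
from collections import defaultdict
from statistics import mean

# Invented per-request records: (prompt_id, variant, cost, quality, latency_ms)
records = [
    ("p1", "gpt-4o-mini", 30.1, 1.0, 820),
    ("p1", "claude-3-haiku", 61.5, 1.0, 950),
    ("p2", "gpt-4o-mini", 29.8, 1.0, 790),
    ("p2", "claude-3-haiku", 60.9, 0.0, 900),
]

grouped = defaultdict(list)
for prompt_id, variant, cost, quality, latency in records:
    grouped[(prompt_id, variant)].append((cost, quality, latency))

for (prompt_id, variant), rows in sorted(grouped.items()):
    avg_cost = mean(r[0] for r in rows)
    avg_quality = mean(r[1] for r in rows) * 100  # expressed as a percentage
    avg_latency = mean(r[2] for r in rows)
    print(f"{prompt_id} / {variant}: cost={avg_cost:.1f}, "
          f"quality={avg_quality:.0f}%, latency={avg_latency:.0f}ms")
```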
Individual Prompt Details
Click the expand icon on any prompt to see the complete conversation and model responses.
This detailed view helps you understand:
- How each model interprets the same prompt
- Whether responses match expected answers
- Performance differences at the individual prompt level
- Which model is more cost-effective for specific types of questions
Next Steps
Now that you've successfully run your first experiment, you can:
- Create custom experiments with your own prompts and datasets
- Add more model variants to compare additional LLMs
- Configure custom evaluation metrics tailored to your use case
- Set up automated experiments to continuously evaluate model performance