Run your first experiment
Quick start guide to running your first LLM experiment with Narev.
Overview
This guide walks you through running your first experiment in Narev using the pre-configured HellaSwag Default Experiment. This experiment compares GPT-4o Mini against Claude-3.5 Haiku on the HellaSwag dataset, evaluating them across cost, latency, and quality metrics.
The HellaSwag dataset tests commonsense reasoning by presenting multiple-choice questions where models must predict the most likely continuation of a scenario.
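To give a feel for the data, here is a simplified sketch of a HellaSwag-style item in Python. The field names mirror the public dataset (ctx, endings, label), but the example text and the rendering step are invented purely for illustration.

```python
# A simplified HellaSwag-style item: a context plus four candidate endings,
# with the index of the correct continuation. Example text is invented.
item = {
    "ctx": "A man is standing on a ladder next to a house. He",
    "endings": [
        "begins painting the gutter along the roofline.",
        "jumps into a swimming pool below.",
        "starts playing a guitar solo.",
        "folds the ladder and walks away mid-air.",
    ],
    "label": 0,  # index of the correct ending
}

# Rendered as the multiple-choice question a model variant would see:
letters = "ABCD"
question = item["ctx"] + "\n" + "\n".join(
    f"{letters[i]}. {ending}" for i, ending in enumerate(item["endings"])
)
print(question)
```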
Starting Your First Experiment
Initial State
When you first log in to Narev, you'll see the experiment dashboard in its initial state with no results yet. Click the button to open the default experiment.
Understanding the Experiment Setup
Before running the experiment, you can review its configuration by clicking on the experiment details.
The experiment setup page shows:
Variants (2 configured)
GPT-4o Mini (openai/gpt-4o-mini)
- Temperature: 0.7
- System prompt configured for multiple-choice questions
Claude-3.5 Haiku (anthropic/claude-3-haiku)
- Temperature: 0.7
- Same system prompt for consistency
Evaluation Metrics (5 configured)
Cost Metrics
- Cost: Tracks API costs and token usage
Latency Metrics
- Total Latency: Measures total response time from request to completion
- Time to First Token: Measures time until the first token is received
- Tokens per Second: Measures token generation speed and throughput
Quality Metrics
- Rule-Based Quality: Evaluates quality using a predefined regex pattern
All variants use the same system prompt: "You are answering multiple choice questions. Always respond with ONLY the letter of the correct answer (A, B, C, or D). Do not include any explanation or additional text."
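As a rough sketch (plain Python data, not Narev's API), the setup above can be expressed as two variant records sharing one system prompt, and the Rule-Based Quality metric can be approximated by extracting the answer letter with a regex and comparing it to the expected answer. The exact pattern Narev uses may differ.

```python
import re

SYSTEM_PROMPT = (
    "You are answering multiple choice questions. Always respond with ONLY "
    "the letter of the correct answer (A, B, C, or D). Do not include any "
    "explanation or additional text."
)

# The two variants, expressed as plain data (illustrative, not Narev's API).
variants = [
    {"name": "GPT-4o Mini", "model": "openai/gpt-4o-mini",
     "temperature": 0.7, "system_prompt": SYSTEM_PROMPT},
    {"name": "Claude-3.5 Haiku", "model": "anthropic/claude-3-haiku",
     "temperature": 0.7, "system_prompt": SYSTEM_PROMPT},
]

# A rule-based quality check in the same spirit: pull out the answer letter
# with a regex and compare it against the expected answer.
ANSWER_PATTERN = re.compile(r"\b([ABCD])\b")

def rule_based_quality(response: str, expected: str) -> float:
    """Return 1.0 if the extracted letter matches the expected answer, else 0.0."""
    match = ANSWER_PATTERN.search(response.strip().upper())
    return 1.0 if match and match.group(1) == expected.upper() else 0.0

print(rule_based_quality("A", "a"))                # 1.0
print(rule_based_quality("The answer is B", "C"))  # 0.0
```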
Click "Run Experiment" in the top right to start the evaluation.
Experiment Execution
Once you start the experiment, it enters the execution queue.
The experiment status shows "Queued for execution" with a progress indicator at 0%. Narev will process each prompt across both model variants and collect performance data.
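To make the latency metrics concrete, here is a minimal sketch of how total latency, time to first token, and tokens per second can be derived from a streamed response. The stream_tokens generator is a hypothetical stand-in for a real streaming client; Narev's runner collects these timings for you during execution.

```python
import time

def stream_tokens():
    """Stand-in for a streaming LLM client: yields tokens with artificial delays."""
    for token in ["B", "<eos>"]:
        time.sleep(0.05)  # simulate network and generation delay
        yield token

start = time.monotonic()
first_token_at = None
token_count = 0

for token in stream_tokens():
    if first_token_at is None:
        first_token_at = time.monotonic()  # time to first token is measured here
    token_count += 1

total_latency = time.monotonic() - start         # request start to completion
time_to_first_token = first_token_at - start     # request start to first token
tokens_per_second = token_count / total_latency  # one common throughput convention

print(f"total latency: {total_latency:.3f}s, "
      f"TTFT: {time_to_first_token:.3f}s, "
      f"throughput: {tokens_per_second:.1f} tok/s")
```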
Calculating Results
After all prompts are processed, Narev calculates the evaluation metrics.
The status updates to "Calculating metrics..." at 100%. This phase aggregates results, computes statistics, and prepares the comparison analysis.
Experiment Completion
When processing is complete, the experiment status changes to "COMPLETE".
You now have access to comprehensive results and analysis. Several action buttons become available:
- Duplicate: Create a copy of this experiment configuration
- Edit & Rerun: Modify the variants or metrics, then run again
- Rerun: Execute the same experiment again
- Run Experiment: Start a new run
Analyzing Your Results
Scroll down the experiment page to see the experiment summary.
Experiment Impact Summary
The Experiment Impact section provides a high-level overview comparing the best and worst performing variants.
In this example, GPT-4o Mini shows significant improvements over Claude-3.5 Haiku:
- Cost: 49% improvement (lower cost per million requests)
- Latency: 13% improvement (faster response times)
- Quality: 33% improvement (better answer accuracy)
These metrics help you quickly identify which model variant performs better across different dimensions.
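The improvement figures are relative differences between the two variants. As an illustration with made-up numbers (the real values come from your experiment report), a 49% cost improvement means the winning variant costs roughly half as much per million requests:

```python
# Illustrative only: how a relative improvement percentage can be computed.
# The dollar figures below are invented, not taken from a real run.
cost_baseline = 64.0   # hypothetical cost per million requests (worse variant)
cost_winner = 32.6     # hypothetical cost per million requests (better variant)

improvement = (cost_baseline - cost_winner) / cost_baseline * 100
print(f"{improvement:.0f}% improvement")  # -> 49% improvement
```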
Detailed Variant Analysis
The Variant Analysis section provides deeper insights into performance tradeoffs.
Performance Comparison Charts:
- Blended Cost vs Latency: Shows the relationship between cost efficiency and response speed
- Blended Cost vs Quality: Illustrates the tradeoff between cost and output quality
- Quality vs Latency: Displays how quality correlates with response time
You can click "Set as Baseline" to make GPT-4o Mini the comparison baseline for future experiments.
Prompt-Level Performance
The Prompt Analysis view breaks down performance for individual prompts in your dataset.
This view shows detailed metrics for each prompt tested:
- Prompt: The system prompt and question asked
- Variants: Which models were tested (Claude-3.5 Haiku and GPT-4o Mini)
- Avg Cost: Average cost per million requests for this prompt
- Avg Quality: Quality score percentage for this prompt
- Avg Latency: Average response time in milliseconds
The analysis shows 5 prompts, all using the same system prompt for multiple-choice questions. Quality scores range from 0% to 100%, helping you identify which types of questions each model handles best.
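As a rough sketch of how a per-prompt view like this can be assembled (plain Python with invented numbers, not Narev's internals), per-request records are grouped by prompt and variant and then averaged:

```python
from collections import defaultdict
from statistics import mean

# Invented per-request records: (prompt_id, variant, cost, quality, latency_ms)
records = [
    ("p1", "gpt-4o-mini", 30.1, 1.0, 820),
    ("p1", "claude-3-haiku", 61.5, 1.0, 950),
    ("p2", "gpt-4o-mini", 29.8, 1.0, 790),
    ("p2", "claude-3-haiku", 60.9, 0.0, 900),
]

grouped = defaultdict(list)
for prompt_id, variant, cost, quality, latency in records:
    grouped[(prompt_id, variant)].append((cost, quality, latency))

for (prompt_id, variant), rows in sorted(grouped.items()):
    avg_cost = mean(r[0] for r in rows)
    avg_quality = mean(r[1] for r in rows) * 100  # expressed as a percentage
    avg_latency = mean(r[2] for r in rows)
    print(f"{prompt_id} / {variant}: cost={avg_cost:.1f}, "
          f"quality={avg_quality:.0f}%, latency={avg_latency:.0f}ms")
```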
Individual Prompt Details
Click the expand icon on any prompt to see the complete conversation and model responses.
This detailed view helps you understand:
- How each model interprets the same prompt
- Whether responses match expected answers
- Performance differences at the individual prompt level
- Which model is more cost-effective for specific types of questions
Next Steps
Now that you've successfully run your first experiment, you can:
- Create custom experiments with your own prompts and datasets
- Add more model variants to compare additional LLMs
- Configure custom evaluation metrics tailored to your use case
- Set up automated experiments to continuously evaluate model performance