How to choose an LLM (with labelled data)
Guide to selecting the best LLM for your product (with labelled data)
There are hundreds of models available. This guide helps you find the right one for your product. It assumes you have labelled data (examples or past usage) to base the decision on. If your data is not labelled, check this guide instead.
Introduction to Custom Benchmarks
There is no shortcut. You need to run a benchmark or an A/B test to be able to tell which model to use.
Here we'll walk through how to set it up:
- We'll send your example prompts to a State of the Art model to get a baseline, using a Jupyter notebook
- We'll use the Narev platform to find a model that performs just like the baseline, but with lower cost and better latency
- We'll deploy the optimal model as a production model
But before going there, we'll address a very common question...
Why not use an LLM leaderboard instead?
LLM leaderboards show results from academic benchmarks. There are three reasons not to use academic benchmarks when making your decision on which model to use:
- Academic benchmarks are a poor approximation of your application. The leaderboards show the scores LLMs get on a standardized test. The promise the benchmark authors make is that if a model performs well on the standardized test, it will also perform well in your production workload. That promise doesn't always hold. We've written about it on our blog, where we show how a 3-year-old model matches the performance of the current state-of-the-art model (despite drastically different benchmark results).
- Academic benchmarks are not repeatable. The results of a benchmark vary depending on who runs it. For example, GPT-4's score on the MMLU-Pro benchmark varies from 72.6% to 85.7%, depending on which leaderboard you check. We wrote extensively about it on our blog.
- Academic benchmarks are being gamed. For example, Chatbot Arena rankings can be manipulated, as described in a major study by Singh et al. (2025). Data contamination is rampant (models trained on the benchmark questions themselves). Read more about it in our Docs on A/B testing.
A benchmark you run yourself, on your own data, fully reflects your use case and can't be gamed. It's worth setting up.
Before you begin
To complete this guide, you'll need:
- A Narev account (sign up here) and a Narev API key (get one here)
- Python 3.8+ installed locally or access to Google Colab/Kaggle
- 50-100 example prompts representative of your production use case
- Basic familiarity with Python and Jupyter notebooks
- ~30 minutes to complete the setup and benchmarking
If you prefer not to use Python, you can upload prompts directly via the Narev dashboard or import from tracing systems. See data source options.
Setting up your benchmark
Part 1: Create an A/B test endpoint on Narev platform
Create a new application and select Live Test as your data source. You should get a URL that looks like this:
https://www.narev.ai/api/applications/<YOUR_TEST_ID>/v1
This endpoint works like any OpenAI-compatible API gateway (think OpenAI, OpenRouter, LiteLLM, Helicone, etc.) and has ~500 models built in.
In addition to returning a response, it also saves your request for an A/B test. Neat.
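For example, if you already use the OpenAI Python SDK, pointing it at this base URL should be enough. This is a minimal sketch, assuming the gateway accepts standard OpenAI-style chat completion requests; the placeholders and the model identifier are illustrative:

from openai import OpenAI

# Point the standard OpenAI client at the Narev endpoint (sketch; replace the
# placeholders with your test ID and API key)
client = OpenAI(
    base_url="https://www.narev.ai/api/applications/<YOUR_TEST_ID>/v1",
    api_key="<YOUR_NAREV_API_KEY>",
)

response = client.chat.completions.create(
    model="openrouter:google/gemini-3-pro-preview",  # illustrative model identifier
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

In the rest of this guide we'll use plain requests calls instead, so we can attach extra metadata to each request.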
Part 2: Open and prepare the environment
We created short Kaggle and Google Colab notebooks to help you get started. In this notebook, we'll use the GSM8K benchmark data from HuggingFace.
We start by importing the libraries and setting the credentials:
# Imports (install pandas, requests and datasets first if needed)
import pandas as pd
import requests
from datasets import load_dataset
from kaggle_secrets import UserSecretsClient # or from google.colab import userdata
# Secrets
user_secrets = UserSecretsClient()
NAREV_API_KEY = user_secrets.get_secret("narev-api-key") # or userdata.get('narev-api-key')
NAREV_BASE_URL = "https://www.narev.ai/api/applications/cmj6nt7g40002i1qymlqvct4z/v1" # TODO: replace with your URL from Part 1
We'll also define the system prompt and the state-of-the-art model to compare against.
# Variables
SYSTEM_PROMPT="""Solve the following math problem. Only include the number in your answer.
Give your final answer as a number.
Do not include any reasoning"""
STATE_OF_ART_MODEL="openrouter:google/gemini-3-pro-preview"
The system prompt is very simple here. We can iterate on the system prompts in the Narev platform to optimize further.
Finally, we'll load the GSM8K dataset. Here you should load the actual prompts that you would use in your application.
dataset = load_dataset("openai/gsm8k", "main")
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])
# take the first 50 prompts from the train split
df_train = df_train.head(50)
# prepare expected output
df_train['expected_output'] = df_train['answer'].str.extract(r'####\s*(-?\d+(?:,\d+)*(?:\.\d+)?)')
df_train['expected_output'] = df_train['expected_output'].str.replace(',', '', regex=False).astype(float)
print(f"First question:\n\n{df_train.iloc[0]['question']}")
print(f"Expected output:\n\n{df_train.iloc[0]['expected_output']}")Last thing is defining a helper function that will help us send the requests to Narev:
def send_to_llm_with_expected_output(question: str, expected_output: str) -> dict:
    """Send one prompt to Narev along with its expected output."""
    messages = []
    if SYSTEM_PROMPT:
        messages.append({"role": "system", "content": SYSTEM_PROMPT})
    messages.append({"role": "user", "content": question})
    try:
        response = requests.post(
            url=f"{NAREV_BASE_URL}/chat/completions",
            headers={
                "Authorization": f"Bearer {NAREV_API_KEY}",
                "Content-Type": "application/json"
            },
            json={
                "model": STATE_OF_ART_MODEL,
                "messages": messages,
                # the expected output is stored as metadata for the A/B test
                "metadata": {"expected_output": expected_output}
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()
    except Exception as e:
        print(e)
        return {}
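Before batching all 50 prompts, you can sanity-check the helper with a single call. The sample question below is illustrative, not from GSM8K:

# Quick sanity check with a single illustrative prompt
sample = send_to_llm_with_expected_output("What is 12 * 7?", "84")
if sample.get("choices"):
    print(sample["choices"][0]["message"]["content"])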
Part 3: Send the requests to Narev
All that's left to do here is to send the requests to Narev, using the State of the Art model.
for index, row in df_train.iterrows():
    question = row['question']
    expected_output = str(row['expected_output'])
    print(f"Processing question: {question}\n")
    print(f"With expected output: {expected_output}\n")
    response = send_to_llm_with_expected_output(question, expected_output)
    if response.get('choices'):
        print("Success\n\n")
    else:
        print("Failed to send a request\n\n")
Optimization in the Narev platform
Part 1: Select test Variants
When you head to the Platform -> Variants page, you should see an automatically created variant based on your requests. Behind the scenes, the Narev platform took your requests to Gemini 3 Pro (which we'll set as the Source of Truth) and created prompts that are now ready for an A/B test.
Now, in Platform -> A/B Tests -> Settings, you will be able to select other variants to compare against the Source of Truth variant.
Having selected the first model (Gemini 3 Pro Preview), Narev will recommend other alternatives to test. For example, DeepSeek v3.2 Speciale is an alternative to Gemini 3 Pro Preview. I then looked for an alternative to the DeepSeek model, and so on.
Narev's recommendation engine takes into account the performance of over 400 models and recommends those that achieve similar performance but offer significantly lower price or latency.
Finally, I ended with the following variants (with the same system prompt):
- Gemini 3 Pro Preview - our first model, tagged as state of the art
- DeepSeek v3.2 Speciale - flagship model from the DeepSeek team
- GPT-5 Nano - recommended by Narev as a competitor that performs similarly to DeepSeek's model in benchmarks
- Llama 3 70B Instruct - great open source model that I tried in many experiments in the past
- GPT OSS 120B - efficient open source model from OpenAI
Part 2: Select the Source of Truth model and the Quality Metric
Since we attached an expected output to every request, we have labels for the correct responses, and the platform can score each variant against them. We still pick a Source of Truth model to act as the baseline for the comparison.
For that, I selected the model I used in the notebook - Gemini 3 Pro Preview.
And finally I selected the quality metric. The best fit for this case is the LLM binary metric.
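To build intuition for what a binary quality metric does, here is a minimal local sketch that scores a model answer against the expected output. This is an illustration, not the platform's implementation; extract_number is a hypothetical helper with a simple heuristic:

import re
from typing import Optional

def extract_number(text: str) -> Optional[float]:
    # Pull the last number out of a model response (illustrative heuristic)
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def binary_score(model_answer: str, expected_output: float) -> int:
    # Return 1 if the extracted number matches the expected output, else 0
    predicted = extract_number(model_answer)
    return int(predicted is not None and abs(predicted - expected_output) < 1e-6)

print(binary_score("The answer is 72.", 72.0))  # 1
print(binary_score("I think it's 70.", 72.0))   # 0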
Part 3: Run the experiment
All that is left at this point is to run the actual experiment.
If all the items in the checklist have a tick, it's time to click Run and wait for the experiment to complete.
Analyzing the results
The Narev platform evaluated all variants against the baseline model. Results are based on the 50 prompts from the GSM8K math reasoning benchmark that we sent earlier.
Your results will vary depending on your specific use case. Math reasoning tasks may differ significantly from other applications like summarization, code generation, or creative writing.
| Variant | Cost per 1M requests | vs Baseline | Latency | vs Baseline | Quality | vs Baseline |
|---|---|---|---|---|---|---|
| Gemini 3 Pro Preview (Baseline) | $6,318.60 | - | 10,014.3ms | - | 100.0% | - |
| GPT-5 Nano | $134.69 | -97.9% | 6,726.98ms | -32.8% | 93.1% | -6.9% |
| DeepSeek v3.2 Speciale | $130.10 | -97.9% | 9,831.04ms | -1.8% | 46.4% | -53.6% |
| Llama 3 70B Instruct | $32.37 | -99.5% | 851.78ms | -91.5% | 31.0% | -69.0% |
| GPT OSS 120B | $31.80 | -99.5% | 2,781.48ms | -72.2% | 100.0% | 0.0% |
Key insights from the benchmark
After running this benchmark, we discovered:
- Perfect quality at 99.5% cost savings: GPT OSS 120B matches the baseline's 100% quality while costing only $31.80 per million requests—saving $6,286.80 compared to Gemini 3 Pro Preview
- Latency improves too: GPT OSS 120B is 72.2% faster than the baseline (2,781ms vs 10,014ms) and delivers responses in under 3 seconds, which is acceptable for most applications
- GPT-5 Nano offers balanced performance: At $134.69/1M with 93.1% quality and 32.8% faster latency, it provides a good middle ground for latency-sensitive applications
- Task alignment matters: Both DeepSeek v3.2 Speciale (46.4%) and Llama 3 70B Instruct (31.0%) performed poorly on math reasoning despite being far cheaper, demonstrating that model capabilities must match task complexity
For this specific math reasoning task, GPT OSS 120B is the clear winner, delivering perfect quality at just 0.5% of baseline cost—a remarkable 198x cost reduction with no quality sacrifice.
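As a quick check on that headline number, the cost reduction follows directly from the table above:

baseline_cost = 6318.60  # Gemini 3 Pro Preview, $ per 1M requests
winner_cost = 31.80      # GPT OSS 120B, $ per 1M requests

savings = baseline_cost - winner_cost   # $6,286.80 per 1M requests
share = winner_cost / baseline_cost     # ~0.005, i.e. ~0.5% of baseline cost
factor = baseline_cost / winner_cost    # ~198.7x cheaper
print(f"Savings: ${savings:,.2f} per 1M requests ({factor:.1f}x cost reduction)")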
The most helpful view for me is the Optimization Recommendation from the Narev platform (it's available at A/B Tests -> Results -> Summary). Here we can see the best alternative for each optimization category: Quality, Cost and Latency.
Let's break down this recommendation. We selected Gemini 3 Pro Preview as both our State of the Art model and the Baseline.
- Quality - GPT OSS 120B matches the baseline quality at 100%, meaning we found an equally capable model at a fraction of the cost
- Cost - GPT OSS 120B provides the best cost efficiency at $31.80 per million requests—a 99.5% reduction while maintaining perfect quality
- Latency - For applications requiring the fastest response times with high quality, GPT-5 Nano offers 32.8% faster responses while maintaining 93.1% quality
The chart helps visualize the tradeoff: the baseline in blue (Gemini 3 Pro Preview) versus the cost winner in amber (GPT OSS 120B). As we move to the left (cheaper models), quality stays at the same level.
In production, choosing the GPT OSS 120B model results in massive savings (in this case it's $31.80 per 1M requests vs $6,318.60 per 1M requests) with zero quality compromise.
Deploying to production
Once you've identified your optimal model, deployment is straightforward:
- In the Narev dashboard, set your chosen variant (e.g., GPT OSS 120B) as the production variant
- Continue using the same endpoint from Part 1:
https://www.narev.ai/api/applications/<YOUR_TEST_ID>/v1
- When you don't specify a model in your requests, Narev automatically routes to your production variant
# No model specified - automatically uses your production variant
response = requests.post(
url=f"{NAREV_BASE_URL}/chat/completions",
headers={"Authorization": f"Bearer {NAREV_API_KEY}"},
json={
"messages": [{"role": "user", "content": "What is 25 * 17?"}]
}
)
You can update your production variant anytime without changing your code. This enables seamless A/B testing and model upgrades in production.
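If you ever need to pin a specific model for a single request (for example, while evaluating a new candidate), you can still pass model explicitly, just as we did in the notebook. This sketch assumes per-request model selection keeps working the same way; the model identifier is illustrative:

# Explicit model override for a one-off request (illustrative model identifier;
# assumes per-request model selection works as in Part 2)
response = requests.post(
    url=f"{NAREV_BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {NAREV_API_KEY}"},
    json={
        "model": "openrouter:openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "What is 25 * 17?"}]
    }
)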
Resources
- Narev Platform Documentation
- A/B Testing Best Practices
- Data Source Options
- Understanding Evaluation Metrics
- GSM8K Benchmark Dataset
Next steps
- Try with your own prompts and use case
- Experiment with different system prompts to optimize further
- Explore prompt optimization techniques