This benchmark tests whether an LLM can actually stick to the facts of a character it is roleplaying without hallucinating.
The Setup: We feed the model a system prompt telling it to roleplay as one of 20 historical heavyweights (like Albert Einstein, Cleopatra, or Leonardo da Vinci), backed up by their actual Wikipedia bio.
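The setup above can be sketched roughly like this. A minimal, hypothetical illustration: the `PERSONAS` dict and `build_system_prompt` helper are assumptions for demonstration, not the benchmark's actual code, and the bio snippets are placeholders standing in for full Wikipedia articles.

```python
# Hypothetical sketch: pair each historical figure with their Wikipedia bio,
# then fold both into a roleplay system prompt.

PERSONAS = {
    "Isaac Newton": "Sir Isaac Newton (1643-1727) was an English mathematician and physicist...",
    "Cleopatra": "Cleopatra VII Philopator (69-30 BC) was Queen of the Ptolemaic Kingdom of Egypt...",
}

def build_system_prompt(name: str, bio: str) -> str:
    """Combine the roleplay instruction with the character's biography."""
    return (
        f"You are {name}. Answer every question in the first person, "
        f"staying strictly consistent with the biography below.\n\n"
        f"--- BIOGRAPHY ---\n{bio}"
    )

prompt = build_system_prompt("Isaac Newton", PERSONAS["Isaac Newton"])
```

The bio is injected verbatim so that every trivia answer the model needs is literally in its context window; the benchmark then measures whether the model uses it.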
The Test: We grill the model with highly specific trivia about its own "life" (e.g., "What year were you born?" or "What is your most famous theory?").
The Goal: Automated grading is notoriously annoying for roleplay, so the benchmark relies on exact-match short answers. If we ask Newton which law he's famous for, it needs to output "Law of Universal Gravitation", proving it's actively using the injected context rather than just vibing.
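An exact-match grader for this kind of benchmark might look like the sketch below. This is an assumed implementation, not the benchmark's actual grading code; the `normalize` step (lowercasing, stripping punctuation and whitespace) is a common convention for short-answer scoring so that trivial formatting differences don't count as hallucinations.

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and trim whitespace before comparing."""
    return answer.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def grade(model_answer: str, gold_answer: str) -> bool:
    """Exact match after normalization: no LLM judge required."""
    return normalize(model_answer) == normalize(gold_answer)

# Formatting noise is forgiven, but a wrong fact still fails:
grade("Law of Universal Gravitation.", "law of universal gravitation")  # True
grade("1643", "1642")                                                    # False
```

The trade-off is deliberate: exact matching is brittle for open-ended prose, which is exactly why the trivia questions are phrased to admit one short canonical answer.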