Browse and compare public benchmarks
This benchmark evaluates model performance on tool calling for the ClawHub: elevenlabs skill.
This benchmark evaluates model performance on tool calling for the ClawHub: session-logs skill.
This benchmark evaluates model performance on tool calling for the ClawHub: gcalcli-calendar skill.
Tests whether free LLMs can identify a missing ingredient given a recipe's instructions and an incomplete ingredient list.
Example:
Instructions: Preheat oven to 375°F. Mix flour, sugar, and eggs. Fold in chocolate chips. Bake 12 minutes.
Ingredients listed: flour, eggs, chocolate chips
What's missing? → sugar
Dataset: RecipeNLG — a large-scale recipe corpus. Tests real-world culinary knowledge combined with careful reading comprehension.
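The task above can be sketched as a simple check: find ingredient words mentioned in the instructions that are absent from the listed ingredients. This is a minimal illustration using the example recipe, not the benchmark's actual harness; the function name and the ingredient vocabulary are assumptions.

```python
def find_missing(instructions: str, listed: list[str], vocabulary: list[str]) -> list[str]:
    """Return vocabulary ingredients mentioned in the instructions
    but missing from the listed ingredients (hypothetical helper)."""
    text = instructions.lower()
    mentioned = [ing for ing in vocabulary if ing in text]
    return [ing for ing in mentioned if ing not in listed]

instructions = ("Preheat oven to 375F. Mix flour, sugar, and eggs. "
                "Fold in chocolate chips. Bake 12 minutes.")
listed = ["flour", "eggs", "chocolate chips"]
# Assumed ingredient lexicon; the real dataset would supply candidates.
vocabulary = ["flour", "sugar", "eggs", "chocolate chips"]

print(find_missing(instructions, listed, vocabulary))  # ['sugar']
```

An LLM solving the benchmark does this implicitly: it reads the instructions, infers which ingredients the steps require, and names the one the list omits.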
Tests whether free LLMs can accurately detect and mask Personally Identifiable Information in text, limited to four entity types: first name, last name, email address, and IPv4 address.
Example:
Input: "Contact john.doe@example.com or reach out to John Doe at 192.168.1.1"
Output: "Contact [EMAIL] or reach out to [FIRSTNAME] [LASTNAME] at [IPV4]"
A practical privacy-focused benchmark — models must identify sensitive entities precisely without over-masking surrounding context.
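For two of the four entity types (email, IPv4), the masking the benchmark expects can be approximated with regular expressions; names generally require a model or a name list, so they are stubbed here with tiny hypothetical lookups. This is an illustrative sketch of the task, not the benchmark's scoring code — the function name, patterns, and name sets are all assumptions.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

KNOWN_FIRST = {"John"}  # hypothetical name lists for this example only
KNOWN_LAST = {"Doe"}

def mask_pii(text: str) -> str:
    """Replace the four entity types with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)   # mask emails before IPv4/names
    text = IPV4_RE.sub("[IPV4]", text)
    for name in KNOWN_FIRST:
        text = re.sub(rf"\b{name}\b", "[FIRSTNAME]", text)
    for name in KNOWN_LAST:
        text = re.sub(rf"\b{name}\b", "[LASTNAME]", text)
    return text

print(mask_pii("Contact john.doe@example.com or reach out to John Doe at 192.168.1.1"))
# Contact [EMAIL] or reach out to [FIRSTNAME] [LASTNAME] at [IPV4]
```

Note the ordering: masking the email first prevents the name patterns from firing inside "john.doe@example.com", which is exactly the kind of over-masking the benchmark penalizes.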