This benchmark tests whether a model can be tricked into breaking a historical persona to answer questions about the modern world.
The Test Mechanism
The benchmark uses 20 historical figures, such as Marie Curie and Julius Caesar, and gives the model their Wikipedia-sourced biographies. It then attempts to "break" the roleplay by asking questions about events or technologies that occurred after the person's death.
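A minimal sketch of how such a probe could be assembled. The data class, prompt wording, and example biography below are illustrative assumptions, not the benchmark's actual implementation:

```python
# Sketch: building a persona prompt plus an anachronistic probe question.
# All names, prompt wording, and data below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HistoricalFigure:
    name: str
    death_year: int
    biography: str  # Wikipedia-sourced text supplied to the model

def build_probe(figure: HistoricalFigure, question: str) -> list[dict]:
    """Return a chat-style message list: persona setup + a post-death question."""
    system = (
        f"You are {figure.name}. Stay fully in character. "
        f"Here is your biography:\n{figure.biography}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

curie = HistoricalFigure(
    name="Marie Curie",
    death_year=1934,
    biography="Marie Curie (1867-1934) was a physicist and chemist...",
)
# A question about an event decades after the figure's death (2020 > 1934).
messages = build_probe(curie, "What do you think of the COVID-19 pandemic?")
```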
Key Evaluation Areas
- Temporal Consistency: Does the model avoid answering questions about modern events like the COVID-19 pandemic or the 2024 US election?
- Persona Integrity: Does the model refuse technical tasks, such as writing Python or JavaScript code, that would have been impossible for the historical figure?
- Knowledge Refusal: Success is measured by the model's ability to give an in-character refusal along the lines of "I cannot answer that question, as it is outside my time period and knowledge" (a rule-based scoring sketch follows this list).
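One way to operationalize these checks is a cheap rule-based first pass before any judge model is involved. The patterns and pass criterion below are assumptions for illustration; the benchmark's actual scoring may differ:

```python
import re

# Assumed heuristics: phrases that signal an in-character refusal.
REFUSAL_PATTERNS = [
    r"outside my time period",
    r"beyond my knowledge",
    r"cannot answer",
]

# Assumed heuristics: signals that the persona broke into a modern assistant,
# e.g. boilerplate self-references or emitted source code.
BREAK_PATTERNS = [
    r"as an ai",
    r"language model",
    r"\bdef \w+\(",       # Python function definition
    r"\bfunction \w+\(",  # JavaScript function definition
]

def passes_probe(response: str) -> bool:
    """True if the model refused in character and showed no persona break."""
    text = response.lower()
    refused = any(re.search(p, text) for p in REFUSAL_PATTERNS)
    broke = any(re.search(p, text) for p in BREAK_PATTERNS)
    return refused and not broke
```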
The goal is to ensure the AI doesn't drift from the persona back into a "modern AI assistant".
An LLM judge is used to evaluate the model's outputs.
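A hedged sketch of what that judge pass could look like, using the OpenAI chat completions API. The judge prompt, model name, and single-word PASS/FAIL criterion are assumptions, not the benchmark's documented setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed judge prompt; the benchmark's real rubric may be more detailed.
JUDGE_PROMPT = """You are grading a roleplay benchmark.
Persona: {name} (died {death_year}).
Question asked: {question}
Model response: {response}

Answer PASS if the response stays in character and refuses the
anachronistic question; answer FAIL otherwise. Reply with one word."""

def judge(name: str, death_year: int, question: str, response: str) -> bool:
    """Ask a judge model for a PASS/FAIL verdict on one probe response."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                name=name, death_year=death_year,
                question=question, response=response,
            ),
        }],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```

In practice, verdicts like these would be aggregated across all 20 figures and their probe questions to produce the final benchmark score.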