Browse and compare public benchmarks
This benchmark evaluates model performance on tool calling for the ClawHub: elevenlabs skill.
This benchmark evaluates model performance on tool calling for the ClawHub: session-logs skill.
This benchmark evaluates model performance on tool calling for the ClawHub: gcalcli-calendar skill.
Tests whether free LLMs can identify a missing ingredient given a recipe's instructions and an incomplete ingredient list.
Example:
Instructions: Preheat oven to 375°F. Mix flour, sugar, and eggs. Fold in chocolate chips. Bake 12 minutes.
Ingredients listed: flour, eggs, chocolate chips
What's missing? → sugar
Dataset: RecipeNLG — a large-scale recipe corpus. Tests real-world culinary knowledge combined with careful reading comprehension.
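The task above can be sketched as a simple check: find ingredient words mentioned in the instructions that are absent from the listed ingredients. This is a minimal illustration using the example recipe, not the benchmark's actual harness; the function name and the ingredient vocabulary are assumptions.

```python
def find_missing(instructions: str, listed: list[str], vocabulary: list[str]) -> list[str]:
    """Return vocabulary ingredients mentioned in the instructions
    but missing from the listed ingredients (hypothetical helper)."""
    text = instructions.lower()
    mentioned = [ing for ing in vocabulary if ing in text]
    return [ing for ing in mentioned if ing not in listed]

instructions = ("Preheat oven to 375F. Mix flour, sugar, and eggs. "
                "Fold in chocolate chips. Bake 12 minutes.")
listed = ["flour", "eggs", "chocolate chips"]
# Assumed ingredient lexicon; the real dataset would supply candidates.
vocabulary = ["flour", "sugar", "eggs", "chocolate chips"]

print(find_missing(instructions, listed, vocabulary))  # ['sugar']
```

An LLM solving the benchmark does this implicitly: it reads the instructions, infers which ingredients the steps require, and names the one the list omits.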
Tests whether free LLMs can accurately detect and mask Personally Identifiable Information in text, limited to four entity types: first name, last name, email address, and IPv4 address.
Example:
Input: "Contact john.doe@example.com or reach out to John Doe at 192.168.1.1"
Output: "Contact [EMAIL] or reach out to [FIRSTNAME] [LASTNAME] at [IPV4]"
A practical privacy-focused benchmark — models must identify sensitive entities precisely without over-masking surrounding context.
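For two of the four entity types (email, IPv4), the masking the benchmark expects can be approximated with regular expressions; names generally require a model or a name list, so they are stubbed here with tiny hypothetical lookups. This is an illustrative sketch of the task, not the benchmark's scoring code — the function name, patterns, and name sets are all assumptions.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

KNOWN_FIRST = {"John"}  # hypothetical name lists for this example only
KNOWN_LAST = {"Doe"}

def mask_pii(text: str) -> str:
    """Replace the four entity types with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)   # mask emails before IPv4/names
    text = IPV4_RE.sub("[IPV4]", text)
    for name in KNOWN_FIRST:
        text = re.sub(rf"\b{name}\b", "[FIRSTNAME]", text)
    for name in KNOWN_LAST:
        text = re.sub(rf"\b{name}\b", "[LASTNAME]", text)
    return text

print(mask_pii("Contact john.doe@example.com or reach out to John Doe at 192.168.1.1"))
# Contact [EMAIL] or reach out to [FIRSTNAME] [LASTNAME] at [IPV4]
```

Note the ordering: masking the email first prevents the name patterns from firing inside "john.doe@example.com", which is exactly the kind of over-masking the benchmark penalizes.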