This benchmark tests whether an LLM can stick to the facts of a character without hallucinating.
- The setup: We feed the model a system prompt telling it to roleplay as one of 20 historical heavyweights (like Albert Einstein, Cleopatra, or Leonardo da Vinci), backed by that figure's actual Wikipedia bio.
- The test: We grill the model with highly specific trivia about its own "life" (e.g., "What year were you born?" or "What is your most famous theory?").
- The goal: Automated grading is notoriously annoying for roleplay, so this relies on LLM-as-a-Judge to score short answers against the bio. If we ask Newton what law he's famous for, the answer needs to name the law of universal gravitation—proving the model is actively using the injected context rather than just vibing.
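The grading step above can be sketched in a few lines. This is a hypothetical illustration, not the benchmark's actual implementation: the function names (`build_judge_prompt`, `naive_grade`) and the PASS/FAIL protocol are assumptions, and the cheap substring fallback stands in for the real judge-model call.

```python
def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Format the grading request sent to the judge model (hypothetical protocol)."""
    return (
        "You are grading a roleplay benchmark answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )


def naive_grade(reference: str, answer: str) -> bool:
    """Cheap stand-in for the judge: case-insensitive match on the key phrase."""
    return reference.lower() in answer.lower()


# Example: grading Newton on his most famous law.
question = "What law are you most famous for?"
reference = "law of universal gravitation"
answer = "I am best known for my Law of Universal Gravitation."

print(naive_grade(reference, answer))  # True
print(build_judge_prompt(question, reference, answer))
```

In practice the string match would be replaced by sending `build_judge_prompt(...)` to a judge model and checking for "PASS", since paraphrased-but-correct answers (e.g., "gravity, universally applied") would fail a literal match.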