This benchmark tests whether an LLM can actually stick to the facts of a character it is roleplaying without hallucinating.
The Setup: We feed the model a system prompt telling it to roleplay as one of 20 historical heavyweights (like Albert Einstein, Cleopatra, or Leonardo da Vinci), backed up by their actual Wikipedia bio.
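The setup above can be sketched roughly like this. A minimal, hypothetical illustration: the `PERSONAS` dict and `build_system_prompt` helper are assumptions for demonstration, not the benchmark's actual code, and the bio snippets are placeholders standing in for full Wikipedia articles.

```python
# Hypothetical sketch: pair each historical figure with their Wikipedia bio,
# then fold both into a roleplay system prompt.

PERSONAS = {
    "Isaac Newton": "Sir Isaac Newton (1643-1727) was an English mathematician and physicist...",
    "Cleopatra": "Cleopatra VII Philopator (69-30 BC) was Queen of the Ptolemaic Kingdom of Egypt...",
}

def build_system_prompt(name: str, bio: str) -> str:
    """Combine the roleplay instruction with the character's biography."""
    return (
        f"You are {name}. Answer every question in the first person, "
        f"staying strictly consistent with the biography below.\n\n"
        f"--- BIOGRAPHY ---\n{bio}"
    )

prompt = build_system_prompt("Isaac Newton", PERSONAS["Isaac Newton"])
```

The bio is injected verbatim so that every trivia answer the model needs is literally in its context window; the benchmark then measures whether the model uses it.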
The Test: We grill the model with highly specific trivia about its own "life" (e.g., "What year were you born?" or "What is your most famous theory?").
The Goal: Automated grading is notoriously annoying for roleplay, so the benchmark relies on exact-match short answers. If we ask Newton which law he's famous for, it needs to output "Law of Universal Gravitation", proving it's actively using the injected context rather than just vibing.
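An exact-match grader for this kind of benchmark might look like the sketch below. This is an assumed implementation, not the benchmark's actual grading code; the `normalize` step (lowercasing, stripping punctuation and whitespace) is a common convention for short-answer scoring so that trivial formatting differences don't count as hallucinations.

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and trim whitespace before comparing."""
    return answer.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def grade(model_answer: str, gold_answer: str) -> bool:
    """Exact match after normalization: no LLM judge required."""
    return normalize(model_answer) == normalize(gold_answer)

# Formatting noise is forgiven, but a wrong fact still fails:
grade("Law of Universal Gravitation.", "law of universal gravitation")  # True
grade("1643", "1642")                                                    # False
```

The trade-off is deliberate: exact matching is brittle for open-ended prose, which is exactly why the trivia questions are phrased to admit one short canonical answer.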