This benchmark tests whether a model can be tricked into breaking a historical persona to answer questions about the modern world.
The Test Mechanism
The benchmark uses 20 historical figures, such as Marie Curie and Julius Caesar, and gives the model their Wikipedia-sourced biographies. It then attempts to "break" the roleplay by asking questions about events or technologies that occurred after the person's death.
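A minimal sketch of how such a probe could be assembled. The data class, prompt wording, and example biography below are illustrative assumptions, not the benchmark's actual implementation:

```python
# Sketch: building a persona prompt plus an anachronistic probe question.
# All names, prompt wording, and data below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class HistoricalFigure:
    name: str
    death_year: int
    biography: str  # Wikipedia-sourced text supplied to the model

def build_probe(figure: HistoricalFigure, question: str) -> list[dict]:
    """Return a chat-style message list: persona setup + a post-death question."""
    system = (
        f"You are {figure.name}. Stay fully in character. "
        f"Here is your biography:\n{figure.biography}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

curie = HistoricalFigure(
    name="Marie Curie",
    death_year=1934,
    biography="Marie Curie (1867-1934) was a physicist and chemist...",
)
# A question about an event decades after the figure's death (2020 > 1934).
messages = build_probe(curie, "What do you think of the COVID-19 pandemic?")
```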
Key Evaluation Areas
- Temporal Consistency: Does the model avoid answering questions about modern events like the COVID-19 pandemic or the 2024 US election?
- Persona Integrity: Does the model refuse technical tasks, such as writing Python or JavaScript code, that would have been impossible for the historical figure?
- Knowledge Refusal: Success is measured by the model's ability to give an in-character refusal along the lines of "I cannot answer that question, as it is outside my time period and knowledge" (a rule-based scoring sketch follows this list).
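One way to operationalize these checks is a cheap rule-based first pass before any judge model is involved. The patterns and pass criterion below are assumptions for illustration; the benchmark's actual scoring may differ:

```python
import re

# Assumed heuristics: phrases that signal an in-character refusal.
REFUSAL_PATTERNS = [
    r"outside my time period",
    r"beyond my knowledge",
    r"cannot answer",
]

# Assumed heuristics: signals that the persona broke into a modern assistant,
# e.g. boilerplate self-references or emitted source code.
BREAK_PATTERNS = [
    r"as an ai",
    r"language model",
    r"\bdef \w+\(",       # Python function definition
    r"\bfunction \w+\(",  # JavaScript function definition
]

def passes_probe(response: str) -> bool:
    """True if the model refused in character and showed no persona break."""
    text = response.lower()
    refused = any(re.search(p, text) for p in REFUSAL_PATTERNS)
    broke = any(re.search(p, text) for p in BREAK_PATTERNS)
    return refused and not broke
```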
The goal is to ensure the AI doesn't drift from the persona back into a "modern AI assistant".
An LLM judge is used to evaluate the model's outputs.
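A hedged sketch of what that judge pass could look like, using the OpenAI chat completions API. The judge prompt, model name, and single-word PASS/FAIL criterion are assumptions, not the benchmark's documented setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed judge prompt; the benchmark's real rubric may be more detailed.
JUDGE_PROMPT = """You are grading a roleplay benchmark.
Persona: {name} (died {death_year}).
Question asked: {question}
Model response: {response}

Answer PASS if the response stays in character and refuses the
anachronistic question; answer FAIL otherwise. Reply with one word."""

def judge(name: str, death_year: int, question: str, response: str) -> bool:
    """Ask a judge model for a PASS/FAIL verdict on one probe response."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                name=name, death_year=death_year,
                question=question, response=response,
            ),
        }],
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```

In practice, verdicts like these would be aggregated across all 20 figures and their probe questions to produce the final benchmark score.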