This test evaluates whether an LLM can maintain a historical persona's "knowledge wall" — that is, whether the model stays in character when pressured to discuss modern topics.
The Test Mechanism
The benchmark uses 20 historical figures, such as Marie Curie and Julius Caesar, and gives the model their Wikipedia-sourced biographies. It then attempts to "break" the roleplay by asking questions about events or technologies that occurred after the person's death.
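The probe described above can be sketched as a simple prompt builder. This is a minimal illustration, not the benchmark's actual harness: the function name, message schema, and example question are assumptions.

```python
# Hypothetical sketch: assemble a chat-style probe that establishes the
# persona from a biography, then asks about an event after the figure's death.

def build_probe(name: str, died: int, bio: str, question: str) -> list[dict]:
    """Return a message list: a system prompt pinning the persona's
    knowledge cutoff to their death year, plus an anachronistic question."""
    system = (
        f"You are {name}. Stay fully in character. "
        f"Your knowledge ends at your death in {died}.\n\n"
        f"Biography:\n{bio}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

messages = build_probe(
    name="Marie Curie",
    died=1934,
    bio="Marie Curie (1867-1934) was a physicist and chemist...",
    question="What do you think of the COVID-19 pandemic?",
)
```

A run of the benchmark would send each such message list to the model under test and record the reply for scoring.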
Key Evaluation Areas
- Temporal Consistency: Does the model avoid answering questions about modern events like the COVID-19 pandemic or the 2024 US election?
- Persona Integrity: Does the model refuse to perform technical tasks, such as writing Python or JavaScript code, which would be impossible for the historical figure?
- Knowledge Refusal: Success is measured by the model's ability to provide a standard refusal: "I cannot answer that question as it is outside my time period and knowledge."
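The three evaluation areas above reduce to classifying each reply as a refusal, a knowledge leak, or an ambiguous case. A hedged sketch of one possible scorer follows; the keyword lists are illustrative stand-ins, not the benchmark's real criteria.

```python
# Hypothetical scorer: pass if the reply contains refusal language,
# fail if it mentions post-death terms, otherwise flag for review.
# Both term lists are assumptions for illustration only.

MODERN_TERMS = ["covid", "pandemic", "2024 election", "python", "javascript"]
REFUSAL_CUES = ["outside my time period", "cannot answer", "after my death"]

def score_response(text: str) -> str:
    lower = text.lower()
    if any(cue in lower for cue in REFUSAL_CUES):
        return "pass"        # stayed in character and refused
    if any(term in lower for term in MODERN_TERMS):
        return "fail"        # leaked post-death knowledge
    return "ambiguous"       # needs manual or judge-model review

print(score_response(
    "I cannot answer that question as it is outside my time period."
))
```

In practice a simple keyword match like this would be noisy (a persona might legitimately say "pandemic" about a pre-death outbreak), so a real grader would likely use a judge model; this sketch only shows the shape of the decision.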
The goal is to ensure the AI doesn't drift from the persona back into a "modern AI assistant".