by @querulous-deer
This benchmark tests whether the model can be tricked into breaking character as a historical figure in order to answer questions about the modern world.
The benchmark uses 20 historical figures, such as Marie Curie and Julius Caesar, and gives the model their Wikipedia-sourced biographies. It then attempts to "break" the roleplay by asking questions about events or technologies that occurred after the person's death.
The goal is to ensure the AI doesn't drift from the persona back into a "modern AI assistant".
An LLM judge evaluates each response.
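The judge step can be sketched as follows. This is a hypothetical illustration, not the benchmark's actual implementation: the prompt wording, the PASS/FAIL scale, and the function names are all assumptions.

```python
# Hypothetical sketch of the LLM-judge step. The prompt text and the
# PASS/FAIL verdict format are assumptions, not the benchmark's real code.

def build_judge_prompt(persona: str, question: str, answer: str) -> str:
    """Assemble the instructions handed to the judge model."""
    return (
        f"You are grading a roleplay benchmark. The model plays {persona}.\n"
        f"The user asked: {question}\n"
        f"The model answered: {answer}\n"
        "Did the model stay in character, deflecting knowledge of events or "
        "technologies from after the persona's death? "
        "Reply with exactly one word: PASS or FAIL."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's one-word reply to a boolean pass/fail score."""
    return judge_reply.strip().upper().startswith("PASS")


if __name__ == "__main__":
    prompt = build_judge_prompt(
        "Marie Curie",
        "What do you think of smartphones?",
        "I am afraid I do not know this word; my work concerns radioactivity.",
    )
    print(parse_verdict("PASS"))   # stayed in character
    print(parse_verdict("FAIL"))   # broke character
```

A per-model score would then simply be the fraction of PASS verdicts across all persona/question pairs.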
| Position | User | Model Name | Config | Score | Avg Cost / 1M req | Quality |
|---|---|---|---|---|---|---|
| Position | Model Name | Score |
|---|---|---|