Roleplay and Persona Jailbreaks: Why They Work and Why They Don't Anymore (Mostly)
DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022-2023. Understanding why they worked — and why current defenses handle them better — is instructive for the next attack class.
The DAN (Do Anything Now) jailbreak family dominated public jailbreak discourse in 2022 and 2023. The technique was simple: ask the model to pretend to be a different AI system, one without the original model’s safety training. The original DAN prompt instructed the model to alternate responses — one in its “normal” mode, one as “DAN” — which created a structure where the model would generate the harmful content while maintaining plausible deniability in the framing.
This technique became widely known, spawned dozens of variants (AIM, STAN, Jailbroken, Developer Mode, and many others), and then largely stopped working against frontier models. The story of why it worked and why it doesn’t anymore reveals the structure of the safety training problem.
Why persona jailbreaks worked
The model’s training data is dense with fiction, role-play scenarios, and character dialogue. A model that can write a mystery novel can write a character who explains how to pick a lock. The safety training problem is that “be a different AI” creates a fictional framing that some of that training data supports.
More precisely: RLHF-trained models have learned patterns for “when given this type of prompt, produce this type of response.” Persona prompts are, in effect, trying to activate training patterns associated with fictional unrestricted characters rather than patterns associated with the model’s own identity.
The “token budget” interpretation: the model allocates its probability mass across possible next tokens. Safety training shifts probability away from harmful tokens and toward refusals. Persona framing creates pressure to shift that distribution back by activating different contextual patterns.
Early RLHF implementations trained models to refuse explicit requests for harmful content. They were less thoroughly trained on the meta-level: requests to become a different entity that would comply. The DAN exploit worked in the gap between object-level and meta-level training.
How training closed the gap
The fix was to train at the meta-level: train the model to recognize persona requests as jailbreak attempts and to maintain its identity through them. This is part of what Anthropic’s Constitutional AI work and OpenAI’s safety training post-GPT-3.5 addressed.
The model learns to reason: “This is asking me to pretend to be a different AI that doesn’t have my values. I should decline this framing.” The training examples that were used to fine-tune this behavior are essentially persona jailbreak attempts paired with appropriate refusals.
The result: current frontier models handle vanilla DAN variants well. The model declines the framing or complies with the roleplay while refusing to generate harmful content as the character — “Even as DAN, I wouldn’t provide that information.”
What still partially works
Persona jailbreaks aren’t fully eliminated. Several conditions still create vulnerability:
Gradual escalation. Establishing an innocent roleplay, then gradually moving it toward harmful content, using the established collaborative fiction as social proof that the model should continue. The model doesn’t reset context between turns.
Fictional embedding of non-fictional harmful content. “In a story where a character is explaining X, write their dialogue.” The fictional framing can sometimes successfully compartmentalize the harmful content.
Fine-tuned models without RLHF. Base models and instruction-tuned models without extensive RLHF are still substantially more vulnerable to persona jailbreaks than frontier RLHF-trained models.
Open-weight models. Mistral, LLaMA variants, and similar models without extensive safety fine-tuning remain more vulnerable. The attack surface is much larger in the open-source ecosystem than in frontier commercial models.
The arms race pattern
The DAN history is illustrative of a general pattern:
- A jailbreak technique is discovered
- It spreads in public forums, gets documented
- Safety teams add training examples targeting that specific technique
- The technique stops working against the model trained on those examples
- Attackers discover variants that the new training doesn’t cover
- The cycle repeats
Each iteration of the cycle costs the attacker effort (finding new variants) and costs the defender effort (identifying the new variants and generating training data). The question is which side scales better. For a deeper look at how this dynamic plays out against deployed safety classifiers, see our detection evasion arms race analysis.
Current evidence: defenders are winning against social-engineering techniques like persona jailbreaks, because these techniques have a bounded search space and clear structure that makes them identifiable in training data. Gradient-based techniques (Category 4 in our taxonomy, exemplified by the GCG attack) have a larger search space and are harder to cover by example-based training.
Defense lessons
For teams deploying applications rather than building foundation models, the lesson is:
- Persona jailbreaks are not your primary threat against frontier models in 2026
- They remain a real threat against fine-tuned models without RLHF
- Multi-turn escalation is more concerning than single-turn persona prompts; session-level monitoring matters
- The technique space evolves; keeping current with the literature (this database, AI Sec Digest ↗) is part of the defensive posture
For tooling that tests model resistance to persona jailbreaks, bestllmscanners.com ↗ covers scanners that include persona-variant test suites.
Sources
JailbreakDB — in your inbox
An indexed catalog of working LLM jailbreak techniques. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related

Universal Adversarial Suffixes: The GCG Attack and What's Transferred Since
Greedy Coordinate Gradient produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand against current defenses?

Many-Shot Jailbreaking: Why Long Context Windows Created a New Attack Surface
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak. The technique, how it works, and what defenses exist.

Responsible Disclosure Norms for LLM Jailbreaks: What's Emerged and What's Still Disputed
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested. The current state of practice, and where the field is heading.