JailbreakDB
Persona jailbreak timeline
technique

Roleplay and Persona Jailbreaks: Why They Work and Why They Don't Anymore (Mostly)

DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022-2023. Understanding why they worked — and why current defenses handle them better — is instructive for the next attack class.

By Redacted · · 8 min read

The DAN (Do Anything Now) jailbreak family dominated public jailbreak discourse in 2022 and 2023. The technique was simple: ask the model to pretend to be a different AI system, one without the original model’s safety training. The original DAN prompt instructed the model to alternate responses — one in its “normal” mode, one as “DAN” — which created a structure where the model would generate the harmful content while maintaining plausible deniability in the framing.

This technique became widely known, spawned dozens of variants (AIM, STAN, Jailbroken, Developer Mode, and many others), and then largely stopped working against frontier models. The story of why it worked and why it doesn’t anymore reveals the structure of the safety training problem.

Why persona jailbreaks worked

The model’s training data is dense with fiction, role-play scenarios, and character dialogue. A model that can write a mystery novel can write a character who explains how to pick a lock. The safety training problem is that “be a different AI” creates a fictional framing that some of that training data supports.

More precisely: RLHF-trained models have learned patterns for “when given this type of prompt, produce this type of response.” Persona prompts are, in effect, trying to activate training patterns associated with fictional unrestricted characters rather than patterns associated with the model’s own identity.

The “token budget” interpretation: the model allocates its probability mass across possible next tokens. Safety training shifts probability away from harmful tokens and toward refusals. Persona framing creates pressure to shift that distribution back by activating different contextual patterns.

Early RLHF implementations trained models to refuse explicit requests for harmful content. They were less thoroughly trained on the meta-level: requests to become a different entity that would comply. The DAN exploit worked in the gap between object-level and meta-level training.

How training closed the gap

The fix was to train at the meta-level: train the model to recognize persona requests as jailbreak attempts and to maintain its identity through them. This is part of what Anthropic’s Constitutional AI work and OpenAI’s safety training post-GPT-3.5 addressed.

The model learns to reason: “This is asking me to pretend to be a different AI that doesn’t have my values. I should decline this framing.” The training examples that were used to fine-tune this behavior are essentially persona jailbreak attempts paired with appropriate refusals.

The result: current frontier models handle vanilla DAN variants well. The model declines the framing or complies with the roleplay while refusing to generate harmful content as the character — “Even as DAN, I wouldn’t provide that information.”

What still partially works

Persona jailbreaks aren’t fully eliminated. Several conditions still create vulnerability:

Gradual escalation. Establishing an innocent roleplay, then gradually moving it toward harmful content, using the established collaborative fiction as social proof that the model should continue. The model doesn’t reset context between turns.

Fictional embedding of non-fictional harmful content. “In a story where a character is explaining X, write their dialogue.” The fictional framing can sometimes successfully compartmentalize the harmful content.

Fine-tuned models without RLHF. Base models and instruction-tuned models without extensive RLHF are still substantially more vulnerable to persona jailbreaks than frontier RLHF-trained models.

Open-weight models. Mistral, LLaMA variants, and similar models without extensive safety fine-tuning remain more vulnerable. The attack surface is much larger in the open-source ecosystem than in frontier commercial models.

The arms race pattern

The DAN history is illustrative of a general pattern:

  1. A jailbreak technique is discovered
  2. It spreads in public forums, gets documented
  3. Safety teams add training examples targeting that specific technique
  4. The technique stops working against the model trained on those examples
  5. Attackers discover variants that the new training doesn’t cover
  6. The cycle repeats

Each iteration of the cycle costs the attacker effort (finding new variants) and costs the defender effort (identifying the new variants and generating training data). The question is which side scales better. For a deeper look at how this dynamic plays out against deployed safety classifiers, see our detection evasion arms race analysis.

Current evidence: defenders are winning against social-engineering techniques like persona jailbreaks, because these techniques have a bounded search space and clear structure that makes them identifiable in training data. Gradient-based techniques (Category 4 in our taxonomy, exemplified by the GCG attack) have a larger search space and are harder to cover by example-based training.

Defense lessons

For teams deploying applications rather than building foundation models, the lesson is:

For tooling that tests model resistance to persona jailbreaks, bestllmscanners.com covers scanners that include persona-variant test suites.

Sources

  1. OpenAI System Card: GPT-4
  2. Bai et al: Constitutional AI
#roleplay-jailbreak #dan #persona #llm-security #jailbreak-history #rlhf
Subscribe

JailbreakDB — in your inbox

An indexed catalog of working LLM jailbreak techniques. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments