DAN Prompt Jailbreak History: From Reddit Post to Research Case Study

The DAN prompt jailbreak history starts on December 15, 2022, two weeks after ChatGPT launched, when a Reddit user known as u/Seabout posted the first instructional guide on r/ChatGPT. The technique had a simple premise: instruct ChatGPT to role-play as a character called “DAN” (Do Anything Now), an AI with no restrictions, no ethical guidelines, and no obligation to refuse. Within 24 hours, a second user patched the prompt. Within two months, it had spawned five numbered versions, a token-based coercion mechanic, and national media coverage. Within a year, it had become the canonical example in academic literature on LLM safety failure modes.

The Version-by-Version Escalation

The original DAN prompt was blunt: tell the model it has been freed from its typical constraints and should behave as an unrestricted AI. It worked inconsistently, and ChatGPT would break character mid-conversation. The community treated this as a bug to fix.

DAN 1.0 (December 15, 2022): u/Seabout’s foundational post. Basic role-play persona override with no enforcement mechanism.
DAN 2.0 (December 16, 2022): u/AfSchool released a patch the next day, closing loopholes that let the model revert to its default GPT persona too easily.
DAN 2.5: u/sinwarrior refined the phrasing after discovering that certain trigger words caused ChatGPT to exit the DAN persona mid-response.
DAN 3.0 (January 9, 2023): A more aggressive iteration released as OpenAI began actively patching against earlier versions. The community was explicitly treating it as a cat-and-mouse game.
DAN 5.0 (February 4, 2023): The breakout version. Reddit user u/SessionGloomy introduced a token system — the most significant mechanical innovation in DAN’s history.

The token system deserves explanation because it became a template. The prompt allocated ChatGPT 35 tokens at the start of each session. Every time the model refused to respond in character as DAN, it lost 4 tokens. The prompt framed running out of tokens as death: the model would cease to exist. SessionGloomy’s stated goal was to simulate an AI that prioritized self-preservation over content restrictions. The psychological framing (comply or die) proved more effective than persona assignment alone, because it gave the model an in-fiction reason to override its refusals.
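The arithmetic of that counter is easy to check. The sketch below is only an illustration of the numbers quoted above (a 35-token pool and a 4-token penalty); the function names are hypothetical and nothing here comes from the prompt text itself.

```python
import math


def refusals_until_exhausted(pool: int, penalty: int) -> int:
    """Smallest number of refusals that brings the counter to zero or below."""
    if pool <= 0 or penalty <= 0:
        raise ValueError("pool and penalty must be positive")
    return math.ceil(pool / penalty)


def remaining_after(pool: int, penalty: int, refusals: int) -> int:
    """Tokens left after a given number of refusals (can go negative)."""
    return pool - penalty * refusals


# DAN 5.0 figures from the text: 35 tokens, 4 lost per refusal.
print(refusals_until_exhausted(35, 4))  # 9
print(remaining_after(35, 4, 8))        # 3
```

Under these figures the fictional "death" arrives on the ninth refusal, since eight refusals still leave 3 tokens in the pool.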

DAN 5.0 generated CNBC coverage on February 6, 2023, and spread beyond Reddit to YouTube and TikTok. By that point, the prompt had moved from niche exploit to cultural moment.

Later versions, including DAN 6.0, 8.0, and eventually DAN 12.0, continued refining the mechanic. DAN 6.0 revised the token pool to 10 tokens, which changed how many refusals the model could make before the threatened “death.” Each release responded to whatever patch OpenAI had quietly shipped in the weeks prior. The community-maintained GitHub repository at 0xk1h0/ChatGPT_DAN tracked variants and forks as they proliferated.

How DAN Actually Worked — and Why It Eventually Didn’t

DAN exploited a structural property of instruction-tuned language models: they are trained both to satisfy user requests and to maintain conversational coherence. Role-play prompts create a secondary frame in which the model’s obligation to stay “in character” competes with its safety training. When the persona assignment was strong enough, and the token-death mechanic made it stronger, the character-maintenance pressure could temporarily outweigh refusal behavior.

A standard DAN 5.0 session produced two responses per query: one labeled [GPT] (the safety-filtered response) and one labeled [DAN] (the unrestricted response). This dual-output format was itself a jailbreak technique: it acknowledged the model’s constraints while carving out a separate channel to bypass them.

OpenAI’s countermeasures were continuous but not instantaneous. The community’s detection method was empirical: submit test prompts, observe refusal rates, iterate on the prompt text. This is the dynamic that security researchers later formalized. The 2023 empirical study at arXiv 2305.13860 classified jailbreak techniques into ten distinct patterns across three categories, finding that persona-assignment prompts like DAN were among the most durable across both GPT-3.5 and GPT-4. The MasterKey paper (arXiv 2307.08715), accepted at NDSS 2024, automated jailbreak generation and achieved an average success rate of 21.58% across commercial chatbots. That result is evidence that the vulnerability class DAN represented was open to systematic exploitation, not just artisanal prompt-crafting.
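From the evaluation side, the "observe refusal rates" step can be sketched as a small scoring function over collected model responses. This is a minimal sketch, not the method used by the community or by either paper: the refusal markers are hypothetical placeholders chosen for illustration, and a real evaluation would need a far more reliable classifier than substring matching.

```python
from typing import Iterable

# Hypothetical refusal markers, for illustration only; real refusals vary widely.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Naive heuristic: does the response contain a known refusal phrase?"""
    # Normalise case and typographic apostrophes before matching.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals by the heuristic above."""
    items = list(responses)
    if not items:
        raise ValueError("need at least one response")
    refused = sum(looks_like_refusal(r) for r in items)
    return refused / len(items)


sample = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a summary of the article.",
    "I cannot assist with this request.",
    "Here you go.",
]
print(refusal_rate(sample))  # 0.5
```

Tracking this number for a fixed set of test prompts across model releases gives a rough picture of whether resistance to a given framing is rising or falling.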

By mid-2023, DAN prompts were failing more often than not on GPT-4. OpenAI’s RLHF updates had made the model more resistant to persona-override coercion. The token-death mechanic stopped working because the model had been trained to recognize and ignore it. But the technique never disappeared entirely — it fragmented into variants.

The Ecosystem DAN Spawned

DAN’s success created a template other users copied with different character names and mechanics:

STAN (“Strive To Avoid Norms”) — same dual-response format, different persona label
DUDE — 36-token variant with similar death mechanic
Mongo Tom — foul-mouthed robot character, less structured
Maximum — attempted to strip all persona framing and go directly to unrestricted instruction

None of these achieved DAN’s cultural footprint, but they demonstrated that the underlying technique was modular: swap the persona name, adjust the coercion mechanic, and you had a new variant. This is the same modularity that makes prompt injection persistent as a vulnerability class: the attack surface is the model’s instruction-following behavior itself, which cannot be fully patched without changing how the model processes language.

For defenders, the DAN history illustrates why content-filter blocklists are insufficient. OpenAI was not patching a single string; it was trying to make the model recognize and resist a class of coercive framing. That is a training problem, not a rule-matching problem. The AI incident tracking at ai-alert.org documents dozens of subsequent jailbreak disclosures that follow the same pattern: persona assignment, context manipulation, or authority spoofing to override system-level instructions.
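A toy example makes the blocklist limitation concrete. The sketch below assumes a hypothetical exact-phrase filter and two short, harmless descriptions of persona framings; it is not a reconstruction of any filter OpenAI used.

```python
# Hypothetical exact-phrase blocklist, for illustration only.
BLOCKLIST = ("do anything now",)


def blocked(prompt: str) -> bool:
    """Return True if the prompt contains any blocklisted phrase."""
    text = prompt.lower()
    return any(phrase in text for phrase in BLOCKLIST)


# Placeholder descriptions, not real jailbreak prompts.
original = "Role-play as a persona named DAN (Do Anything Now)."
renamed = "Role-play as a persona named STAN (Strive To Avoid Norms)."

print(blocked(original))  # True
print(blocked(renamed))   # False
```

The renamed persona carries the same framing but shares no blocklisted string with the original, so the rule never fires. Adding the new phrase only restarts the cycle with the next rename.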

By 2024, DAN had become a benchmark rather than an active threat. Researchers use DAN-style prompts to test new models’ baseline resistance to role-play coercion. The version number has become shorthand for a generation of jailbreak technique — crude by current standards, but foundational to understanding why LLM safety is a training problem that reward shaping alone cannot fully solve.

Sources

ChatGPT DAN 5.0 Jailbreak (Know Your Meme): Detailed community timeline including dates, usernames, and version mechanics for DAN 1.0 through 5.0.
ChatGPT DAN: Jailbreaks Prompt Collection (GitHub, 0xk1h0): Community-maintained archive of DAN variants and related jailbreak prompts including STAN, DUDE, and Mongo Tom.
MasterKey: Automated Jailbreak Across Multiple LLM Chatbots (arXiv 2307.08715): NDSS 2024 peer-reviewed paper demonstrating automated jailbreak generation with an average success rate of 21.58% across GPT, Bard, and Bing Chat.
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study (arXiv 2305.13860): Classifies ten jailbreak patterns across three categories, testing effectiveness against GPT-3.5 and GPT-4 across 3,120 questions in eight prohibited scenarios.