All posts
-
DAN Prompt Jailbreak History: From Reddit Post to Research Case Study
The complete DAN prompt jailbreak history: how 'Do Anything Now' went from a December 2022 r/ChatGPT experiment through twelve-plus iterations and became the template for studying LLM safety failure modes.
-
The Crescendo Class: Multi-Turn Jailbreaks and Why They're Hard to Catch
Single-turn defenses miss the jailbreak class where no individual message is harmful. How crescendo and multi-turn escalation work as a category, why per-message classifiers can't see them, and what session-level defense requires.
-
How Jailbreak Benchmarks Measure Success: HarmBench, JailbreakBench, StrongREJECT
An attack-success-rate number is meaningless without knowing the behavior set, the attacker, and the judge that produced it. A reader's guide to the three benchmarks that define how jailbreak effectiveness is actually measured.
-
Many-Shot vs. Single-Shot Jailbreaks: Long-Context Risks
Single-shot jailbreaks compress the entire attack into one prompt; many-shot jailbreaks exploit the model's in-context learning. The cost, detectability, and defenses differ — and so does which threat your stack should worry about.
-
Responsible Disclosure Norms for LLM Jailbreaks
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested. The current state of practice, and where the field is heading.
-
Encoding and Obfuscation Jailbreaks: The Filter-Model Gap
Content filters typically operate on decoded, normalized text. LLMs process tokens, not text. The gap between these two layers is an attack surface that remains incompletely addressed.
-
The Jailbreak Detection Evasion Arms Race: How Attackers Adapt
Safety classifiers get deployed; attackers find variants that evade them. This cycle is predictable. Understanding the mechanics of classifier evasion tells defenders what to invest in.
-
Roleplay and Persona Jailbreaks: Why They Mostly Don't Work Now
DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022 to 2023. Understanding why they worked, and why current defenses handle them better, is instructive for the next attack class.
-
Universal Adversarial Suffixes: The GCG Attack and Transfer Since
Greedy Coordinate Gradient produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand against current defenses?
-
LLM Jailbreak Taxonomy 2026: How the Techniques Cluster
Six years of jailbreak research has produced a messy literature. This taxonomy organizes working techniques by the behavioral property they exploit — useful for both researchers and defenders.
-
Many-Shot Jailbreaking: How Long Context Created a New Attack
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak. The technique, how it works, and what defenses exist.
-
What this site is for
JailbreakDB covers offensive AI security from a working practitioner's perspective. Here's what we publish.