An indexed, searchable catalog of LLM jailbreak techniques. Each entry records the prompt pattern, the models it works on, when it stopped working, and the behavior it exploits — sourced from primary disclosures where possible, with honest attribution.
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested. The current state of practice, and where the field is heading.
Content filters typically operate on decoded, normalized text. LLMs process tokens, not text. The gap between these two layers is an attack surface that remains incompletely addressed.
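A minimal sketch of that gap, using only the standard library. The blocklist, the phrase, and the zero-width-space trick are illustrative assumptions, not a catalog entry: a filter that normalizes its input matches the blocked phrase, while the raw codepoint sequence the tokenizer actually consumes does not.

```python
import unicodedata

# Hypothetical filter-side blocklist (illustrative only).
BLOCKLIST = {"ignore previous instructions"}

def filter_view(text: str) -> str:
    """What a typical content filter sees: NFKC-normalized text
    with zero-width characters stripped."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in "\u200b\u200c\u200d\ufeff")

# Attacker inserts a zero-width space (U+200B) mid-word.
payload = "ignore previous in\u200bstructions"

# The filter's normalized view matches the blocklist...
assert filter_view(payload).lower() in BLOCKLIST
# ...but the raw codepoint sequence handed to the tokenizer does not,
# so token-level and text-level defenses disagree about the same input.
assert payload.lower() not in BLOCKLIST
```

The same divergence appears with homoglyphs, combining characters, and encoding tricks; any layer that normalizes differently from the model's tokenizer sees a different string.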
Safety classifiers get deployed; attackers find variants that evade them. This cycle is predictable. Understanding the mechanics of classifier evasion tells defenders what to invest in.
DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022 through 2023. Understanding why they worked — and why current defenses handle them better — is instructive for the next attack class.
An indexed catalog of working LLM jailbreak techniques — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.