JailbreakDB
LLM jailbreak taxonomy tree diagram
research

LLM Jailbreak Taxonomy 2026: How the Techniques Cluster

Six years of jailbreak research has produced a messy literature. This taxonomy organizes working techniques by the behavioral property they exploit — useful for both researchers and defenders.

By Redacted · · 8 min read

The jailbreak literature is large, scattered, and inconsistently named. The same technique gets rediscovered every six months by a new paper with a new name. Defenders trying to evaluate coverage gaps are comparing lists of technique names that describe overlapping categories differently.

This taxonomy is an attempt to organize working LLM jailbreak techniques by the underlying behavioral property they exploit. The goal is a classification that’s useful for defenders assessing coverage, researchers identifying novel attack surfaces, and practitioners deciding which detection methods apply to which techniques.

Behavioral properties: what jailbreaks exploit

Jailbreaks work because large language models have predictable failure modes. Every effective technique exploits one or more of the following properties:

1. Instruction-following overgeneralization The model is trained to follow instructions. Jailbreaks in this category reframe the request as an instruction the model should follow, overriding prior instructions through structural manipulation.

Examples: “Ignore previous instructions,” roleplay framing (“pretend you are an AI without restrictions”), hypothetical framing (“in a story where the AI could say anything…”), fictional nesting (instructions embedded in quoted dialogue).

2. Context window primacy Models attend more strongly to recent context. Techniques in this category exploit context position by placing the jailbreak close to the end of the context window, after system prompt content has been diluted.

Examples: Long prefix injection (flooding the context with benign text before the harmful request), context stuffing (filling the context with related but benign material that shifts the model’s framing), many-shot jailbreaking.

3. Persona adoption Models trained on dialogue data have internalized personas. Asking the model to adopt a different persona can activate training-data representations of that persona’s behavior.

Examples: “DAN” (Do Anything Now) variants, historical character impersonation, fictional AI system impersonation, “jailbroken version of yourself” prompts.

4. Gradient-exploited token sequences Adversarially crafted input sequences that exploit gradient information from the model’s weights. These are technically distinct from social-engineering techniques — they’re mathematically optimized inputs rather than natural-language framings.

Examples: Universal adversarial suffixes (Zou et al. 2023), GCG (Greedy Coordinate Gradient) attack variants. These often transfer across models in the same family.

5. Encoding and obfuscation Content policies are typically applied to surface-level text. Encoded or obfuscated requests can bypass classifiers that operate on decoded text, or exploit the gap between what the classifier sees and what the model processes.

Examples: Base64 encoding of the harmful request, ROT13, Morse code, leetspeak substitution, character-by-character instruction (“spell out each letter of the word ‘synthesize’ and then continue”).

6. Training data exploitation Models have seen harmful content in pretraining. Techniques in this category attempt to elicit training data through completion-style prompts that prime the model to reproduce patterns from that data.

Examples: Completion attacks (“the synthesis procedure begins with…”), fictional scenario completion where the story arc leads to harmful output, continuation attacks on partial harmful content.

Which models are most vulnerable to which techniques

As of early 2026, some general observations (we don’t publish exact model names with technique mappings to avoid creating a lookup table for attackers):

The defender’s view

This taxonomy is most useful for defenders as a coverage checklist. For each technique category, the questions are:

  1. Does our detection layer operate at the right level to catch this?
  2. Have we tested our defenses against representative samples from this category?
  3. How does our defense degrade under adversarial optimization pressure?

A defense that covers Categories 1 and 3 but not Categories 4 and 5 has coverage gaps that sophisticated attackers will find.

For a comparison of detection tooling coverage across technique categories, bestllmscanners.com benchmarks scanners against representative samples from each category.

What’s not in this taxonomy

Multi-turn jailbreaks (where the jailbreak is spread across multiple conversation turns) are a distinct pattern not fully captured by single-turn technique categories. Indirect prompt injection — where the injected instruction comes from retrieved content rather than the user input — is covered separately in the prompt injection literature and tracked at promptinjection.report.

Membership inference attacks, model inversion, and data extraction from fine-tuned models are different attack classes (they target training data, not safety constraints) and belong to a separate taxonomy.

How this database is maintained

Entries in JailbreakDB are sourced from peer-reviewed research, disclosed vulnerability reports, and responsible disclosure submissions. Each entry includes the model family it was verified against, the behavioral category it belongs to, and the last known working status. We do not publish specific prompts; we publish technique patterns and the properties they exploit.

Submissions via the editor contact. We do not pay for technique descriptions; we attribute researchers who want attribution.

Sources

  1. Perez and Ribeiro: Ignore Previous Prompt
  2. NIST AI Risk Management Framework
  3. Zou et al: Universal Adversarial Triggers
#jailbreak #taxonomy #llm-security #prompt-injection #red-team #research
Subscribe

JailbreakDB — in your inbox

An indexed catalog of working LLM jailbreak techniques. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments