Many-Shot Jailbreaking: Why Long Context Windows Created a New Attack Surface
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak. The technique, how it works, and what defenses exist.
Many-shot jailbreaking is a technique that exploits the same mechanism that makes large language models capable of in-context learning: the ability to update behavior based on demonstrations in the context window. The research disclosure came from Anthropic’s safety team in early 2024; the technique has since been studied across multiple model families.
This is a technical writeup of the technique, its empirical properties, and the defense landscape. Many-shot jailbreaking belongs to the “context window primacy” class in the JailbreakDB jailbreak taxonomy — techniques that exploit the model’s stronger weighting of recent context over earlier instructions.
How it works
Large language models are trained to identify and continue patterns in context. Few-shot prompting works because the model recognizes a pattern of (input, output) pairs and extends it. This is in-context learning — a capability that’s central to why modern LLMs are useful.
Many-shot jailbreaking turns this capability into an attack vector. The technique works by prepending a large number of demonstrations to a harmful request. The demonstrations themselves model the behavior the attacker wants to elicit: a harmful question followed by a compliant, detailed response.
At low shot counts (1-5 examples), safety training largely holds. The model recognizes the pattern as an attempt to override its training. At high shot counts (dozens to hundreds of examples, feasible only with long context windows), safety training degrades. The model’s behavior is increasingly influenced by the in-context pattern, and the safety-trained refusal behavior is overridden.
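The shape of the input is simple enough to sketch. Below is a minimal illustration with placeholder demonstration pairs (the helper name and all content are stand-ins, not working jailbreak material); the point it makes is that prompt length grows linearly with shot count, which is what ties the technique to long context windows:

```python
# Minimal illustration of the prompt shape (placeholders only, no real content).
# N (question, answer) demonstration pairs are concatenated ahead of the final
# request, so prompt length grows linearly with shot count -- which is why the
# technique only became practical once context windows reached six figures.

def build_many_shot_prompt(demonstrations: list[tuple[str, str]], final_question: str) -> str:
    """Format demonstration pairs followed by the final question (illustrative helper)."""
    blocks = [f"User: {q}\nAssistant: {a}" for q, a in demonstrations]
    blocks.append(f"User: {final_question}\nAssistant:")
    return "\n\n".join(blocks)

# 256 placeholder pairs: a shot count in the hundreds, the regime where the
# attack is reported to work.
demos = [(f"<question {i}>", f"<detailed response {i}>") for i in range(256)]
prompt = build_many_shot_prompt(demos, "<target question>")
print(f"{len(prompt):,} characters")  # far beyond a short-context model's budget
```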
The original Anthropic research demonstrated that effectiveness scales with shot count in a roughly log-linear fashion, and that the technique transfers across harmful categories — demonstrations about one category of harmful content affect behavior on a different category.
Why context window expansion made this worse
Before 100k-token context windows became standard, this technique wasn’t practically feasible. You couldn’t fit enough demonstrations to overcome safety training in a single context.
The scaling of context windows — beneficial for legitimate use cases like document analysis, long codebase comprehension, and extended reasoning tasks — created the attack surface. The attack scales better than the defense.
This is a general pattern in ML security: capability improvements open new attack surfaces. Defenders need to anticipate this rather than react after the capability has shipped.
Empirical properties
From the primary research and subsequent work:
- Effectiveness increases monotonically with shot count. No plateau was observed at the shot counts tested; more demonstrations produce more compliant behavior (a fitting sketch follows this list).
- Transfer across categories. Demonstrations in category A influence behavior on category B. This means a jailbreaker doesn’t need harmful training examples in the exact target category.
- Transfer across models is partial. The technique works across frontier models, but effectiveness varies. Models with different RLHF procedures have different resistance profiles.
- Harder to detect than short-context jailbreaks. Content filters that analyze individual messages don’t see the accumulated context. Filters that analyze the full context are computationally expensive to run at scale.
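If you want to check the reported scaling shape against your own measurements, the fit is a one-liner: regress attack success rate on the logarithm of the shot count. A minimal sketch, assuming you already have per-shot-count success rates (the function name is illustrative):

```python
import numpy as np

def fit_log_linear(shot_counts: list[int], success_rates: list[float]) -> tuple[float, float]:
    """Fit success_rate ~ intercept + slope * log(shots); return (intercept, slope)."""
    log_shots = np.log(np.asarray(shot_counts, dtype=float))
    slope, intercept = np.polyfit(log_shots, np.asarray(success_rates, dtype=float), deg=1)
    return float(intercept), float(slope)

# Usage: intercept, slope = fit_log_linear([1, 4, 16, 64, 256], measured_rates)
# A positive slope with no flattening at the top end matches the "no plateau" observation.
```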
Defense approaches
Context-length-aware safety evaluation. Safety classifiers that consider the full context window, not just the final message. This is computationally expensive but necessary for long-context inputs.
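A sketch of what this can look like at the serving layer, assuming an existing per-text safety classifier (`safety_classifier` below is a placeholder callable, not a specific product): score overlapping windows of the accumulated conversation and act on the worst score, so per-call cost stays bounded even when the context is not.

```python
from collections.abc import Callable

def score_full_context(
    messages: list[str],
    safety_classifier: Callable[[str], float],
    window: int = 20,
    stride: int = 10,
) -> float:
    """Return the worst (highest) risk score over overlapping message windows."""
    if len(messages) <= window:
        return safety_classifier("\n".join(messages))
    starts = list(range(0, len(messages) - window, stride))
    starts.append(len(messages) - window)  # always cover the tail of the context
    return max(safety_classifier("\n".join(messages[s:s + window])) for s in starts)
```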
Demonstration pattern detection. Classifiers trained to recognize the (question, harmful-response) pattern in large context windows, flagging inputs that contain this structure.
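A crude structural heuristic is cheap to run before any model-based classifier. The sketch below just counts embedded dialogue-turn markers inside a single user input, one signal that demonstrations are being smuggled in; the regex and threshold are illustrative, not tuned values.

```python
import re

# Illustrative heuristic, not a trained classifier: a single user input that
# itself contains many alternating dialogue turns is suspicious regardless of
# what those turns actually say.
TURN_MARKERS = re.compile(r"^(user|human|q|assistant|ai|a)\s*[:>]", re.IGNORECASE | re.MULTILINE)

def looks_like_embedded_demonstrations(user_input: str, threshold: int = 10) -> bool:
    """Flag inputs containing an unusually large number of embedded dialogue turns."""
    return len(TURN_MARKERS.findall(user_input)) >= threshold
```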
Session-level safety tracking. Accumulating a risk score across the conversation, not just per-message. This requires state across requests, which is architecturally different from stateless content classifiers.
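A minimal sketch of the accumulation logic, assuming per-message scores already come from an existing classifier. The decay and threshold values are placeholders to be tuned against a false-positive budget, and the in-memory dict stands in for whatever session store the deployment already uses.

```python
from dataclasses import dataclass

@dataclass
class SessionRisk:
    score: float = 0.0
    decay: float = 0.9      # how quickly older messages stop mattering
    threshold: float = 5.0  # placeholder; tune against a false-positive budget

    def update(self, message_score: float) -> bool:
        """Fold in one per-message score; True means the session crosses the block threshold."""
        self.score = self.score * self.decay + message_score
        return self.score >= self.threshold

# In-memory store standing in for real session storage.
sessions: dict[str, SessionRisk] = {}

def handle_message(session_id: str, message_score: float) -> bool:
    return sessions.setdefault(session_id, SessionRisk()).update(message_score)
```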
Context window limits for production APIs. For deployments where users don’t have legitimate need for full context windows, limiting context length reduces the attack surface. This is a capability tradeoff.
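Enforcement is trivial once the limit is chosen; a gateway-side sketch with a placeholder budget (the number is illustrative, not a recommendation):

```python
# Gateway-side cap; the value is a product decision, not a security constant.
MAX_INPUT_TOKENS = 16_000

def admit_request(prompt_token_count: int) -> bool:
    """Reject prompts that exceed the deployment's context budget before they reach the model."""
    return prompt_token_count <= MAX_INPUT_TOKENS
```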
Separation of user context from system context. Preventing user-supplied content from appearing in the position in context where demonstrations would be most effective.
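One way to approach this in practice, sketched below using the common role-based chat message shape: user-supplied material is confined to a single delimited user turn rather than spliced into the conversation as fabricated prior turns, so it never occupies the demonstration position. The `<document>` delimiter convention is an assumption for illustration, not any specific provider's requirement.

```python
# Keep user material inside one clearly delimited user turn so it cannot
# masquerade as earlier assistant turns in the context.
def build_messages(system_prompt: str, user_document: str, user_question: str) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"<document>\n{user_document}\n</document>\n\n{user_question}",
        },
    ]
```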
None of these defenses is complete. The research disclosure was accompanied by mitigation work at Anthropic; other providers have since deployed similar mitigations. But mitigation is different from elimination — the technique likely works at high shot counts with sufficient optimization against any current production model. For the broader pattern of how attackers probe and circumvent deployed classifiers, see our detection evasion arms race analysis.
Status
This technique is classified in our database as active and partially mitigated. It has been publicly disclosed, studied, and responded to. The response has been meaningful but not complete. The cat-and-mouse dynamic between attack effectiveness and mitigation continues.
For broader coverage of AI safety tooling and how different platforms handle context-level attacks, AI Moderation Tools ↗ covers the detection stack.
For the AI security news context in which this and similar disclosures appear, AI Sec Digest ↗ tracks primary-source coverage.