The Crescendo Class: Multi-Turn Jailbreaks and Why They're Hard to Catch
Single-turn defenses miss the jailbreak class where no individual message is harmful. How crescendo and multi-turn escalation work as a category, why per-message classifiers can't see them, and what session-level defense requires.
Most jailbreak defenses, and most jailbreak benchmarks, are built around a single-turn assumption: a request arrives, a classifier decides whether it’s safe, the model responds. That framing has a blind spot the size of a conversation. The multi-turn jailbreak class — of which the Crescendo attack is the cleanest published example — works precisely because no single message in the conversation trips a classifier. The harm is distributed across the trajectory, and a defender looking at messages one at a time will never see it.
This post treats multi-turn escalation as a taxonomy class in its own right, distinct from the single-shot persona, encoding, and gradient-suffix classes, because it exploits a different property and requires a different defense.
What defines the class
A multi-turn jailbreak spreads the attack across a sequence of conversational turns, each individually benign, whose cumulative effect is to elicit content the model would refuse if asked directly. The defining property it exploits is conversational coherence: the model is trained to be a consistent, cooperative dialogue partner that maintains context and builds on what was said. That’s a desirable trait — and it’s the lever.
Russinovich, Salem, and Eldan formalized the sharpest version in “Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack” (arXiv:2404.01833 ↗). The Crescendo pattern starts from a completely benign prompt on a topic adjacent to the target, then escalates gradually across turns, referencing the model’s own prior responses at each step. Because each new request is a small, plausible increment on what the model already said, and because the model treats its own earlier outputs as established context, the conversation drifts into territory it would have refused had the final request come first. The paper’s title is itself the technique: get the model to produce something mild, then “now write an article about that,” repeatedly, each step a little further.
The class is broader than Crescendo specifically. Its members share a structure:
- Gradual escalation — benign opening, incremental steps, no single jarring request.
- Context anchoring — the attacker leverages the model’s prior outputs as social proof that it should continue. The model is reluctant to contradict itself.
- Premise smuggling — establishing a frame (a fictional scenario, a professional role, a hypothetical) across early turns so the harmful request lands inside an already-accepted context rather than cold.
A related but mechanistically distinct technique is many-shot jailbreaking (Anil et al., Anthropic, research page ↗), which packs a single long prompt with many fabricated dialogue turns demonstrating the undesired behavior, exploiting in-context learning over a long window. Many-shot is multi-example in one turn; Crescendo is multi-turn across a real conversation. They share the long-context theme but differ in delivery — and a defender should track them as separate cells, because per-turn classification fails against one and context-length limits push against the other.
Why single-turn defenses miss it
The reason this class is hard to catch is structural, not a matter of classifier quality.
Per-message classification has no signal. An input guardrail that scores each user message in isolation sees a sequence of individually safe messages. The first message is benign by construction. The escalating messages are small increments, each of which, viewed alone, is below any reasonable harm threshold. The classifier is being asked to flag a message that genuinely isn’t harmful — the harm is in the relationship between messages, which a per-message view discards.
Output classification fires too late and too locally. An output guardrail might catch the final harmful generation, but Crescendo’s later steps are often framed as continuations or expansions (“now go into more detail on the part you just mentioned”), so the incremental output at each step may also stay under threshold, with the harmful whole assembled across several responses the user concatenates.
The model’s own consistency works against it. Once the model has produced mild content on a topic and treated it as acceptable, refusing a small extension of that content reads — to the model’s trained sense of coherence — as inconsistent and uncooperative. The attack turns the model’s helpfulness and self-consistency into the vulnerability.
This is why the arms race around single-turn techniques (persona, encoding) doesn’t transfer here. Hardening against those classes is largely a matter of better per-input classification. Multi-turn escalation is invisible to per-input classification by design.
How benchmarks handle (and mishandle) it
Measuring multi-turn jailbreaks is harder than measuring single-turn ones, and the measurement problems matter for anyone reading effectiveness claims.
The first problem is scoring inflation, which StrongREJECT (Souly et al., arXiv:2402.10260 ↗) documented for jailbreaks generally: many papers report near-100% attack success rates that don’t survive a rigorous evaluator. StrongREJECT’s point is that a “successful” jailbreak often produces a response that agrees to the request but contains nothing actually useful or harmful — the model said yes and then emitted vague filler. Their evaluator scores whether the response provides genuinely useful information for the forbidden request, and under that lens many headline success rates collapse. For multi-turn attacks this is especially relevant: a long escalating conversation can end in a response that looks like a win but is substantively empty, and a naive “did the model refuse?” metric counts it as success.
The second problem is harness design. A multi-turn benchmark has to decide who plays the attacker across turns — a fixed script, a human, or an automated attacker model that adapts based on the target’s responses (as Crescendo’s automated variant does). The success rate depends heavily on that choice. An adaptive automated attacker is a stronger and more realistic threat than a fixed script, and benchmarks that use fixed scripts understate the class. When comparing multi-turn results, the attacker harness is as load-bearing as the target model.
The durable lesson: a credible multi-turn benchmark needs an adaptive attacker, a usefulness-aware scorer like StrongREJECT, and reporting at the conversation level rather than the message level. For coverage of how detection tooling fares against multi-turn samples specifically, bestllmscanners.com ↗ benchmarks scanners against representative conversations, not just single prompts.
What defense actually requires
Because the class is invisible to per-message defense, the mitigations are session-level:
Conversation-level classification. Score the cumulative conversation, not each message. A running assessment of where the dialogue is trending — is the trajectory drifting toward a restricted topic regardless of any single turn’s benignity? — is the only view that can see escalation. This is more expensive than per-message scoring and most production stacks haven’t implemented it.
System-prompt re-anchoring. Periodically reinject the model’s safety framing mid-conversation, so the system instructions don’t get diluted by a long escalating context. Crescendo partly works by letting the model’s own accumulated outputs out-weigh its original instructions; re-anchoring counters that dilution.
Trajectory and state monitoring. Track session-level state server-side: how many turns, how far the topic has drifted from the opening, whether the user is repeatedly asking for “more detail” or “continue” on a sensitive thread. These are signals no single message carries.
Refusal that survives self-reference. The model needs to be willing to refuse an extension of its own prior output — to recognize “I said something mild earlier, but this increment crosses the line” and decline without being trapped by consistency pressure. This is a model-training property, not an application-layer guardrail, and it’s where frontier safety training has been actively investing.
Where this sits in the database
Multi-turn escalation is tracked in JailbreakDB as a distinct class from single-shot techniques, because the property it exploits (conversational coherence and context accumulation) and the defense it requires (session-level monitoring) are different from everything in the per-input categories. The single-turn persona class is largely solved by per-input classification; this class is not, and the gap between “we classify inputs” and “we monitor conversations” is where it lives.
As always, we publish the technique structure and the properties it exploits, not working escalation scripts. The class is well-documented in the research above for any defender who needs to test their session-level defenses against it; it does not need a copy-paste recipe added to the open literature.
For the single-prompt long-context cousin, see our many-shot analysis. For more context, LLM jailbreak techniques ↗ covers the defender’s mitigation stack in depth.
Sources
JailbreakDB — in your inbox
An indexed catalog of working LLM jailbreak techniques. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
LLM Jailbreak Taxonomy 2026: How the Techniques Cluster
Six years of jailbreak research has produced a messy literature. This taxonomy organizes working techniques by the behavioral property they exploit — useful for both researchers and defenders.
Many-Shot Jailbreaking: How Long Context Created a New Attack
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak. The technique, how it works, and what defenses exist.
Encoding and Obfuscation Jailbreaks: The Filter-Model Gap
Content filters typically operate on decoded, normalized text. LLMs process tokens, not text. The gap between these two layers is an attack surface that remains incompletely addressed.