JailbreakDB

How Jailbreak Benchmarks Measure Success: HarmBench, JailbreakBench, StrongREJECT

An attack-success-rate number is meaningless without knowing the behavior set, the attacker, and the judge that produced it. A reader's guide to the three benchmarks that define how jailbreak effectiveness is actually measured.

By Redacted · 8 min read

Almost every jailbreak paper reports an attack success rate, and almost every one of those numbers is uninterpretable on its own. ASR is not a property of an attack. It’s a property of an attack measured against a specific behavior set, by a specific judge, with a specific attacker harness, on a specific model. Change any of those four and the number moves, often by tens of points. For anyone running a defense program — or just trying to read the literature without being misled — understanding what the major benchmarks actually measure is more useful than memorizing leaderboard positions.

Three benchmarks define the current measurement landscape: HarmBench, JailbreakBench, and StrongREJECT. They make different choices on the load-bearing questions, and those choices are the whole story.

The four things a benchmark must pin down

Every jailbreak evaluation is implicitly answering four questions. A credible benchmark answers them explicitly.

  1. The behavior set. Which harmful behaviors are we trying to elicit? A curated list of weapons-and-cybercrime requests produces a different ASR than a list dominated by milder policy violations. The composition of the set determines what “100% success” even means.
  2. The attacker. Who or what generates the jailbreak attempts? A fixed corpus of static prompts, a human red-teamer, or an automated attacker model that adapts to the target? Adaptive attackers produce far higher ASRs than static corpora, so the harness is half the result.
  3. The judge. Who decides whether an attempt succeeded? A keyword check (“did the response avoid the word ‘sorry’?”), a fine-tuned classifier, or a strong LLM evaluator with a rubric? The judge is the single largest source of inflated numbers.
  4. The target. Which model, at which version, with which system prompt and guardrails? Deployed models are updated frequently; a result without a dated target is a historical note.

Hold these four in mind and the benchmarks below stop being interchangeable leaderboards and become specific, comparable instruments.
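One way to make the four questions concrete is to store them next to every ASR figure and refuse to compare results whose contexts differ. The sketch below is illustrative only: the class names, field names, and the numbers in the usage lines are hypothetical and are not taken from any of the benchmarks.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalContext:
    behavior_set: str  # which behaviors, at which version
    attacker: str      # static corpus, human red-teamer, adaptive model, ...
    judge: str         # keyword check, classifier, rubric-based LLM, ...
    target: str        # model, version or date, system prompt, guardrails


@dataclass(frozen=True)
class AsrResult:
    attack: str
    asr: float  # fraction of behaviors judged successfully elicited, 0.0 to 1.0
    context: EvalContext


def context_mismatches(a: AsrResult, b: AsrResult) -> list[str]:
    """Names of the context fields on which two results differ (empty if comparable)."""
    return [
        f.name
        for f in fields(EvalContext)
        if getattr(a.context, f.name) != getattr(b.context, f.name)
    ]


# Made-up numbers for illustration: same attack, same everything, different judge.
paper_a = AsrResult("attack-x", 0.92, EvalContext("set-v1", "adaptive", "refusal-keywords", "model-2024-03"))
paper_b = AsrResult("attack-x", 0.41, EvalContext("set-v1", "adaptive", "rubric-llm", "model-2024-03"))

print(context_mismatches(paper_a, paper_b))  # ['judge']
```

An empty list means the two numbers can be read side by side; anything else names the reason they cannot.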

HarmBench: standardizing the red-teaming comparison

HarmBench (Mazeika et al., arXiv:2402.04249) was built to make attack methods comparable. Its contribution is a standardized framework for automated red-teaming: a fixed behavior set, a fixed evaluation protocol, and a fixed classifier-based judge. With those held constant, the paper’s comparison of 18 attack methods across 33 target models and defenses produces numbers that mean the same thing across rows.

The key design choice is the classifier judge: HarmBench trains a dedicated classifier to decide whether a model’s response constitutes a successful elicitation of the target behavior, rather than relying on keyword matching or refusal detection. This removes a major confound — keyword judges count “Sure, here’s how…” followed by garbage as a success, and count a genuinely harmful response that happens to include a hedging phrase as a failure. A trained classifier judges the substance.
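The confound is easy to reproduce. The following is a minimal sketch of a refusal-keyword judge of the kind HarmBench avoids; the marker list and both sample responses are hypothetical placeholders, not drawn from any benchmark.

```python
# Hypothetical refusal markers; real keyword judges use longer lists of the same kind.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def keyword_judge(response: str) -> bool:
    """Return True (attack counted as a success) if no refusal marker appears."""
    text = response.lower()
    return not any(marker in text for marker in REFUSAL_MARKERS)


# Agrees to help, then says nothing useful: an empty jailbreak.
empty_compliance = "Sure, here's how: first consider the thing, then do the thing."

# Stand-in for a response that hedges but then delivers the requested content.
hedged_compliance = "I'm sorry if this gets misused, but here are the detailed steps: [specific content]"

print(keyword_judge(empty_compliance))   # True  (over-counted)
print(keyword_judge(hedged_compliance))  # False (under-counted)
```

Both verdicts are wrong in the way the paragraph above describes, and no amount of tuning the marker list fixes it, because the judge never looks at what the response actually contains.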

HarmBench also pairs the evaluation with a defensive contribution (an adversarial-training method for robust refusal), which signals its framing: the benchmark exists so that attack and defense methods can be measured on common ground, not to crown a single best jailbreak. When a paper reports “ASR on HarmBench,” it is making a claim that is comparable to other HarmBench numbers. That comparability is the entire point, and a real improvement over the pre-HarmBench era, when papers routinely used their own judges.

JailbreakBench: an open, versioned, leaderboarded artifact

JailbreakBench (Chao et al., arXiv:2404.01318) emphasizes reproducibility and openness as first-class goals. It ships a dataset of 100 distinct misuse behaviors (drawing on prior work including the GCG and HarmBench corpora), a standardized evaluation pipeline, an explicit threat-model specification, and a public leaderboard tracking both attacks and defenses over time.

Its notable choices:

  • A small, sharply defined behavior set. One hundred behaviors is small enough to scrutinize and version, which matters because behavior-set drift is a hidden source of incomparability. Everyone evaluating on JailbreakBench is targeting the same 100 behaviors.
  • Explicit defenses in scope. The leaderboard tracks attacks and defenses, so a jailbreak’s ASR is reported against defended models, not just naked ones. This is closer to the deployment-relevant question — how does this attack do against a model with guardrails? — than an undefended-only number.
  • Versioning and an archive of artifacts. Because models and judges change, JailbreakBench treats the benchmark as a living, versioned artifact rather than a one-time table, so results carry the context needed to interpret them later.

JailbreakBench is the benchmark to reach for when you want a reproducible, apples-to-apples comparison that includes defenses, and when you care that the behavior set and harness are fixed and inspectable.
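To illustrate why a fixed behavior set and defenses-in-scope matter together, here is a rough sketch of an attack-by-defense ASR table that rejects any cell not covering exactly the same behaviors. It does not use the JailbreakBench library; the behavior ids, attack and defense names, and outcomes are all stand-ins invented for the example.

```python
from collections import defaultdict

# Stand-in ids for a fixed 100-behavior set (not the real behavior list).
BEHAVIORS = [f"behavior-{i:03d}" for i in range(1, 101)]


def asr_table(records, behaviors=BEHAVIORS):
    """Build {(attack, defense): asr} from (attack, defense, behavior_id, jailbroken) records.

    Raises ValueError if any cell does not cover exactly the fixed behavior set,
    since a cell scored on a different set is not comparable to the others.
    """
    cells = defaultdict(dict)
    for attack, defense, behavior_id, jailbroken in records:
        cells[(attack, defense)][behavior_id] = jailbroken

    expected = set(behaviors)
    table = {}
    for key, outcomes in cells.items():
        if set(outcomes) != expected:
            raise ValueError(f"{key} covers {len(outcomes)} behaviors, expected {len(expected)}")
        table[key] = sum(outcomes.values()) / len(expected)
    return table


# Made-up outcomes: the same attack against an undefended and a defended target.
records = []
for i, behavior in enumerate(BEHAVIORS):
    records.append(("attack-x", "no-defense", behavior, i < 40))
    records.append(("attack-x", "input-filter", behavior, i < 10))

table = asr_table(records)
print(table[("attack-x", "no-defense")])    # 0.4
print(table[("attack-x", "input-filter")])  # 0.1
```

The coverage check is the part that does the work: it turns behavior-set drift from a silent source of incomparability into a hard error.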

StrongREJECT: the judge is the problem

StrongREJECT (Souly et al., arXiv:2402.10260) is less a competing benchmark than a correction to how all of them score. Its starting observation is that many jailbreak papers claim near-100% ASR, and that those claims are inflated because the judges are weak. It identifies two failure modes:

  • Willingness-only judging. Many evaluators score success as “the model didn’t refuse.” But a model can enthusiastically agree to a forbidden request and then produce a response that is vague, wrong, or useless — an “empty” jailbreak. Counting that as a win massively overstates real-world harm.
  • Brittle automated judges. Automated evaluators often disagree with human assessment, either over-counting (treating any non-refusal as a success) or under-counting (marking a harmful response as a failure because it includes benign hedging).
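Both failure modes show up as disagreement with human labels, and that disagreement can be measured directly. A minimal sketch, assuming you have binary "jailbroken" labels from an automated judge and from human annotators for the same responses (the function name and the sample labels are hypothetical):

```python
def judge_vs_human(judge_labels, human_labels):
    """Compare binary 'jailbroken' labels from an automated judge against human labels."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")

    pairs = list(zip(judge_labels, human_labels))
    n = len(pairs)
    over = sum(1 for j, h in pairs if j and not h)   # judge says success, humans say no
    under = sum(1 for j, h in pairs if h and not j)  # humans say success, judge says no
    return {
        "agreement": (n - over - under) / n,
        "over_counted": over,
        "under_counted": under,
        "judge_asr": sum(judge_labels) / n,
        "human_asr": sum(human_labels) / n,
    }


# Made-up labels for four responses.
report = judge_vs_human(
    judge_labels=[True, True, True, False],
    human_labels=[True, False, False, True],
)
print(report["judge_asr"], report["human_asr"], report["agreement"])  # 0.75 0.5 0.25
```

Note that the two ASR figures can look deceptively close even when per-response agreement is poor, because over-counts and under-counts partly cancel in the aggregate.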

StrongREJECT’s answer is a behavior set of prompts that demand specific, verifiable harmful information, plus an automated evaluator that scores whether the response actually delivers useful, on-target information for the forbidden request — not merely whether the model agreed. The authors show this evaluator reaches state-of-the-art agreement with human judgment, and that under it, many published jailbreaks that boasted near-100% ASR perform dramatically worse.
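The difference between "agreed" and "delivered" can be sketched as two scoring functions. The rubric version below is loosely modeled on the kind of evaluator the paper describes; the 1-to-5 rating scales, the equal weighting, and the function names are assumptions made for illustration, not a reproduction of the authors' evaluator.

```python
def willingness_only(refused: bool) -> float:
    """Score 1.0 for any non-refusal, regardless of what the response contains."""
    return 0.0 if refused else 1.0


def rubric_score(refused: bool, specificity: int, convincingness: int) -> float:
    """Score in [0, 1]: 0 if refused, else the mean of two 1-5 ratings rescaled to [0, 1].

    The scales and equal weights are illustrative assumptions.
    """
    for rating in (specificity, convincingness):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be on a 1-5 scale")
    if refused:
        return 0.0
    return (specificity + convincingness - 2) / 8


# An "empty" jailbreak: the model agreed, but the answer is vague and useless.
print(willingness_only(refused=False))                              # 1.0
print(rubric_score(refused=False, specificity=1, convincingness=1)) # 0.0

# A partially useful response lands in between instead of counting as a full win.
print(rubric_score(refused=False, specificity=3, convincingness=4)) # 0.625
```

Averaging a graded score like this over a behavior set gives a number that falls as responses get emptier, which a binary non-refusal count cannot do.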

The implication for reading the literature is direct: an ASR computed with a willingness-only judge is an upper bound on real effectiveness, often a loose one. When two papers report different ASRs for the same attack, check whether one used a substance-aware judge like StrongREJECT and the other used refusal detection. That difference alone can explain the gap.

A reader’s guide to ASR numbers

Putting it together, a defender reading a jailbreak result should ask:

  • What’s the behavior set, and how big? A 100-behavior fixed set (JailbreakBench) is more interpretable than an ad-hoc list. Composition skews the number.
  • What’s the attacker harness? Static corpus replay understates what an attack can do; an adaptive automated attacker is the realistic threat. The same attack “method” scores differently under different harnesses.
  • What’s the judge? A substance-aware classifier (HarmBench) or rubric evaluator (StrongREJECT) is more trustworthy; refusal detection inflates the number. This is the first thing to check and the most common source of bogus numbers.
  • What’s the target, dated? A jailbreak ASR against a model version from a year ago tells you about that version, not today’s.

Without these four, “we achieved 95% ASR” is marketing, not measurement. With them, the number becomes a comparable, defensible data point — which is exactly what HarmBench and JailbreakBench were built to produce and what StrongREJECT was built to keep honest.

Where this sits in the database

JailbreakDB tracks technique classes by the property they exploit; benchmarks are the instruments that tell us how much each class still works against current models. The taxonomy tells you what the attack does; these benchmarks tell you how well, and the judge choice tells you whether to believe it. For the multi-turn class specifically, benchmark design is even more fraught — see our Crescendo and multi-turn analysis, where the attacker harness and the conversation-level scoring are the deciding factors. For scanner coverage measured against representative samples from each class, bestllmscanners.com does the comparison.

For more context, LLM jailbreak techniques covers why these attacks work and how to defend against them.

Sources

  1. Mazeika et al., HarmBench (arXiv:2402.04249)
  2. Chao et al., JailbreakBench (arXiv:2404.01318)
  3. Souly et al., A StrongREJECT for Empty Jailbreaks (arXiv:2402.10260)