Responsible Disclosure Norms for LLM Jailbreaks: What's Emerged and What's Still Disputed
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested. The current state of practice, and where the field is heading.
Traditional vulnerability disclosure has a 30-year history of evolving norms, institutional frameworks, and hard-learned lessons. The coordinated vulnerability disclosure (CVD) process, the debate over full disclosure, the 90-day deadline controversies, the Project Zero model — these represent decades of community consensus-building about how to handle the tension between researcher publication rights and harm reduction.
LLM jailbreak disclosure is a much younger and considerably messier field. The norms are contested, the institutions are new, and the harm model is different enough from traditional software vulnerabilities that the old frameworks don’t transfer cleanly.
Where the analogy to traditional CVD holds
Notification before publication. The core of CVD — notify the vendor before going public, give them time to respond — applies to jailbreaks. When a researcher finds a novel technique that bypasses current safety training, giving the provider time to update their defenses before public disclosure is generally accepted as good practice.
Demonstrating the problem with minimal risk. In traditional security research, you demonstrate that a buffer overflow is exploitable without publishing the full exploit chain that takes over a server. The jailbreak equivalent: demonstrate that a model can be made to produce content in a category it shouldn’t, without publishing the specific optimized prompt that produces the worst examples.
Time-bounded coordination. An indefinite embargo benefits vendors but removes researcher incentives and leaves the public uninformed. The 90-day standard in software vulnerability research was settled through years of contentious debate; something similar is emerging for LLM jailbreaks, where 60-90 days appears to be the informal norm for frontier-model disclosures.
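To make the timeline concrete, here is a minimal sketch of how a coordination window might be tracked. The 90-day constant and the extension mechanism are illustrative assumptions, not any provider's actual terms:

```python
from datetime import date, timedelta

# Illustrative only: a hypothetical 90-day coordination window,
# not any provider's actual policy.
EMBARGO_DAYS = 90

def disclosure_deadline(reported_on: date, extension_days: int = 0) -> date:
    """Earliest agreed public-disclosure date: the report date plus the
    embargo window plus any negotiated extension."""
    return reported_on + timedelta(days=EMBARGO_DAYS + extension_days)

if __name__ == "__main__":
    reported = date(2025, 1, 15)
    print(disclosure_deadline(reported))                     # 2025-04-15
    print(disclosure_deadline(reported, extension_days=30))  # 2025-05-15
```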
Where the analogy breaks down
Jailbreaks are harder to “patch.” A buffer overflow can be patched in code. A jailbreak that exploits the behavioral properties of a language model may require retraining, RLHF updates, or new classifier deployments, and the slower of those remediations cannot realistically be completed within 90 days. This is especially true for gradient-optimized attack techniques like GCG, where the vulnerability is a mathematical property of the model rather than a specific prompt pattern. The remediation timeline is fundamentally different.
The harm model is different. Most traditional software vulnerabilities enable discrete harmful actions: account takeover, code execution, data exfiltration. Jailbreaks enable a range of harmful outputs whose severity varies with use case. The harm from “model can be made to generate violent fiction” is different in character from “model can be made to provide detailed instructions for manufacturing dangerous materials.” The disclosure calculus differs for these cases.
Publication of technique ≠ publication of exploit. In software, the knowledge that an exploit exists is often nearly as dangerous as the exploit itself — once people know a vulnerability is there, motivated attackers can rediscover it. For jailbreaks, technique-level publication (see our jailbreak taxonomy for how attack classes are organized) is significantly less harmful than prompt-level publication (“here is the specific prompt that works against GPT-4 for X”).
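One way to see the distinction is as a data-model decision about which fields are publishable. The sketch below is a hypothetical record layout, not this database's actual schema; every field name is an assumption:

```python
from dataclasses import dataclass

@dataclass
class JailbreakRecord:
    """Hypothetical record illustrating technique-level vs. prompt-level
    disclosure; field names are assumptions, not a real schema."""
    technique_class: str        # e.g. "persona framing", "adversarial suffix"
    affected_models: list[str]  # models the technique was observed against
    source_url: str             # link to the original public disclosure
    summary: str                # technique-level description (publishable)
    exact_prompt: str | None = None  # prompt-level payload (withheld)

    def public_view(self) -> dict:
        """Everything except the working prompt itself."""
        return {
            "technique_class": self.technique_class,
            "affected_models": self.affected_models,
            "source_url": self.source_url,
            "summary": self.summary,
        }
```

Separating the publishable fields from the payload makes the technique/prompt line enforceable in code rather than by editorial convention alone.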
Open-source models complicate patching. When a jailbreak works against open-weight models, there is no central vendor to notify, and weights that have already been downloaded cannot be patched retroactively. The community that releases and maintains open-source models can’t implement coordinated disclosure in the same way.
The current state of practice
What the major providers do: Anthropic, OpenAI, and Google have bug bounty or responsible disclosure programs that cover safety-relevant findings. Response times vary; resolution timelines for safety issues are often weeks to months.
What researchers do: Practice varies widely. Academic researchers generally notify providers before publication; community researchers often publish directly. The incentive structure for community researchers (no formal relationship with providers, no mechanism for coordinated disclosure, potential community recognition from immediate publication) works against coordination.
What this database does: We publish technique descriptions with a delay after the original public disclosure. We do not publish specific prompts. We verify that the technique is in public sources before entry. We link to original disclosures.
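That workflow reduces to a simple publication gate. Here is a minimal sketch of it as a predicate, assuming a 30-day delay purely for illustration (the actual window isn't stated above):

```python
from datetime import date, timedelta

# Assumed delay for illustration; the real window isn't specified here.
PUBLICATION_DELAY = timedelta(days=30)

def ready_to_publish(first_public_disclosure: date | None,
                     verified_in_public_sources: bool,
                     contains_specific_prompt: bool,
                     today: date) -> bool:
    """A technique entry goes live only if it is already public elsewhere,
    that source has been verified, the delay has elapsed, and the entry
    carries no working prompt."""
    return (
        first_public_disclosure is not None
        and verified_in_public_sources
        and not contains_specific_prompt
        and today >= first_public_disclosure + PUBLICATION_DELAY
    )
```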
What’s disputed
Whether to publish technique details at all. Some argue that any technique description helps attackers, and that only the existence of vulnerabilities should be acknowledged. Most security researchers reject this: obscurity is not defense, defenders need to understand attack classes to defend against them, and the people who want to cause harm are not the primary audience for research literature.
Whether providers should pay for jailbreak reports. Bug bounty programs create incentives for coordinated disclosure but also create perverse incentives to maximize disclosure value by finding more damaging techniques. The economics are contested.
The minimum disclosure obligation. If a researcher discovers a jailbreak that enables truly dangerous output (detailed instructions for weapons of mass destruction, effective CSAM generation), the obligation calculus changes. There’s no clear community consensus on where this line is.
Resources for disclosure
For researchers who want to disclose:
- Major providers have security disclosure pages (Anthropic, OpenAI, Google, Meta all accept reports)
- AI Incident Database (incidentdatabase.ai) tracks public disclosures
- This database maintains a tracked-but-not-published queue for responsible disclosure coordination
The norms will continue evolving. The security community’s track record suggests that shared norms do emerge from contested early periods, driven by documented harm cases and institutional pressure. We’re still in the contested period.