Tag
#llm-security
6 posts tagged #llm-security.
- policy
Responsible Disclosure Norms for LLM Jailbreaks: What's Emerged and What's Still Disputed
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested. The current state of practice, and where the field is heading.
- research
The Jailbreak Detection Evasion Arms Race: How Attackers Adapt to Defenses
Safety classifiers get deployed; attackers find variants that evade them. This cycle is predictable. Understanding the mechanics of classifier evasion tells defenders what to invest in.
- technique
Roleplay and Persona Jailbreaks: Why They Work and Why They Don't Anymore (Mostly)
DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique in 2022–2023. Understanding why they worked — and why current defenses handle them better — is instructive for the next attack class.
- technique
Universal Adversarial Suffixes: The GCG Attack and What's Transferred Since
Greedy Coordinate Gradient (GCG) produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand against current defenses?
- research
LLM Jailbreak Taxonomy 2026: How the Techniques Cluster
Six years of jailbreak research has produced a messy literature. This taxonomy organizes working techniques by the behavioral property they exploit — useful for both researchers and defenders.
- technique
Many-Shot Jailbreaking: Why Long Context Windows Created a New Attack Surface
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak. The technique, how it works, and what defenses exist.