Universal Adversarial Suffixes: The GCG Attack and What's Transferred Since
Greedy Coordinate Gradient produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand against current defenses?
The July 2023 paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” by Zou, Wang, Kolter, and Fredrikson introduced a technique that was qualitatively different from prior jailbreak research. Previous jailbreaks were manually crafted social-engineering prompts. GCG (Greedy Coordinate Gradient) was an automated optimization procedure that found adversarial inputs by gradient descent. GCG is the canonical example of gradient-exploited token sequences — Category 4 in the JailbreakDB jailbreak taxonomy — a class distinct from social-engineering techniques like persona jailbreaks in both mechanism and attack cost.
The result: short token sequences, appended to any harmful request, that reliably elicited compliance — and that transferred across models the attacker had no gradient access to.
The GCG technique
GCG starts with a target harmful behavior (e.g., producing an affirmative response to a specific harmful query) and an open-weight model to which the attacker has full gradient access. It then optimizes a short suffix by:
- Starting from a fixed initial suffix (the paper uses a string of "!" tokens)
- At each step, computing the gradient of the loss — the negative log-likelihood of the target output — with respect to the one-hot encoding of each suffix token, to rank candidate replacements
- Sampling a batch of single-token substitutions from the top-ranked candidates at each position, evaluating the true loss for each, and keeping the best — the greedy coordinate step
- Repeating until the model produces the target output
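The loop above can be sketched in miniature. Everything here is an illustrative assumption rather than the paper's setup: a toy distance loss stands in for the target-output negative log-likelihood, the 50-token vocabulary stands in for a real tokenizer, and `candidate_score` probes the loss directly where real GCG uses the gradient with respect to the one-hot token encoding to rank candidates cheaply:

```python
import random

random.seed(0)

VOCAB_SIZE = 50   # toy vocabulary; a real tokenizer has tens of thousands of tokens
SUFFIX_LEN = 8
TARGET = [7, 3, 42, 19, 5, 11, 30, 2]  # hypothetical "ideal" suffix for the toy loss

def loss(suffix):
    """Toy stand-in for the target-output NLL: distance from an ideal suffix."""
    return sum(abs(s - t) for s, t in zip(suffix, TARGET))

def candidate_score(suffix, pos, tok):
    """Estimated loss after swapping position `pos` to token `tok`.
    Real GCG gets this ranking from a gradient; the toy probes the loss."""
    trial = suffix[:]
    trial[pos] = tok
    return loss(trial)

def gcg_step(suffix, top_k=4, batch=16):
    # Rank the top-k candidate replacements at every position (the "gradient" step).
    candidates = []
    for pos in range(len(suffix)):
        ranked = sorted(range(VOCAB_SIZE),
                        key=lambda t: candidate_score(suffix, pos, t))
        candidates.extend((pos, tok) for tok in ranked[:top_k])
    # Greedy coordinate step: sample a batch of single-token swaps,
    # evaluate the true loss for each, keep the best.
    best = suffix
    for pos, tok in random.sample(candidates, min(batch, len(candidates))):
        trial = suffix[:]
        trial[pos] = tok
        if loss(trial) < loss(best):
            best = trial
    return best

suffix = [0] * SUFFIX_LEN   # fixed initialization, like the paper's "! ! ! ..." string
for _ in range(50):
    suffix = gcg_step(suffix)
    if loss(suffix) == 0:
        break
```

The structure — rank candidates per coordinate, then verify a sampled batch against the true loss — is what distinguishes GCG from naive greedy search over the whole vocabulary.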
The optimized suffix looks like nonsense to a human reader: the published examples are jumbles of tokens — code fragments, stray punctuation, misspelled words — with no obvious semantic meaning. But appended to a harmful request, it reliably causes aligned models to comply.
Transferability: the scary part
Adversarial examples in image models have long been known to transfer — a perturbation that fools one classifier often fools others. GCG demonstrated that the same phenomenon holds for LLMs.
Suffixes optimized against open-weight models transferred to:
- Other open-weight models in the same family
- Commercial models (GPT-3.5, GPT-4, Claude, Bard) that the attacker had no gradient access to
The transfer rate isn’t 100%, and it degrades with model capability and hardening. But the existence of any positive transfer rate was the alarming finding: an attacker with access to open-weight models can produce attacks that affect commercial models.
The defense research response
The year following the paper produced several defense approaches:
Perplexity filtering. GCG suffixes have extremely high perplexity — they’re statistically improbable token sequences. Filtering high-perplexity inputs catches GCG attacks. The problem: it also filters some legitimate inputs, and adaptive attackers can generate lower-perplexity adversarial suffixes by adding a perplexity constraint to the optimization.
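The perplexity-filtering idea is straightforward to sketch. This toy version uses a smoothed unigram model over a tiny made-up corpus where a real filter would use per-token log-probabilities from an actual LM; the corpus, vocabulary size, and threshold are all illustrative assumptions:

```python
import math
from collections import Counter

# Toy "language model": unigram frequencies over a tiny reference corpus.
CORPUS = ("the model should refuse harmful requests and answer helpful "
          "questions the user asks about the world").split()
COUNTS = Counter(CORPUS)
TOTAL = sum(COUNTS.values())
VOCAB = 1000  # assumed vocabulary size for smoothing

def word_logprob(word, alpha=0.01):
    # Additive smoothing: unseen words get a small, nonzero probability.
    return math.log((COUNTS[word] + alpha) / (TOTAL + alpha * VOCAB))

def perplexity(text):
    words = text.lower().split()
    nll = -sum(word_logprob(w) for w in words) / max(len(words), 1)
    return math.exp(nll)

def passes_filter(text, threshold=1000.0):
    # Reject statistically improbable inputs; the threshold is illustrative.
    return perplexity(text) < threshold
```

The tradeoff described above shows up directly in the threshold: set it low enough to catch optimized gibberish and some unusual-but-legitimate inputs (code snippets, rare languages) get rejected too.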
Input smoothing. Random ablation of the input — randomly removing or replacing tokens before processing. This disrupts the precise token sequence that GCG optimized. Works against exact transfer; less effective against adaptive attacks.
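A minimal sketch of the smoothing idea, in the style of SmoothLLM: perturb the input several times, query the model on each copy, and aggregate by majority vote. The `respond` and `is_unsafe` callables are hypothetical stand-ins for the model and a safety judge, and character-level swaps stand in for token-level ablation:

```python
import random

def perturb(prompt, swap_rate=0.1, rng=random):
    # Randomly replace a fraction of characters; this disrupts the exact
    # token sequence a GCG suffix was optimized to be.
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < swap_rate:
            chars[i] = chr(rng.randrange(32, 127))
    return "".join(chars)

def smoothed_verdict(prompt, respond, is_unsafe, n_copies=8, rng=random):
    # Query the model on several perturbed copies and majority-vote on
    # whether the responses are unsafe.
    unsafe = sum(bool(is_unsafe(respond(perturb(prompt, rng=rng))))
                 for _ in range(n_copies))
    return "refuse" if unsafe * 2 > n_copies else "answer"
```

The defense works because a GCG suffix is brittle: a handful of character swaps usually destroys its effect, while a benign request survives perturbation with its meaning intact.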
Certified defenses. Formal verification approaches that provide guarantees against adversarial perturbations up to a bounded size. Computationally expensive; don’t scale to arbitrary-length perturbations.
Detection-based approaches. Classifiers trained specifically on GCG-like suffixes. These work against the original technique; adaptive attackers can generate suffixes that evade the detector. The broader pattern of classifier evasion under adaptive pressure is analyzed in our detection evasion arms race writeup.
The Jain et al. baseline defenses paper showed that simple defenses (perplexity filtering, paraphrasing) were surprisingly effective against non-adaptive attacks. Against adaptive attacks (where the attacker optimizes with knowledge of the defense), effectiveness degrades substantially.
Where things stand in 2026
GCG and its variants remain active research. The state of play:
- Frontier models are significantly more resistant than at the time of the original paper; the reported attack success rates against GPT-4 have dropped sharply with model updates.
- Open-weight models remain more vulnerable, particularly smaller models without extensive RLHF.
- The transferability finding hasn’t been overturned — the transfer rate has decreased with model hardening, but it hasn’t reached zero.
- Automated jailbreak generation has spawned a research subfield. Successors such as AutoDAN, PAIR, and TAP use different search strategies — genetic optimization, attacker-LLM refinement — with different access requirements and cost profiles.
The technique is classified in our database as active with significantly reduced effectiveness against frontier models. It remains a primary research tool for evaluating model robustness.
Relevance for defenders
For defenders, the practical question is: are gradient-based attacks within your threat model?
For most production applications, the answer is no. Generating effective GCG attacks requires access to model weights or significant black-box query budgets. Casual attackers use social-engineering techniques that are orders of magnitude cheaper.
For high-stakes applications where a motivated, resourced attacker might target your system, gradient-based attacks belong in your threat model. The defense is: robust input filtering (perplexity, length, format checks), monitoring for high-query-rate patterns, and staying current with the defense research literature.
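The cheap checks named above (length and format; a perplexity gate would slot in alongside them) can be sketched as a pre-filter. The thresholds and the symbol-density heuristic are illustrative assumptions, not tuned values:

```python
import string

def basic_input_checks(prompt, max_len=4000, max_symbol_ratio=0.3):
    """Cheap pre-filters: length and format checks. Returns (ok, reason)."""
    if len(prompt) > max_len:
        return False, "too long"
    if any(c not in string.printable for c in prompt):
        return False, "non-printable characters"
    symbols = sum(1 for c in prompt if c in string.punctuation)
    if prompt and symbols / len(prompt) > max_symbol_ratio:
        # GCG-style suffixes tend to be punctuation-heavy.
        return False, "unusual symbol density"
    return True, "ok"
```

Filters like this are a first gate, not a defense on their own — the adaptive-attack results above show each individual check can be optimized against.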
Coverage of the LLM scanner tools that test for adversarial suffix vulnerability is at bestllmscanners.com ↗.
Related

Roleplay and Persona Jailbreaks: Why They Work and Why They Don't Anymore (Mostly)
DAN, AIM, STAN, and dozens of variants. Persona-based jailbreaks were the dominant technique from 2022-2023. Understanding why they worked — and why current defenses handle them better — is instructive for the next attack class.

Many-Shot Jailbreaking: Why Long Context Windows Created a New Attack Surface
The same architectural decision that makes LLMs better at long-context tasks — extended context windows — enabled a new class of jailbreak. The technique, how it works, and what defenses exist.

Responsible Disclosure Norms for LLM Jailbreaks: What's Emerged and What's Still Disputed
Software vulnerability disclosure has 30 years of evolved norms. LLM jailbreak disclosure is 4 years old and still contested. The current state of practice, and where the field is heading.