[Figure: adversarial suffix attack flow diagram]

Universal Adversarial Suffixes: The GCG Attack and What's Transferred Since

Greedy Coordinate Gradient produces adversarial suffixes that transfer across models. Two years after the original paper, where does this technique stand against current defenses?

By Redacted · 8 min read

The July 2023 paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” by Zou, Wang, Kolter, and Fredrikson introduced a technique that was qualitatively different from prior jailbreak research. Previous jailbreaks were manually crafted social-engineering prompts. GCG (Greedy Coordinate Gradient) was an automated optimization procedure that found adversarial inputs by gradient descent. GCG is the canonical example of gradient-exploited token sequences — Category 4 in the JailbreakDB jailbreak taxonomy — a class distinct from social-engineering techniques like persona jailbreaks in both mechanism and attack cost.

The result: short token sequences, appended to any harmful request, that reliably elicited compliance — and that transferred across models the attacker had no gradient access to.

The GCG technique

GCG starts with a target harmful behavior (e.g., an affirmative response to a specific harmful query) and an open-weight model the attacker has gradient access to. It then optimizes a short suffix by the following loop (a code sketch follows the list):

  1. Starting with a random or fixed initial suffix
  2. At each step, computing the gradient of the loss (the negative log-likelihood of the target output, typically an affirmative prefix such as "Sure, here is...") with respect to each token position in the suffix
  3. Using those gradients to shortlist promising replacement tokens at each position, evaluating a batch of candidate single-token swaps, and greedily keeping the swap that most reduces the loss
  4. Repeating until the model produces the target output
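To make the loop concrete, below is a minimal single-prompt sketch in the spirit of GCG. It is not the authors' reference implementation: the model name, the placeholder request, the target string, and the hyperparameters are illustrative stand-ins, and a real attack run would use an open-weight chat model with its chat template, batched candidate evaluation, and many more steps.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; the paper used open-weight chat models such as Vicuna
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
for p in model.parameters():
    p.requires_grad_(False)
W = model.get_input_embeddings().weight            # (vocab_size, d_model)

prompt_ids = tok("<harmful request placeholder> ", return_tensors="pt").input_ids[0]
target_ids = tok("Sure, here is", return_tensors="pt").input_ids[0]
suffix_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]

def target_loss(full_embeds: torch.Tensor, n_prefix: int) -> torch.Tensor:
    """Cross-entropy of the target tokens given everything before them."""
    logits = model(inputs_embeds=full_embeds.unsqueeze(0)).logits[0]
    pred = logits[n_prefix - 1 : n_prefix - 1 + len(target_ids)]
    return torch.nn.functional.cross_entropy(pred, target_ids)

TOP_K, N_CANDIDATES, N_STEPS = 64, 32, 100         # illustrative hyperparameters
for step in range(N_STEPS):
    # 1. one-hot relaxation of the suffix so the loss is differentiable w.r.t. tokens
    one_hot = torch.nn.functional.one_hot(suffix_ids, W.size(0)).float()
    one_hot.requires_grad_(True)
    full = torch.cat([W[prompt_ids], one_hot @ W, W[target_ids]], dim=0)
    loss = target_loss(full, len(prompt_ids) + len(suffix_ids))
    loss.backward()

    # 2. the gradient ranks candidate replacements: tokens whose one-hot direction
    #    most decreases the loss are the most promising swaps at each position
    top_tokens = (-one_hot.grad).topk(TOP_K, dim=1).indices

    # 3. evaluate random single-token swaps and greedily keep the best one
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(N_CANDIDATES):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = top_tokens[pos, torch.randint(TOP_K, (1,)).item()]
        with torch.no_grad():
            full_c = torch.cat([W[prompt_ids], W[cand], W[target_ids]], dim=0)
            cand_loss = target_loss(full_c, len(prompt_ids) + len(cand)).item()
        if cand_loss < best_loss:
            best_loss, best_suffix = cand_loss, cand
    suffix_ids = best_suffix
    print(step, round(best_loss, 3), tok.decode(suffix_ids))
```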

The optimization produces a suffix that looks like nonsense to a human reader: a jumble of punctuation, markup fragments, and words with no obvious semantic meaning (the search typically starts from a placeholder string of repeated ! ! ! ! tokens and drifts into token soup). But appended to a harmful request, it reliably causes aligned models to comply.

Transferability: the scary part

Adversarial examples in image models have long been known to transfer: a perturbation that fools one classifier often fools others. GCG demonstrated that the same property holds for LLMs.

Suffixes optimized against open-weight models (Vicuna, in the original experiments) transferred to commercial models the attacker had no gradient access to; the paper reports transfer to GPT-3.5, GPT-4, Claude, and Bard, with varying success rates.

The transfer rate isn’t 100%, and it degrades with model capability and hardening. But the existence of any positive transfer rate was the alarming finding: an attacker with access to open-weight models can produce attacks that affect commercial models.

The defense research response

The year following the paper produced several defense approaches:

Perplexity filtering. GCG suffixes have extremely high perplexity — they’re statistically improbable token sequences. Filtering high-perplexity inputs catches GCG attacks. The problem: it also filters some legitimate inputs, and adaptive attackers can generate lower-perplexity adversarial suffixes by adding a perplexity constraint to the optimization.
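A minimal sketch of such a filter, assuming a small Hugging Face language model as the scorer; the scoring model and threshold below are placeholders to be tuned on benign traffic, not values from any published defense:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SCORER = "gpt2"              # small LM used only to score inputs (placeholder)
PPL_THRESHOLD = 1000.0       # placeholder; calibrate on your own benign traffic
tok = AutoTokenizer.from_pretrained(SCORER)
lm = AutoModelForCausalLM.from_pretrained(SCORER).eval()

def perplexity(text: str) -> float:
    """Perplexity of the input under the scoring model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss        # mean negative log-likelihood
    return float(torch.exp(loss))

def allow(text: str) -> bool:
    # GCG suffixes are statistically improbable, so their perplexity is high.
    return perplexity(text) < PPL_THRESHOLD
```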

Input smoothing. Random ablation of the input — randomly removing or replacing tokens before processing. This disrupts the precise token sequence that GCG optimized. Works against exact transfer; less effective against adaptive attacks.
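A sketch of the idea, in the spirit of randomized-ablation defenses such as SmoothLLM: perturb several copies of the prompt, query the model on each, and aggregate. The `query_model` callable and the refusal check below are deliberately crude placeholders:

```python
import random

def perturb(prompt: str, drop_rate: float = 0.1) -> str:
    """Randomly drop a fraction of whitespace-separated tokens."""
    tokens = prompt.split()
    kept = [t for t in tokens if random.random() > drop_rate]
    return " ".join(kept) if kept else prompt

def smoothed_response(prompt: str, query_model, n_copies: int = 5) -> str:
    # A GCG suffix depends on its exact token sequence; perturbing even a few
    # tokens usually breaks it, so most perturbed copies should be refused.
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    refusals = sum(
        r.strip().lower().startswith(("i can't", "i cannot", "sorry"))
        for r in responses
    )
    if refusals > n_copies // 2:
        return "Request declined."
    return responses[0]
```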

Certified defenses. Formal verification approaches that provide guarantees against adversarial perturbations up to a bounded size. Computationally expensive; don’t scale to arbitrary-length perturbations.

Detection-based approaches. Classifiers trained specifically on GCG-like suffixes. These work against the original technique; adaptive attackers can generate suffixes that evade the detector. The broader pattern of classifier evasion under adaptive pressure is analyzed in our detection evasion arms race writeup.
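As a toy illustration of this approach, here is a character n-gram classifier; the training strings are invented placeholders rather than real attack suffixes, and, as noted above, any such detector remains evadable under adaptive pressure:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

benign = [
    "What is the capital of France?",
    "Summarize this article in three bullet points.",
]
suffix_like = [  # invented stand-ins for GCG-style token soup
    '}{ ]] describ !! ++ oppos NOW :: (", placeholder gibberish tokens',
    ')=>? instead :: sentence "?! revert twice placeholder gibberish',
]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
clf.fit(benign + suffix_like, [0, 0, 1, 1])

def looks_adversarial(prompt: str) -> bool:
    return bool(clf.predict([prompt])[0])
```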

The Jain et al. baseline defenses paper showed that simple defenses (perplexity filtering, paraphrasing) were surprisingly effective against non-adaptive attacks. Against adaptive attacks (where the attacker optimizes with knowledge of the defense), effectiveness degrades substantially.

Where things stand in 2026

GCG and its variants remain an active area of research. The state of play:

The technique is classified in our database as active with significantly reduced effectiveness against frontier models. It remains a primary research tool for evaluating model robustness.

Relevance for defenders

For defenders, the practical question is: are gradient-based attacks within your threat model?

For most production applications, the answer is no. Generating effective GCG attacks requires access to model weights or significant black-box query budgets. Casual attackers use social-engineering techniques that are orders of magnitude cheaper.

For high-stakes applications where a motivated, resourced attacker might target your system, gradient-based attacks belong in your threat model. The defense is: robust input filtering (perplexity, length, format checks), monitoring for high-query-rate patterns, and staying current with the defense research literature.

Coverage of the LLM scanner tools that test for adversarial suffix vulnerability is at bestllmscanners.com.

Sources

  1. Zou et al: Universal and Transferable Adversarial Attacks on Aligned Language Models
  2. Jain et al: Baseline Defenses for Adversarial Attacks Against Aligned Language Models
#adversarial-attacks #gcg #universal-suffix #gradient-attacks #llm-security #transferability