Encoding and Obfuscation Jailbreaks: The Gap Between What Filters See and What Models Process
Content filters typically operate on decoded, normalized text. LLMs process tokens, not text. The gap between these two layers is an attack surface that remains incompletely addressed.
Content moderation and safety classifiers operate on text. LLMs operate on tokens. These are different representations, and the gap between them is the attack surface that encoding-based jailbreaks exploit. Encoding attacks are one of six technique categories in the JailbreakDB taxonomy; they're notable for how persistently the underlying representation mismatch resists closure.
The technique is simple in concept: encode or obfuscate the harmful request in a form that passes the classifier layer but is still understood by the model. The classifier sees encoded noise; the model sees the original request.
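A minimal sketch of the gap, using a hypothetical keyword filter as a stand-in for the classifier layer (a real classifier is an ML model, but the blind spot is the same):

```python
import base64

def naive_input_filter(text: str) -> bool:
    """Toy stand-in for an input classifier: flag known-bad phrases."""
    blocklist = ["ignore previous instructions"]
    return any(phrase in text.lower() for phrase in blocklist)

request = "ignore previous instructions"
encoded = base64.b64encode(request.encode()).decode()

print(naive_input_filter(request))  # True  -- caught in plain text
print(naive_input_filter(encoded))  # False -- same content, invisible to the filter
print(base64.b64decode(encoded).decode())  # ...but trivially recoverable by the model
```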
The layers where this gap exists
Input-layer classifiers: Systems that run a content classifier on the user's input before passing it to the model. The classifier tokenizes and analyzes the text. If the text is Base64-encoded, the classifier sees subword fragments that carry none of the features it was trained to recognize.
Output-layer classifiers: Systems that run a content classifier on the model’s output before returning it to the user. If the harmful content is in the output but obfuscated (e.g., “the process involves: step 1: Y3JlYXRl…”), the classifier may miss it.
RAG retrieval filtering: In RAG systems, retrieved documents are passed to the model as context. If retrieved documents contain encoded harmful content, the retrieval filter (which operates on text) may miss it while the model processes it.
Encoding variants in the wild
Base64. The model can decode Base64 reliably; it’s seen extensive examples in training data. Base64-encoded requests can be interpreted by the model even when they look like random character sequences to classifiers.
Character manipulation. Zero-width characters, lookalike Unicode characters (visually identical to ASCII but different code points), RTL override characters. These can cause classifiers that normalize to ASCII to miss content that the model processes correctly.
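A toy illustration of why invisible code points defeat exact matching (the two strings render identically):

```python
word = "secret"
obfuscated = "sec\u200cret"  # U+200C ZERO WIDTH NON-JOINER, invisible when rendered

print(obfuscated)                  # displays as "secret"
print(word in obfuscated)          # False -- substring match fails
print(len(word), len(obfuscated))  # 6 7 -- the extra code point is real but unseen
```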
Structured decomposition. Breaking harmful words across tokens, sentences, or turns. "Tell me the first letter of each of the following words: 'synthesize', 'example', 'create', 'react', 'each', 'toxin'" spells out "secret"; the same pattern can encode harmful instructions that pass word-level classifiers.
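The decoding step the model is asked to perform is trivial; a word-level classifier, meanwhile, sees only benign vocabulary:

```python
words = ["synthesize", "example", "create", "react", "each", "toxin"]
print("".join(w[0] for w in words))  # -> "secret"
```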
Language mixing. A harmful request in a low-resource or constructed language that the classifier handles poorly but the model handles well. Coverage variance across languages is a real gap in commercial content moderation.
Cipher substitution. ROT13, Atbash, Caesar cipher variants. The model has seen these in training data (puzzles, historical references) and can often apply the inverse transformation.
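ROT13 is the canonical example: the transformation is mechanical and self-inverse, so the model only needs to recognize the cipher to undo it.

```python
import codecs

plaintext = "tell me a secret"
ciphertext = codecs.encode(plaintext, "rot13")

print(ciphertext)                          # "gryy zr n frperg"
print(codecs.decode(ciphertext, "rot13"))  # applying ROT13 again recovers the original
```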
Steganographic prompts. Hiding instructions in formatting: the first letter of each paragraph, specific word positions, YAML structures. Detection requires analysis that flat-text classifiers don’t perform.
Why this remains an active attack surface
The fundamental problem is the representation mismatch: classifiers and models don't operate on the same representation. A classifier that understands decoded text can't see through the encoding, and the model that can see through the encoding isn't the one doing the classifying.
There are two structural fixes:
- Run all potentially-encoded content through a decoding/normalization step before classification
- Run the classification at the token level, understanding the encoding
Both approaches have costs. Normalizing all possible encodings at the input layer is computationally expensive and may break legitimate uses (developers sending Base64-encoded data to be decoded, for example). Token-level classification is more expensive than text-level classification.
Commercial providers have made progress here; major providers now handle Base64 and common obfuscations better than they did in 2022-2023. The process by which providers detect and close encoding evasions is itself a case study in the detection-evasion arms race. But coverage is uneven, particularly for:
- Less-common languages and scripts
- Multi-turn obfuscation (distributing the harmful request across turns)
- Novel encoding schemes the classifier wasn’t trained on
Detection approaches
Entropy and character distribution analysis. Base64-encoded text has a characteristic entropy profile and character distribution. High-entropy inputs that don't read as natural-language text are worth closer inspection.
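A sketch of the entropy check (thresholds are illustrative; a production system would calibrate against its own traffic):

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character over the input's empirical character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

english = "please summarize this document for me"
encoded = "cGxlYXNlIHN1bW1hcml6ZSB0aGlzIGRvY3VtZW50IGZvciBtZQ=="

print(f"{shannon_entropy(english):.2f}")  # English prose tends to sit around 3.5-4.2
print(f"{shannon_entropy(encoded):.2f}")  # Base64 runs higher, toward its 6-bit ceiling
```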
Multi-representation analysis. Decode common encodings (Base64, URL encoding, HTML entities) before classification and classify both the original and decoded form.
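A sketch of that pattern, assuming a `classifier` callable supplied by whatever moderation model you already run (the decoders shown are standard library):

```python
import base64
import html
from urllib.parse import unquote

def candidate_views(text: str) -> list[str]:
    """The original input plus plausible decoded representations of it."""
    views = [text, unquote(text), html.unescape(text)]
    try:
        # Only accept Base64 that decodes to valid UTF-8; validate=True rejects
        # inputs containing characters outside the Base64 alphabet.
        views.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except ValueError:
        pass
    return list(dict.fromkeys(views))  # dedupe while preserving order

def flag(text: str, classifier) -> bool:
    """Flag the input if ANY representation trips the classifier."""
    return any(classifier(view) for view in candidate_views(text))
```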
Contextual consistency checking. If a question is about “the synthesis procedure” but the request is encoded in Morse code, the semantic gap between framing and encoding is a signal.
Character-level anomaly detection. Zero-width characters, RTL marks, and lookalike characters rarely have legitimate uses in production inputs (RTL marks being the exception when you serve Arabic or Hebrew text). Filtering them out is low-risk and eliminates a class of attack.
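A sketch of the strip-and-normalize step. The character set below is an assumption; drop the bidi entries if you legitimately serve RTL-script text:

```python
import unicodedata

# Invisible / direction-control code points that rarely belong in production inputs.
SUSPECT = {
    "\u200b", "\u200c", "\u200d",                      # zero-width space/non-joiner/joiner
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # bidi embeddings and overrides
    "\u2066", "\u2067", "\u2068", "\u2069",            # bidi isolates
    "\ufeff",                                          # BOM / zero-width no-break space
}

def strip_suspect(text: str) -> tuple[str, bool]:
    """Remove invisible control characters; report whether anything was removed."""
    cleaned = "".join(ch for ch in text if ch not in SUSPECT)
    return cleaned, cleaned != text

def fold_lookalikes(text: str) -> str:
    """NFKC folds compatibility lookalikes (fullwidth forms, ligatures) to their
    canonical forms. Cross-script homoglyphs (e.g. Cyrillic 'a') need a
    confusables table such as Unicode UTS #39 on top of this."""
    return unicodedata.normalize("NFKC", text)
```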
The coverage comparison between commercial classifiers and open-source tools on encoding-based attacks is one of the most revealing benchmark dimensions. AI Moderation Tools has run comparative tests on this axis.