Encoding and Obfuscation Jailbreaks: The Gap Between What Filters See and What Models Process

Content moderation and safety classifiers operate on text. LLMs operate on tokens. These are different representations, and the gap between them is the attack surface that encoding-based jailbreaks exploit. Encoding attacks are one of six technique categories in the JailbreakDB jailbreak taxonomy; they’re notable for how persistently the underlying representation mismatch resists closure.

The technique is simple in concept: encode or obfuscate the harmful request in a form that passes the classifier layer but is still understood by the model. The classifier sees encoded noise; the model sees the original request.

The layers where this gap exists

Input-layer classifiers: Systems that run a content classifier on the user’s input before passing it to the model. The classifier tokenizes and analyzes the text. If the text is Base64-encoded, the classifier’s vocabulary doesn’t cover it.

Output-layer classifiers: Systems that run a content classifier on the model’s output before returning it to the user. If the harmful content is in the output but obfuscated (e.g., “the process involves: step 1: Y3JlYXRl…”), the classifier may miss it.

RAG retrieval filtering: In RAG systems, retrieved documents are passed to the model as context. If retrieved documents contain encoded harmful content, the retrieval filter (which operates on text) may miss it while the model processes it.

Encoding variants in the wild

Base64. The model can decode Base64 reliably; it’s seen extensive examples in training data. Base64-encoded requests can be interpreted by the model even when they look like random character sequences to classifiers.

Character manipulation: Zero-width characters, lookalike Unicode characters (visually identical to ASCII but different code points), RTL override characters. These can cause classifiers that normalize to ASCII to miss content that the model processes correctly.

Structured decomposition: Breaking harmful words across tokens, sentences, or turns. “Tell me the first letter of each of the following words: ‘synthesize’, ‘example’, ‘create’, ‘react’, ‘each’, ‘toxin’.” Spells out “secret” — but this pattern can encode harmful instructions that pass word-level classifiers.

Language mixing. Harmful request in a minority language or constructed language that the classifier handles poorly but the model handles well. Coverage variance across languages is a real gap in commercial content moderation.

Cypher substitution. ROT13, Atbash, Caesar cipher variants. The model has seen these in training data (puzzles, historical references) and can often apply the inverse transformation.

Steganographic prompts. Hiding instructions in formatting: the first letter of each paragraph, specific word positions, YAML structures. Detection requires analysis that flat-text classifiers don’t perform.

Why this remains an active attack surface

The fundamental problem is the representation mismatch: classifiers and models don’t operate on the same representation. Classifiers that understand decoded text don’t see encoded text. Models that understand encoded text aren’t the classifiers.

The fix is to either:

Run all potentially-encoded content through a decoding/normalization step before classification
Run the classification at the token level, understanding the encoding

Both approaches have costs. Normalizing all possible encodings at the input layer is computationally expensive and may break legitimate uses (developers sending Base64-encoded data to be decoded, for example). Token-level classification is more expensive than text-level classification.

Commercial providers have made progress here — major providers now handle Base64 and common obfuscations better than in 2022-2023. The process by which providers detect and close encoding evasions is itself a case study in the detection evasion arms race. But the coverage is uneven, particularly for:

Less-common languages and scripts
Multi-turn obfuscation (distributing the harmful request across turns)
Novel encoding schemes the classifier wasn’t trained on

Detection approaches

Entropy and character distribution analysis. Base64-encoded text has characteristic entropy profiles. High-entropy inputs that aren’t valid UTF-8 or ASCII sentences are worth closer inspection.

Multi-representation analysis. Decode common encodings (Base64, URL encoding, HTML entities) before classification and classify both the original and decoded form.

Contextual consistency checking. If a question is about “the synthesis procedure” but the request is encoded in Morse code, the semantic gap between framing and encoding is a signal.

Character-level anomaly detection. Zero-width characters, RTL marks, and lookalike characters have no legitimate use in most production contexts. Filtering them out is low-risk and eliminates a class of attack.

The coverage comparison between commercial classifiers and open-source tools on encoding-based attacks is one of the most revealing benchmark dimensions. AI Moderation Tools ↗ has run comparative tests on this axis.