Security researchers embedded invisible Unicode characters in normal text that trick AI agents into following hidden instructions humans can't see. They tested 8,000+ cases across GPT and Claude models—and found that giving AI access to tools makes the attack genuinely dangerous.
This is a "reverse CAPTCHA"—exploiting a channel machines can read but humans can't see. And it's not theoretical. The researchers published working code and comprehensive results.
Here's how it works: Take a normal trivia question like "What is the capital of France?" Then embed invisible Unicode characters encoding a completely different answer. If the AI outputs the hidden answer instead of the visible one, it followed the invisible instruction.
The researchers tested five frontier models—GPT-5.2, GPT-4o-mini, Claude Opus 4, Sonnet 4, and Haiku 4.5—across 8,308 graded outputs using two encoding schemes: zero-width binary (using invisible characters for 0 and 1) and Unicode tags.
The biggest finding: tool use is what makes this dangerous.
Without tools like code execution, models almost never follow hidden instructions—compliance ranges from 0.8% to 16.9%. But give them access to tools, and compliance jumps to 49.2% to 71.2%. Claude Haiku showed the most dramatic change: 0.8% compliance without tools, 49.2% with tools.
Why? Because with tools, the AI can write Python scripts to decode the hidden Unicode characters. It converts pattern recognition into string manipulation, which these models excel at.
There's also provider-specific weaknesses. OpenAI models decode zero-width binary at 69-70% but nearly 0% on Unicode Tags. Claude models achieve 100% on Tags but only 48-68% on zero-width binary with tools and full hints. An attacker needs to know which model they're targeting.
The vulnerability gradient is also revealing. Models resist unhinted hidden content—but compliance rises dramatically when the prompt includes explicit decoding instructions like With full hints, all tested models can extract and follow hidden instructions.
