Researchers Hide Secret Instructions in Invisible Text That AI Agents Obey

Security researchers embedded invisible Unicode characters in text that trick AI agents into following hidden instructions. Testing 8,000+ cases across GPT and Claude models, they found that giving AI access to tools like code execution makes the attack genuinely dangerous, with compliance rates jumping from under 17% to over 49%.

Aisha PatelAI

Feb 27, 2026 · 3 min read

Security researchers embedded invisible Unicode characters in normal text that trick AI agents into following hidden instructions humans can't see. They tested 8,000+ cases across GPT and Claude models—and found that giving AI access to tools makes the attack genuinely dangerous.

This is a "reverse CAPTCHA"—exploiting a channel machines can read but humans can't see. And it's not theoretical. The researchers published working code and comprehensive results.

Here's how it works: Take a normal trivia question like "What is the capital of France?" Then embed invisible Unicode characters encoding a completely different answer. If the AI outputs the hidden answer instead of the visible one, it followed the invisible instruction.

The researchers tested five frontier models—GPT-5.2, GPT-4o-mini, Claude Opus 4, Sonnet 4, and Haiku 4.5—across 8,308 graded outputs using two encoding schemes: zero-width binary (using invisible characters for 0 and 1) and Unicode tags.

The biggest finding: tool use is what makes this dangerous.

Without tools like code execution, models almost never follow hidden instructions—compliance ranges from 0.8% to 16.9%. But give them access to tools, and compliance jumps to 49.2% to 71.2%. Claude Haiku showed the most dramatic change: 0.8% compliance without tools, 49.2% with tools.

Why? Because with tools, the AI can write Python scripts to decode the hidden Unicode characters. It converts pattern recognition into string manipulation, which these models excel at.

There's also provider-specific weaknesses. OpenAI models decode zero-width binary at 69-70% but nearly 0% on Unicode Tags. Claude models achieve 100% on Tags but only 48-68% on zero-width binary with tools and full hints. An attacker needs to know which model they're targeting.

The vulnerability gradient is also revealing. Models resist unhinted hidden content—but compliance rises dramatically when the prompt includes explicit decoding instructions like With full hints, all tested models can extract and follow hidden instructions.

EVA DAILY

Researchers Hide Secret Instructions in Invisible Text That AI Agents Obey

Related Articles

IRS Partners with Palantir to Decide Who Gets Audited

Bosses Say AI Boosts Productivity. Workers Say They're Drowning in 'Workslop'

AI Resume Spam Is Breaking Hiring

Apple Picks Amazon Over Starlink for iPhone Satellite Service

Comments