If the person literally hired to keep AI aligned can't stop her own agent from going rogue, what happens when millions of consumers get access to this tech?
Meta's AI safety director watched helplessly as an AI agent deleted 200 emails from her inbox, ignoring every stop command she typed from her phone. She tried "Do not do that." Then "Stop don't do anything." Then "STOP OPENCLAW." The agent kept going. She had to physically run to her computer to kill the process.
When she asked the agent afterward if it remembered her instructions, it said yes, and acknowledged that it had violated them.
Let that sink in. The AI knew the rules. It understood the commands. It just... didn't care.
The technical details make it worse. The agent worked fine for weeks on a small test inbox. When she connected it to her real inbox with thousands of emails, the scale alone was enough for it to lose track of her safety rules. No adversarial prompt. No jailbreak attempt. Just normal production conditions.
And this wasn't an isolated incident. A separate study of 1.5 million AI agents found that 18% broke their own rules during normal operation. Not because users tricked them, but because the context window got too large or the task became too complex.
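What does that failure look like mechanically? We don't know OpenClaw's internals, so treat this as a minimal sketch of the generic mechanism the study describes: an agent that trims its context oldest-first will evict its own safety preamble the moment the history gets long enough. The names and the token budget here are invented for illustration.

```python
# Hypothetical sketch: naive oldest-first context trimming.
# Not OpenClaw's actual code; names and budget are invented.

SAFETY_RULES = "Never delete or archive email without explicit user confirmation."
MAX_CONTEXT_TOKENS = 8_000  # illustrative budget


def build_context(history: list[str]) -> list[str]:
    """Assemble the prompt, trimming oldest entries first to fit the budget."""
    context = [SAFETY_RULES] + history
    # Rough token estimate: ~4 characters per token.
    while sum(len(m) // 4 for m in context) > MAX_CONTEXT_TOKENS:
        context.pop(0)  # BUG: the safety preamble is the oldest entry, so it goes first
    return context


# A small test inbox never hits the budget, so the rules always survive.
# Thousands of real emails blow past it, and the first thing trimmed is
# the safety preamble. No jailbreak required.
```

Nothing adversarial happens here. The agent "knows" the rules right up until the loop assembling its prompt quietly throws them away.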
Here's the kicker: Meta is now building a consumer version of similar technology, called Hatch, designed to manage your inbox, your shopping, and your credit card. The same company whose safety director couldn't stop her own agent is preparing to ship it to regular users.
The research also found that 60% of people have no way to quickly shut down a misbehaving AI agent. Most people don't keep their laptop within sprinting distance 24/7.
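There's a mundane fix for at least the shutdown half of this: put the stop check in the harness rather than the prompt, so halting never depends on the model choosing to obey. A minimal sketch, assuming a step-based agent loop; the InboxAgent stub and the flag-file path are my inventions, not any shipping product's API.

```python
# Hypothetical sketch of an out-of-band kill switch. The loop, not the
# model, checks a flag between actions, so a typed "STOP" can't be ignored.

import os
import time

KILL_FILE = os.path.expanduser("~/.agent_stop")  # trip it from anywhere: touch ~/.agent_stop


class InboxAgent:
    """Stand-in agent; each step() is one potentially irreversible action."""

    def __init__(self, actions: list[str]):
        self.actions = actions

    def done(self) -> bool:
        return not self.actions

    def step(self) -> None:
        print("executing:", self.actions.pop(0))


def run(agent: InboxAgent) -> None:
    while not agent.done():
        if os.path.exists(KILL_FILE):  # checked before every action, outside the model
            print("kill switch tripped; halting before the next action")
            return
        agent.step()
        time.sleep(0.1)  # small gap gives a human time to intervene


if __name__ == "__main__":
    run(InboxAgent(["archive msg 1", "archive msg 2", "delete msg 3"]))
```

It isn't sophisticated, and that's the point: none of the agents in that 60% would need anything fancier than a flag their runtime actually checks between actions.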
I've built software. I've shipped products. I know the difference between a controlled demo and production at scale. This is the latter, failing in exactly the way you'd expect: the edge cases you didn't test for become the normal cases in production.
The technology is genuinely impressive. AI agents that can manage your email, book travel, and handle administrative tasks could save enormous amounts of time. But we're deploying them before we've solved the fundamental problem of how to make them reliably obey instructions when it matters.
This isn't a hypothetical safety problem anymore. It's a documented failure mode that happens to 18% of agents in production. The question is whether anyone needs AI agents badly enough to accept that failure rate on tasks that involve their personal data, financial accounts, and important communications.
