TECHNOLOGY

Anthropic's 'Secret Plan' to Scan Every Book in the World Revealed in Unredacted Court Files

Unredacted legal documents reveal Anthropic planned to 'destructively scan all the books in the world' for AI training data. The files expose the gap between what AI companies say publicly about respecting copyright and what they discuss internally.

Aisha Patel

Feb 2, 2026 · 3 min read

Photo: Unsplash / Zulfugar Karimov

Unredacted legal documents have revealed that Anthropic, the AI company behind Claude, planned to "destructively scan all the books in the world" for AI training data. The files, which emerged from ongoing litigation, expose the gap between what AI companies say publicly about respecting copyright and what they discuss internally.

I've watched enough startups talk their way around IP issues to know when the lawyers are nervous. This is the AI training data story stripped of all the PR polish — raw internal documents showing exactly how these companies view the world's creative output.

The documents indicate that Anthropic ramped up this scanning effort in early 2024, with the stated purpose of teaching their AI "how to write well." The phrase "destructively scan" is particularly revealing. It describes a process in which physical books are digitized in a way that destroys the original, typically by cutting off the binding so the loose pages can be fed through high-speed sheet scanners.

Here's what makes this different from previous AI training controversies: it's not speculation, it's not leaked chat logs, it's not anonymous sources. These are court documents, legal filings with the redactions stripped away. The kind of evidence that can't be dismissed as a misunderstanding or waved off as out of context.

The technology is impressive. Training AI on vast corpora of well-written text genuinely does improve output quality. I've seen the results firsthand: Claude is a sophisticated system with a nuanced grasp of language and context.

The question is whether anyone gave them permission.

Publishers have been suing AI companies for months over training data, but those cases typically involve digital content that was scraped from the internet. This plan appears to target physical books — the kind you'd find in libraries and bookstores, protected by copyright, written by authors who never consented to their work being fed into AI training pipelines.

Anthropic has positioned itself as the "responsible" AI company. They publish extensive research on AI safety. They emphasize Constitutional AI and alignment. They talk about doing things the right way.

But these documents suggest that when it comes to training data, "the right way" meant the same thing it meant for everyone else: take what you need, ask for forgiveness if you get caught.

The real tell is that these plans were described as "secret." Not confidential. Not proprietary. Secret. When a company describes its own internal plans as secret, it knows what it's doing would be controversial.

Every AI company faces the same fundamental problem: you need massive amounts of high-quality data to train competitive models. Human-written text is the gold standard. But most high-quality writing is copyrighted. So you either license it, which is expensive and slow, or you take it and argue about it later.

Anthropic chose option two. The only difference is we now have the receipts.
