The training data transparency problem is one of the most consequential unresolved questions in AI. A paper just published in the Oxford Journal of International Law proposes the most credible legal mechanism I've seen for addressing it.
The proposal is called contextual copyleft, and it works by extending the logic that already undergirds open source software licensing into the AI era. To understand why it's worth taking seriously, you need to understand what copyleft actually is.
When Richard Stallman developed the GNU General Public License in the 1980s, he created something clever: a license that uses copyright law to enforce sharing. If you incorporate GPL-licensed code into your software and distribute it, you must make the corresponding source, including your modifications, available under the same terms. The license propagates through derivative works. This is copyleft virality - the legal mechanism that has made open source ecosystems self-sustaining.
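To make that concrete at the file level: many projects declare the license with an SPDX identifier at the top of each source file, and the copyleft obligation travels with the file. A minimal sketch - the SPDX convention and GPL header format are real, the function is just a placeholder:

```python
# SPDX-License-Identifier: GPL-3.0-or-later
# Copyright (C) <year>  <name of author>
#
# Because this file is GPL-licensed, anyone who distributes a modified
# or combined work containing it must make the corresponding source
# available under the same GPL terms - that is the copyleft "virality".

def greet(name: str) -> str:
    """Placeholder function; the point is the license header above."""
    return f"Hello, {name}"
```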
The AGPLv3 extended this further: if you run AGPL-licensed software on a server and let people interact with it over a network, you must make the source available to those users. This closed the "server-side loophole" that allowed companies to modify open source software and offer it as a service without contributing back.
The contextual copyleft proposal takes the AGPLv3 logic one step further. If you train an AI model on code licensed under contextual copyleft terms, you must disclose:

- a description of the dataset used to train the model;
- the code used to train the model;
- the trained model itself, made available to all users under open terms.
In plain English: you can't take code that carries this license, use it to train a closed commercial model, and ship that model without disclosing what went into it. The license virality extends from the training data to the model.
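The paper doesn't prescribe a concrete disclosure format, but a rough sketch helps show what a model trainer would have to publish under such terms. Everything below - the field names, the URLs, the structure - is invented purely for illustration:

```python
# Hypothetical sketch only: neither the paper nor any existing license
# defines this structure; all names and URLs are made up for illustration.
training_disclosure = {
    "dataset_description": (
        "Public repositories carrying the contextual-copyleft license, "
        "with the snapshot date and filtering criteria described here"
    ),
    "training_code": "https://example.org/acme/training-pipeline",  # hypothetical repo
    "trained_model": "https://example.org/acme/model-weights",      # hypothetical release
    "model_license": "open terms extended to all users of the model",
}

if __name__ == "__main__":
    for field, value in training_disclosure.items():
        print(f"{field}: {value}")
```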
This is not a novel legal theory invented from scratch. It extends existing, legally tested copyleft principles that have survived multiple court challenges over decades. The authors - publishing in a peer-reviewed Oxford journal - engage seriously with the relevant copyright and regulatory environments in both the US and EU.
The practical implications are significant. Most major AI coding models - GitHub Copilot, Amazon CodeWhisperer, Google Gemini - have been trained on substantial quantities of open source code. Whether that training constitutes copyright infringement is being actively litigated. But if major open source projects began adopting a contextual copyleft license, they would create a new legal instrument: a license condition that didn't exist when today's models were trained but would apply to future training runs.
The mechanism the authors propose aligns with the Open Source Initiative's definition of open source AI, which already requires disclosure of detailed information about the training data, the training code, and the model weights. That alignment matters: it means contextual copyleft isn't trying to redefine what open source means; it's trying to enforce it.
Will this get adopted? That's a harder question. Major open source foundations - the Linux Foundation, the Apache Software Foundation, the Free Software Foundation - would need to endorse or adopt the license for it to reach critical mass. That requires consensus among communities with diverse interests and strong opinions about licensing.
But the fact that this proposal is legally rigorous, peer-reviewed, and grounded in existing precedent means it's at least in the conversation. The alternative - hoping that AI companies will voluntarily disclose their training data - is not a viable strategy.




