The training data transparency problem is one of the most consequential unresolved questions in AI. A paper just published in the Oxford Journal of International Law proposes the most credible legal mechanism I've seen for addressing it.
The proposal is called contextual copyleft, and it works by extending the logic that already undergirds open source software licensing into the AI era. To understand why it's worth taking seriously, you need to understand what copyleft actually is.
When Richard Stallman developed the GNU General Public License in the 1980s, he created something clever: a license that uses copyright law to enforce sharing. If you build GPL-licensed code into your software and distribute it, you must release the resulting work, source code included, under the same terms. The license propagates through derivative works. This is copyleft virality - the legal mechanism that has made open source ecosystems self-sustaining.
The AGPLv3 extended this further: if you run a modified version of AGPL-licensed software on a server and let people interact with it over a network, you must offer those users the source. This closed the "server-side loophole" that allowed companies to modify copyleft software and offer it as a service without contributing back, since under the ordinary GPL the sharing obligation is triggered only by distribution.
The contextual copyleft proposal takes the AGPLv3 logic one step further. If you train an AI model on code licensed under the contextual copyleft terms, you must release three things: a description of the training dataset, the code used to train the model, and the trained model itself, made available to all users under open terms.
In plain English: you can't take code that carries this license, use it to train a closed commercial model, and ship that model without disclosing what went into it. The license virality extends from the training data to the model.
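For readers who think in code, here is a toy sketch of how the trigger for sharing widens from GPL to AGPL to contextual copyleft. It is an illustration, not legal advice and not the paper's own formalism: the `License`, `Use`, and `obligations` names are hypothetical, the `"contextual-copyleft"` identifier is made up, and I am assuming that contextual copyleft inherits the AGPL's distribution and network triggers, which the description above implies but does not state outright.

```python
from dataclasses import dataclass
from enum import Enum


class License(Enum):
    PERMISSIVE = "permissive"                    # no share-alike duty at all
    GPL = "gpl"
    AGPL = "agpl"
    CONTEXTUAL_COPYLEFT = "contextual-copyleft"  # hypothetical identifier


@dataclass(frozen=True)
class Use:
    """What you do with the licensed code (simplified to three flags)."""
    distributes_software: bool = False
    offers_network_service: bool = False
    trains_model: bool = False


def obligations(lic: License, use: Use) -> set[str]:
    """Return what must be shared, under this toy model of each license."""
    copyleft = {License.GPL, License.AGPL, License.CONTEXTUAL_COPYLEFT}
    network_aware = {License.AGPL, License.CONTEXTUAL_COPYLEFT}  # assumption

    duties: set[str] = set()
    # GPL-style trigger: distributing software built on the code.
    if lic in copyleft and use.distributes_software:
        duties.add("source code")
    # AGPL-style trigger: letting users interact with it over a network.
    if lic in network_aware and use.offers_network_service:
        duties.add("source code")
    # Contextual copyleft trigger: training a model on the code.
    if lic is License.CONTEXTUAL_COPYLEFT and use.trains_model:
        duties |= {
            "training dataset description",
            "training code",
            "trained model",
        }
    return duties


if __name__ == "__main__":
    training_only = Use(trains_model=True)
    for lic in License:
        print(lic.value, sorted(obligations(lic, training_only)))
```

Run against a training-only use, only the contextual copyleft entry comes back with obligations; the GPL and AGPL entries come back empty, because in this simplified model training alone does not trigger them. That empty result is the gap the proposal targets.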
This is not a novel legal theory invented from scratch. It extends existing, legally tested copyleft principles that have survived multiple court challenges over decades. The authors - publishing in a peer-reviewed Oxford journal - engage seriously with the relevant copyright and regulatory environments in both the US and EU.
