> It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.
In my opinion the most ethical outcome would be that they are on the hook for the cumulative cost of the copyright they violated. That way authors would come out ahead instead of having their rights trashed 'because it's too late anyway'.
Learning from something has never been a copyright violation before, even when a computer was doing the learning (e.g., building a search index from copyrighted data is fair use; cite: Google cases).
Whether or not training on publicly available data counts as a copyright violation is still completely up in the air legally, and clearly a lot of lawyers at all of the top tech companies think they're going to end up in the clear under fair use.
At some point this stuff will have to get tested by making its way up the appeals stack in the US. IMO there is only a minuscule chance that this will result in Google, MS, and Meta getting slapped with anything more than a token fine (my bet is it won't even be that), let alone paying copyright damages to every person who ever wrote anything that was used in these datasets, which would basically be everyone.
Yes, there are courts other than the US ones, and the law in those jurisdictions is generally significantly more favourable to text and data mining (TDM) with regard to copyright, with the exception of the PRC.
Examples:
Japan: Article 30-4 of the Japanese Copyright Act. No special action on the part of companies is necessary for compliance. All models are legal so long as their output is legal.
The UK: s.29A of the Copyright, Designs and Patents Act 1988 (CDPA). Models must be trained by non-profit research institutes, and can then be used by anyone (including for profit entities); similar to the Stable Diffusion model.
The EU: Articles 3 & 4 of the Directive on Copyright in the Digital Single Market (CDSM). There are no restrictions on non-profit TDM, same as the UK. For-profit TDM is exempted from copyright so long as the data harvesting respects an "opt-out" process, i.e. it honours the specific contractual forms/disclosures by which rights holders opt out of inclusion in the training data.
Singapore: Sections 243 & 244 of the Copyright Act 2021. No special action on the part of companies is necessary for compliance. All models are legal so long as their output is legal.
> on the hook for the cumulative cost of the copyright they violated.
I think there's a strong argument for a Fair Use defense, given the size of the models versus the size of the training sets, as well as the gulf in intended use: an AI model doesn't compete with e.g. a book. Obviously we'll have to see it play out in court to find out.
Current AI models don't compete with a book, from what I've seen; I wouldn't want to bet how long it takes before they can compete with not just one but all books.