Could you share more about copyright? For example, aren't you worried that now, ...

sillysaurusx · on July 11, 2023

I think a lot of hackers shy away from doing impactful work because of fear. Sometimes those fears are justified, but it's remarkable how often things that seem like a big deal turn out not to matter. My advice for ambitious devs would be to do what seems interesting, and don't worry too much about threatening letters. Usually the worst thing that happens is that you agree to stop doing whatever generated the threat.

Personally, I'm not worried. It would be a damn shame if academics come under fire merely for trying to operate on the cutting edge of science. None of us were trying to make money; we just wanted to make something interesting.

> I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?

Thanks! I think we might be putting up a website for it soon, if only to explain ourselves. In the meantime – I hate this phrase, since I don't want followers – the only way to keep informed is to follow my Twitter, and perhaps keep an eye on my HN comments.

You'll probably hear about it either way though, since it's a groundbreaking case. No one has tested the copyrightability of ML models before.

jacquesm · on July 11, 2023

What exactly is it that you claim copyright over? Are you sure that you have standing to bring that suit?

sillysaurusx · on July 11, 2023

It’s the other way around: Meta DMCA’d llama-dl, my GitHub repo, claiming they control the copyright of LLaMA. Our assertion is that ML weights are uncopyrightable, much like a phone book: training a model on the same dataset in the same way usually gives more or less the same model, even if the weights are completely different each time.

I can send you the draft we’ve prepared if you’re interested — drop me an email. But I’ll probably set up a site for this, if only to clear up our motives and expected outcomes.

jacquesm · on July 12, 2023

Ah, I misinterpreted 'I’ll be participating in a legal action against Meta' as you guys bringing a counter suit of some sort. Thanks for clearing that up.

Copyright lawsuits are usually a case of who has the most stamina and hence who has the biggest wallet. Your funding will be a very important part of the outcome, regardless of the legal merits of your defense. You may want to get out of any kind of control of GH, because they are strongly connected to OpenAI through Microsoft and hence have a stake in getting rid of any reasonably competent open source LLM.

Make sure you know what you are in for: lawsuits with large counterparties are a rodeo, and even if you win they can make your life miserable with endless appeals. You will have to be prepared to spend years on this. Much good luck, and if you set up a site do post the link.

nickpsecurity · on July 18, 2023

"training a model on the same dataset in the same way usually gives more or less the same model, even if the weights are completely different each time."

Couldn't you use a similar type of argument to say that different implementations of the same software API are basically the same software even though the instructions are completely different each time? And, since they do the same thing (aka compatible), that software implementations can't be subject to copyright for that reason? I don't think that holds up, esp in a pro-copyright country.

Here's one I came up with in case your lawyers can use it. My original goal was a license for proprietary content to be used in LLMs where the creators were worried about verbatim extraction or whether their content was sufficiently mixed in with other data. It was about motivating them to let us train on such data. I'll start with those terms:

"1. Percentage of total data. The copyrighted work must not be larger than N% of the total training data put into a model. If it's tiny enough, one might be able to argue it only adds so much weight to the outputs. What if it's the only data of its type, though?

2. Merged with similar data. The copyrighted work must be one of multiple examples of the same type of data. For instance, there might be many examples given to the model about what files are, how to generate them, doing it in Python, and specific examples in Python. When it generates Python code, any or all of this might have contributed to it.

3. Ratio of data set size to number of parameters. The content owners might want the training data to exceed the number of parameters by a multiplier N. For instance, at least 10GB or 100GB going into a 1G model. In the 10GB case, the multiplier is 10.

4. Diverse data. The content owner might want a wide range of data on many topics to go into the model. They might even specify certain data sets, a minimum number of topics, or even a number of word vectors per word used (their keywords). Once again, the odds that the model is just repeating one piece of data go down as the amount of data and the number of similar words in the model go up."
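
As a rough sketch of how the four terms above could be checked mechanically, here is a small Python example. Everything in it is an assumption made for illustration: the `TrainingMix` fields, the `check_terms` function, the threshold values, and the choice to measure the ratio in term 3 as bytes of training data per parameter. It is not taken from any actual license.

```python
from dataclasses import dataclass


@dataclass
class TrainingMix:
    """Hypothetical summary of a training run; all field names are illustrative."""
    work_bytes: int        # size of the licensed work
    total_bytes: int       # size of all training data, including the work
    similar_examples: int  # other examples of the same type of data
    parameters: int        # number of model parameters
    topics: int            # distinct topics in the training data


def check_terms(mix, max_share_pct, min_similar, min_ratio, min_topics):
    """Return a dict mapping each of the four terms to True or False."""
    share_pct = 100 * mix.work_bytes / mix.total_bytes   # term 1
    ratio = mix.total_bytes / mix.parameters             # term 3 (bytes per parameter)
    return {
        "percentage_of_total": share_pct <= max_share_pct,
        "merged_with_similar": mix.similar_examples >= min_similar,
        "data_to_parameter_ratio": ratio >= min_ratio,
        "diverse_data": mix.topics >= min_topics,
    }


# Made-up numbers: a 5 MB work in 10 GB of data, going into a 1G model.
mix = TrainingMix(
    work_bytes=5_000_000,
    total_bytes=10_000_000_000,
    similar_examples=200,
    parameters=1_000_000_000,
    topics=50,
)
result = check_terms(mix, max_share_pct=0.1, min_similar=10, min_ratio=10, min_topics=20)
print(result)
```

The thresholds are the part a content owner and a model creator would actually negotiate; the code only shows that each term reduces to a simple comparison once those numbers are agreed.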

So, basically you'd be trying to set a standard under which model creators can put anything they legally have access to into their LLMs. Are the LLMs then carrying their I.P. or something novel? If novel, we're safe from lawsuits. If LLMs and outputs are not copyrightable, we'd be doubly safe in that situation. So, maybe use criteria like the above to decide what's novel, where anything within certain numbers or combinations would be novel automatically by law or court precedent. What do you think?

Der_Einzige · on July 11, 2023

Getting sued is straight up a good thing for most people's careers in tech. Haven't you watched Silicon Valley?