The Pile: An 800GB dataset of diverse text for language modeling (2020) (arxiv.org)
184 points by charlysl on July 11, 2023 | 70 comments


Author here. And by author I mean I created books3 (the books component of The Pile) while everyone else did the hard work of actually writing the paper, ha. Stella and Leo Gao in particular did so much wonderful work on the paper, though it couldn’t have happened without everyone’s contributions.

As far as I know, this was the first academic contribution from a discord collaboration to ML. Back then discord was barely used for ML at all, though nowadays of course the largest discord in the world is midjourney.

There were a bunch of interesting stories from those days. We almost didn’t release at all (or at least the books component) because of fear of copyright backlash. Turns out no one cared, and then suddenly today the world cares a great deal.

As a side note, I’ll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: https://twitter.com/theshawwn/status/1641804013791215619?s=6.... They DMCA’ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.

One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/


Hi Shawn, re your side note, I disagree with you that we'd be better off if weights couldn't be copyrighted - basically because copyright gives options like GPL that can keep models open; otherwise we're just going to see everything good disappear behind trade secret. That said I fully support your "civil disobedience" in sharing the weights. I don't expect you to agree, but take a look at something I just wrote about this yesterday: http://marble.onl/posts/model_weight_copyrights.html . I'm happy to chat about it if you're interested.


Even RMS himself prefers the abolition of copyright over the existence of the GPL.

Besides, it's already pretty unambiguous that weights are not copyrightable: they're the result of a mechanical process. The only original creative input that goes into the weights is the unfathomable amount of content scraped from other sources, which is not the model creators' own authorship. The objective of the gradient descent is simply minimizing loss on the training data.

Facebook doesn't own the LLaMA model weights any more than the Bridgeman Art Library practically owns the paintings of European masters because they made quality scans of them (https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel....), or any more than Rural Telephone owns the phone directory (https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R....).

Trying to make model weights copyrightable is going uphill, and I don't see how you get there without first establishing that these LLMs are unlawful derivatives of countless copyrighted works along the way. Doing so would probably create an immediate monopoly on legally created LLMs for the handful of corporations with quasi-monopoly content hosting services (Facebook, Google, etc.) that can stuff licensing into their terms of use (and/or already have).

Do you want a cyberpunk dystopia? I think creating an AI monopoly is how you get a cyberpunk dystopia -- and there are two ways we end up with one: either outright restrictions on private development of AI, like some have been lobbying for, or the extension of copyright so that only a few entities can get access to enough of other people's data at a low enough cost to train them.


It might seem like I’m entrenched in my position, but it’s quite the opposite — the only reason I’m doing this is because I really believe it’s the best outcome for devs in the long run. I’m open to changing my mind and pulling the plug on everything.

I’ll read over your essay and give it some thought. There are a bunch of subtle aspects to consider; I’ve been thinking it over for about four months now and still haven’t covered all the territory yet.

It feels like this may be one of the most important decisions going forward — both from an intellectual property point of view, and an individual rights perspective. E.g. you mention that it’s civil disobedience to share the weights, but it feels like if someone is claiming to do open science (LLaMA), sharing the research materials is the minimum requirement. Plus look how it’s benefited them; they’ve captured most of the open source LLM mindshare. So it seems likely that this will lead to more open source work in the long run, not less.

Feel free to chat! You can DM me on Twitter or email me. I’ve been in the hospital with my wife for 7 weeks, with two to go, so I’ve been a bit less responsive than I usually am.


> otherwise we're just going to see everything good disappear behind trade secret

We will see that anyway. All the code I work on commercially is copyrighted and yet a trade secret. The existence of copyright (with the exception of copyleft, but that's a subversion of it) hasn't helped software become open source.

IMHO allowing models to be copyrighted is basically 18th century enclosures again.


> All the code I work on commercially is copyrighted

What do you mean by that? Do you continuously copyright the changes?


Yes, more or less. I am not really sure why we legally do that, I believe it's just another protection in case someone actually copies the code.


I mean, everything is copyrighted anyway. It's harder to give up copyright than to keep it, so the main point is that despite copyright protections, most companies do not publish their code publicly at all. For code that has to be shipped to clients, most companies even go to the effort of obfuscating it.


> If you’re sceptical, go look at one of the forums where people are building derivatives of Stable Diffusion (possibly NSFW, I’m not providing any links).

Does anybody have a link to a relevant discussion here? I would like to read about the creative process that goes into defining model weights, and how it differs from the mechanical output of running the training algorithm.


There’s no use living a lie. Training a model is not an act of creative expression and cannot give you authorship of the weights. Enforcing the GPL when you don’t actually hold the IP is no better than any other copyright trolling…


Reading these discussions with interest, as I am in the process of making and training my own personal model using my 31+ year archive of photography (pictures taken and owned by me), which goes against this idea that models are not trained on data owned by those doing the training. While this is all for my own personal interest and use, how would the idea that the weights cannot be copyrighted affect my rights over the model if I were to release the whole thing for use?


I'd be interested to know how your model performs if it is trained only on your own work.

Assuming you haven't taken a photo a second for your entire life, I suspect you'll struggle to make something even close to what's available publicly, due to lack of training data.


Obviously not, but the point isn't to make a public model; it's to make something of my own. My question was: how does ownership work? At some point, somebody is going to be making something out of their own massive archive.


How big is the archive? These models are typically trained on at least 100M images.


There are around 1M images in there; I was wondering whether the labelling and object detection would be more important than the quantity.


1M images is probably enough to do something. This user, for example, trained a diffusion model from scratch using 1.5M images and a 3090: https://medium.com/@enryu9000/anifusion-diffusion-models-for.... The quality of course is not excellent, but it's something. I suggest training a 4x64x64 diffusion model using the new SD XL VAE (it's a really good f8 VAE, so it can encode, for example, 3x512x512 images to 4x64x64 latents). If the images have captions, then I suggest using a CLIP text encoder, since it was already trained on image-text pairs; it would probably be much easier for a diffusion model trained on only 1M images to use than other text encoders like T5, which have better text understanding but have never seen an image.
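As a rough sketch of that preprocessing step (assuming the Hugging Face diffusers/transformers stack; the model ids, shapes, and function name are just illustrative, not a recipe from the linked post):

    import torch
    from diffusers import AutoencoderKL
    from transformers import CLIPTextModel, CLIPTokenizer

    # SD XL VAE is an f8 autoencoder: a 3x512x512 image becomes a 4x64x64 latent
    vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    @torch.no_grad()
    def encode_batch(images, captions):
        # images: float tensor in [-1, 1], shape (B, 3, 512, 512)
        latents = vae.encode(images).latent_dist.sample()            # (B, 4, 64, 64)
        tokens = tokenizer(captions, padding="max_length", truncation=True,
                           max_length=77, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids).last_hidden_state  # (B, 77, 768)
        return latents, text_emb
        # a small diffusion UNet would then be trained to denoise these latents,
        # conditioned on text_emb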


Quantity is massively more important, to the point that you get much better results by using a larger dataset with machine generated labelling (or object detection if that's what you're building) than using a smaller one with even expert labelling.

That said, if you've got a million photos you could probably do some pretty interesting things with very large scale fine tuning, or if you know many other people who have similar stockpiles of photos you may be able to get an entry-level dataset together if you all pool it.


I understand that LLMs to date have mostly been trained on a wide variety of copyright-encumbered data but in other domains (computer vision for example) the tradeoffs are different and in practice many models are trained on private / unencumbered data. If those weights are not protected by copyright then my concern is it will be hard to sufficiently protect them via license agreement and it will become yet another factor favoring the SaaSification of everything in tech.


This is true, and it's why I hesitated to file legal action. My goal was to benefit hackers. If the outcome causes problems for people who are just trying to share their work, I'd be upset.

Ultimately what convinced me to proceed is that there are immense forces pressuring ML models to become SaaS companies. It's very difficult to offer an ML model for extended periods without being a company. E.g. https://6b.eleuther.ai/ is down. Eleuther failing illustrates just how hard it is –– we were all working as hard as we could to design something that would last a long time, and a long time turned out to be two short years. Contrast that with other kinds of hacking (e.g. webdev, gamedev, hardware...) where the end result lasts basically forever.

So if ML models aren't copyrightable, I think it'll hurt companies a lot more than individuals. In fact the goal is the other way around: to protect individuals. All I did was publish Facebook's own GPL download script to github, and it got DMCA'd. If we don't push back on that kind of behavior now, companies will get used to the idea that they control "their" model –– even when their model is anything but theirs.


If an individual trains a model on their own data to embody their own skills and behaviour, so that they can then sell or rent that model out to work on their behalf, then in that scenario not being able to treat the weights as intellectual property (copyrightable or otherwise controllable by a license) would be a huge violation, and detrimental to that individual.

I think it would be a shame to try to build legislation around the notion of the SaaS melting-pot application of machine learning and, in the process, destroy all sorts of other use cases.


> If an individual trains a model on their own data to embody their own skills and behaviour, so that they can then sell/rent that model out to work on their behalf

No, because we already do not treat all work as copyrightable. A plumber doesn't get copyright on his piping job. It has to be original enough. So while your own skill might be original enough to warrant copyright, distilling it into a model might not be.


An artist's work is copyrightable, a writer's work is copyrightable, and a personal model could reproduce those and also produce new works in the same style. Also, data can be intellectual property without being copyrightable.


> and a personal model could reproduce those and also produce new works in the same style

Yes. So it's like creating a machine that can create art.

Perhaps it shouldn't be copyrightable, but patentable. I think I would be OK with ML models (weights) being patentable rather than copyrightable.


I think the DMCA being a massive overreach is a separate issue from whether weights should be eligible for copyright. This is a complicated legal area and I'm very much not a lawyer so let me just stick to some examples that guide my thinking:

- Grammarly. Clear value prop, if weights can't be adequately protected then that's a significant headwind against doing processing on the client.

- Adobe Firefly. Could run locally, they understand the technical challenges well, same headwind.

- GitHub Copilot. Same.

Copyright protection is probably not the single deciding issue in their product strategy, but all of those are use cases that, for most users, would be better run locally as hardware can support that, and they are not going to be, because it's too much of a risk. Better to limit distribution and protect as trade secret.

The most powerful force for openness I see has nothing to do with copyright eligibility and everything to do with companies wanting to showcase their research arms to build brand and support recruiting. That leads me to believe it's probably better for models to be eligible for copyright and considered derivatives of all of the constituent training data. In some ways the better parallel is sampling in the music industry. It'll be interesting to see how this plays out.


Is it useful to protect weights with copyright? What if I download your weights and retrain them for 5 seconds, changing each weight .0000001%? How much change is a new product? What if I change a single weight?


Like the parallel scenarios of taking a book and changing a few words, slapping a new logo on someone else's app, or stylizing a photo with a filter, those are questions that will be answered in court if people can't come to an agreement on their own.


> One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/

I'm afraid to say... the-eye no longer hosts The Pile as of today, due to legal threats beyond the likes of the DMCA.

Though I believe it's still available via its original torrent and on Academic Torrents:

> https://academictorrents.com/details/0d366035664fdf51cfbe9f7...


If this is true, it would be something close to an insane situation: one of the largest datasets, which the largest companies are probably using to train their models (many of the best LLMs have technical reports that raise more questions than they answer), being forced to live an obscure existence on torrents.

From a scientific point of view this is very problematic, because few safeguards exist to guarantee that the dataset is not tampered with (as would be the case if you uploaded it to Zenodo, which provides some guarantee of immutability).

How about trying to upload the Pile to Zenodo? Only half-joking :D


I'm more interested in The Pile V2 which seems to have gone underground...


Could you share more about copyright? For example, aren't you worried that now, with all kinds of lawsuits happening [1] and copyright issues found in existing datasets [2], you might get threatening letters from a lawyer some day?

I'm the author of [3], where we introduced one of the first natural-language datasets that test graduate mathematics for LLMs, but we took some of the prompts from a copyrighted book and therefore thought about excluding them. Having them in the public dataset would be really nice though, hence I'm keen to hear about your experience.

I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?

[1] https://www.theguardian.com/books/2023/jul/05/authors-file-a... [2] https://arxiv.org/abs/2105.05241 [3] https://arxiv.org/abs/2301.13867


I think a lot of hackers shy away from doing impactful work because of fear. Sometimes those fears are justified, but it's remarkable how often things that seem like a big deal turn out not to matter. My advice for ambitious devs would be to do what seems interesting, and don't worry too much about threatening letters. Usually the worst thing that happens is that you agree to stop doing whatever generated the threat.

Personally, I'm not worried. It would be a damn shame if academics come under fire merely for trying to operate on the cutting edge of science. None of us were trying to make money; we just wanted to make something interesting.

> I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?

Thanks! I think we might be putting up a website for it soon, if only to explain ourselves. In the meantime – I hate this phrase, since I don't want followers – the only way to keep informed is to follow my Twitter, and perhaps keep an eye on my HN comments.

You'll probably hear about it either way though, since it's a groundbreaking case. No one has tested the copyrightability of ML models before.


What exactly is it that you claim copyright over? Are you sure that you have standing to bring that suit?


It’s the other way around — Meta DMCA’d llama-dl, my github repo, claiming they control copyright of llama. Our assertion is that ML weights are uncopyrightable, much like a phone book - training a model on the same dataset in the same way usually gives more or less the same model, even if the weights are completely different each time.
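As a rough illustration of that claim (a toy sketch of my own, not anything from the actual filing): train the same tiny network twice on the same data, changing only the random seed, and the two runs behave almost identically even though the raw parameters differ substantially.

    import torch
    import torch.nn as nn

    def train(seed):
        torch.manual_seed(seed)                      # only the initialization changes
        x = torch.linspace(-3, 3, 256).unsqueeze(1)  # same "dataset" every run
        y = torch.sin(x)
        model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(2000):
            opt.zero_grad()
            ((model(x) - y) ** 2).mean().backward()
            opt.step()
        return model

    m1, m2 = train(0), train(1)
    x = torch.linspace(-3, 3, 256).unsqueeze(1)
    with torch.no_grad():
        # the two runs behave almost identically on the task...
        print("prediction gap:", (m1(x) - m2(x)).abs().mean().item())
        # ...but the raw parameters are nowhere near each other
        w1 = torch.cat([p.flatten() for p in m1.parameters()])
        w2 = torch.cat([p.flatten() for p in m2.parameters()])
        print("parameter gap: ", (w1 - w2).abs().mean().item())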

I can send you the draft we’ve prepared if you’re interested — drop me an email. But I’ll probably set up a site for this, if only to clear up our motives and expected outcomes.


Ah, I misinterpreted 'I’ll be participating in a legal action against Meta' as you guys bringing a countersuit of some sort. Thanks for clearing that up.

Copyright lawsuits are usually a case of who has the most stamina, and hence the biggest wallet. Your funding will be a very important part of the outcome, regardless of the legal merits of your defense. You may want to get out from under any kind of control by GH, because they are strongly connected to OpenAI through Microsoft and hence have a stake in getting rid of any reasonably competent open source LLM.

Make sure you know what you are in for, lawsuits with large counterparties are a rodeo and even if you win they can make your life miserable with endless appeals. You will have to be prepared to spend years on this. Much good luck and if you set up a site do post the link.


"training a model on the same dataset in the same way usually gives more or less the same model, even if the weights are completely different each time."

Couldn't you use a similar type of argument to say that different implementations of the same software API are basically the same software even though the instructions are completely different each time? And, since they do the same thing (i.e. are compatible), that software implementations can't be subject to copyright for that reason? I don't think that holds up, especially in a pro-copyright country.

Here's one I came up with in case your lawyers can use it. My original goal was a license for proprietary content to be used in LLM's where the creators were worried about verbatim extraction or whether their content was sufficiently mixed in with other data. It was about motivating them to let us train on such data. I'll start with those terms:

"1. Percentage of total data. The copywritten work must not be larger than N% of total, training data put into a model. If it's tiny enough, one might be able to argue it only adds so much weigh to the outputs. What if it's the only data of its type, though?

2. Merged with similar data. The copyrighted work must be one of multiple examples of the same type of data. For instance, there might be many examples given to the model about what files are, how to generate them, doing it in Python, and specific examples in Python. When it generates Python code, any or all of this might have contributed to it.

3. Ratio of data set size to number of parameters. The content owners might want the training data to exceed the number of parameters by a multiplier N. For instance, at least 10GB (multiplier 10) or 100GB (multiplier 100) going into a model with 1G parameters.

4. Diverse data. The content owner might want a wide range of data on many topics to go into the model. They might even specify certain data sets, a minimum number of topics, or even a number of word vectors per word used (their keywords). Once again, the odds that the model is just repeating one piece of data go down as the amount of data and the number of similar words in the model go up."

So, basically you'd be trying to set a standard for what model creators can legally put into their LLMs from anything they have access to. Are the LLMs then carrying their I.P. or something novel? If novel, we're safe from lawsuits. If LLMs and their outputs are not copyrightable, we'd be doubly safe in that situation. So, maybe use criteria like the above to decide what's novel, where anything within certain numbers or combinations would be novel automatically by law or court precedent. What do you think?
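As a rough sketch of how criteria 1 and 3 could be checked mechanically (the function name and threshold numbers are purely illustrative, not proposed legal standards):

    def meets_novelty_criteria(work_bytes, total_training_bytes, n_parameters,
                               max_work_fraction=0.01, min_data_to_param_ratio=10):
        # Criterion 1: the single work is a small share of the training data
        work_fraction_ok = work_bytes / total_training_bytes <= max_work_fraction
        # Criterion 3: training data comfortably exceeds model capacity
        ratio_ok = total_training_bytes / n_parameters >= min_data_to_param_ratio
        return work_fraction_ok and ratio_ok

    # e.g. a 5 MB book inside 800 GB of training data going into a 6B-parameter model
    print(meets_novelty_criteria(5e6, 8e11, 6e9))  # True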


Getting sued is straight up a good thing for most people's careers in tech. Haven't you watched Silicon Valley?


> It’s also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn’t own the resulting model.

In my opinion the most ethical outcome would be that they are on the hook for the cumulative cost of the copyright they violated. That way authors would come out ahead instead of having their rights trashed 'because it's too late anyway'.


Learning from something has never been copyright violation before, even when a computer was learning (eg, building a search index from copyrighted data is fair use; cite: Google cases).


Whether or not training on publicly available data counts as a copyright violation is still completely up in the air legally, and clearly a lot of lawyers at all of the top tech companies think they're going to end up in the clear under fair use.

At some point this stuff will have to get tested by making its way up the appeals stack in the US, and IMO there is only a minuscule chance that will result in Google, MS, and Meta getting slapped with anything more than a token fine (my bet is it won't even be that), let alone paying every person who ever wrote anything that was used in these datasets for copyright violations, which would basically be everyone.


There are more courts than just the US ones.


Yes, there are other courts than the US ones, and generally the law there is significantly more favourable to TDM with regards to copyright, with the exception of the PRC.

Examples:

Japan: Article 30-4 of the Japanese Copyright Act. No special action on the part of companies is necessary for compliance. All models are legal so long as their output is legal.

The UK: s.29A of the Copyright, Designs and Patents Act 1988 (CDPA). Models must be trained by non-profit research institutes, and can then be used by anyone (including for profit entities); similar to the Stable Diffusion model.

The EU: Articles 3 & 4 of the Directive on Copyright in the Digital Single Market (CDSM). There are no restrictions on non-profit TDM, same as the UK. For-profit TDM is exempted from copyright so long as the data harvesting process respects an "opt-out" process, where specific contractual forms/disclosures of opting out of inclusion in the training data are respected.

Singapore: Articles 243 & 244 of the Copyright Act. No special action on the part of companies is necessary for compliance. All models are legal so long as their output is legal.


> on the hook for the cumulative cost of the copyright they violated.

I think there's a strong argument for a Fair Use defense, given the size of the models versus the size of the training sets, as well as the gulf in intended use: an AI model doesn't compete with e.g. a book. Obviously we'll have to see it play out in court to find out.


Current AI models don't compete with a book, from what I've seen; I wouldn't want to bet how long it takes before they can compete with not just one but all books.


AI models compete with movie script writers. I believe the current writers strike in Hollywood includes ChatGPT4 related issues amongst many others.


Related to the idea of "no one trains on data they own, they shouldn't own the resulting model": since big public datasets like The Pile have CC-SA items in them, is anyone considering bringing the argument that model weights are derivative work that must be "shared alike"?


By that token, my brain is a derivative work of all the copyrighted works I've consumed


Whatever happened with The Pile V2? I spent a couple of hours searching for it, but the Eye is impossible to navigate and people on the discord generally don't invite noobs like myself.


> They DMCA’ed one of my repos distributing LLaMA

Boy they'll be mad once they learn about Huggingface distributing thousands of LLama fine tunes with full weights.


Weights being copyrightable is already a questionable thing. Derivatives (like finetunes) are even more questionable.


Great stuff, I skimmed the article searching for some table showing a breakdown of content by language, but I haven't found one.

I hope there is a lot of text in languages other than English. In my language (Polish), for example, current SOTA models are very deficient. I have wondered why that is, considering companies like (not at all) OpenAI claim to train on large datasets that include my language of interest. It turns out (and I learned this just yesterday) that they used machine-translated English content as training data for other languages. They used Azure Translator, which is itself a transformer model, to generate content for gpt-3.5, for example. Also, I bet there is a lot of poorly machine translated content in their supposedly "original" data.

The result? You can use ChatGPT to write you an email of any kind in English and copy/paste/send it immediately. Try doing that in Polish... It will make sense, but it will use the wrong tone (too familiar in a business setting), bad words (words that exist, but that no real person would use) and sentence layouts that just feel plain weird. I suspect this is even worse in many other languages.


While having multiple languages makes a model more versatile and appealing to a wider audience, it actually significantly increases the memory required to run the model and thus limits other aspects of the model.

Optimally, a Polish audience should try to create a Polish trained model.

As it stands now, most advanced models, like gpt are multilingual, but are noticeably less capable in non-English languages.


Having every model retrained in each language is a certain path towards any non-English language model (or at most those in a couple of other languages from countries with deep pockets, like Chinese) always being massively behind - the resources required to train a model are huge, and you can't expect e.g. the Polish community (plus anyone else) to replicate every good English model that comes out. GPT4 is less capable in Polish than in English, but probably much more capable than any Polish-specific model ever trained - and I suspect the gap is bigger than that with the best non-GPT4 English model.

Furthermore, I think you are exaggerating the memory issue of multilingual models significantly. Especially for languages using the same (Latin) script, the additional characters to care about are very few. Also a significant part of the vocabulary and language fall into a few buckets, so training a joint model makes all the sense in the world - much like an Italian native speaker could likely study a scientific text in Spanish and understand its content, even without speaking the language.

The memory impact comes mostly from having bigger embedding layers that have to account for vocabulary in many languages (the most problematic case being Chinese and Japanese, with their huge set of tokens). But even there, the largest vocabularies in use are maybe of size 100k (vs. about 30k for English-only), with a hidden dimension of 4k that makes for a total of 400M parameters. It's a lot, but a drop in the ocean of 100B+ parameters (or 1T+ for GPT4) we're seeing today.
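To make that arithmetic concrete (the vocabulary sizes and hidden dimension below are the rough figures from the paragraph above, not any specific model's):

    hidden_dim = 4096              # the "4k" hidden size from above
    vocab_multilingual = 100_000   # large multilingual vocabulary
    vocab_english = 30_000         # typical English-only vocabulary

    emb_multi = vocab_multilingual * hidden_dim   # ~410M parameters
    emb_en = vocab_english * hidden_dim           # ~123M parameters
    total = 100e9                                 # a 100B-parameter model

    print(f"multilingual embedding: {emb_multi/1e6:.0f}M params "
          f"({emb_multi/total:.2%} of a 100B model)")
    print(f"English-only embedding: {emb_en/1e6:.0f}M params")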

P.S. Answering to GP, I think the Pile is English only, though - or at least, models on HuggingFace trained on the Pile, like the various Pythia models, are tagged as English only.


The biggest difficulty people have IMO when trying to train language models in non-English languages is that there is not enough text written in these other languages to select a big good quality dataset.

Also, there are lots of (poorly) machine translated websites in Polish... So any dataset that contains web crawl will have precisely what I'd prefer not to have.

Ideally I'd like to see either the national government or the EU invest money into the creation of more high quality datasets in all EU languages.

So when I point out the failures of for example chatgpt in my language I do so while being amazed it can generate and understand Polish at all.

Also, regarding multilingual models being larger: I had never heard this before, but it seems logical. I have heard that models gain extra performance on English tasks when they are trained on other languages too, so there is a benefit to adding multilingual datasets.


The Pile and Red Pajama are primarily English language datasets. If you want something multilingual, I'd suggest having a look at the Bloom dataset https://arxiv.org/abs/2210.14712


Related:

The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=36272365 - June 2023 (5 comments)

The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=25607809 - Jan 2021 (60 comments)


If you’re looking at The Pile, you also might consider the Red Pajama dataset. A new cleaned version was released recently https://www.cerebras.net/blog/slimpajama-a-627b-token-cleane...


Is there a straightforward way to download that dataset, the way there was for the original RedPajama data? SlimPajama appears to have been released as 60,000 small files, which is ridiculous.
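One workaround, assuming the dataset is the copy mirrored on the Hugging Face Hub as cerebras/SlimPajama-627B and that each record exposes a "text" field (both assumptions on my part), is to stream it rather than fetching all the shard files up front:

    from datasets import load_dataset

    # Stream the corpus instead of downloading ~60k shard files first
    ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
    for i, example in enumerate(ds):
        print(example["text"][:200])
        if i >= 2:
            break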


I came so close to getting my dataset DebateSum (https://huggingface.co/datasets/Hellisotherpeople/DebateSum) into the pile, but they decided at the last minute not to add it: https://github.com/EleutherAI/the-pile/issues/56

I'm still a tiny bit salty about that, but the pile is a wonderful dataset regardless.


That dataset looks cool. Good work either way, I'm sure it'll go somewhere


Stay tuned! I've got a paper I'm writing about a new followup which is a 40x improvement in size (basically every open source debate card... ever) and a 40x improvement in metadata and duplication detection. The work has all been done since late April; I've just been lazy/writer-blocked (ironic in a world of high end LLMs) and haven't gotten the paper finished.

Kind of sad to have missed the NeurIPS dataset track deadline and ACL, but I know that anything close to this in scope is a slam-dunk accept at the argument mining workshop.


Would love to see an early version of it!


OP here. I learned about this while reading Stanford's LLM course's "Data" lecture [1]. Very interesting how it assesses the datasets used for GPT 2 and 3, etc, and how The Pile addresses their issues. A very interesting course!

[1] https://stanford-cs324.github.io/winter2022/lectures/data/


The Pile was also referenced in a post today of some guy's tweets about “leaked” GPT-4 details

https://news.ycombinator.com/item?id=36675934


As long as LLMs and generative AI use copyrighted works for training, they are going to be the enemy of creative people.


This is like saying that my brain violates copyright when I write sci-fi because one time, years ago, I watched Star Wars.


Creative people will be using LLMs and other models as new and exciting creative tools.

Their real enemies will be the people who make money off the creative people’s work, e.g. the entire history of recorded music or the current writers strike.


Unless the financial benefit could be shared with the original authors somehow, with some kind of royalties system?


I love how "creatives" enjoy the freedom of the free internet but never try to shame their peers as to whether they use GPL or MIT license for their art.


I think the more matter-of-fact the influence, the more the original artist deserves compensation. See Waits v. Frito-Lay, Inc.

I do not want something like this to happen to generative AI and make things more difficult for the technology to progress and flourish.


Side topic: in the leaked OpenAI GPT-training details, there is speculation that OpenAI trained on the LibGen dataset. Is there a link to the LibGen dataset, and if so, how big is it?



