July 24th, 2024

AI models collapse when trained on recursively generated data

Recent research in Nature reveals "model collapse" in AI: training on data generated by previous models leads to irreversible defects and a misrepresentation of the original data distribution, underscoring the need for genuine human-generated data.

Recent research published in Nature highlights a phenomenon termed "model collapse" in artificial intelligence (AI) models, particularly large language models (LLMs) and generative models like variational autoencoders (VAEs) and Gaussian mixture models (GMMs). The study investigates the implications of training these models on data generated by their predecessors, revealing that indiscriminate use of such recursively generated data leads to irreversible defects. As models are trained on content produced by earlier versions, they begin to lose information about the true underlying data distribution, particularly the tails of this distribution, resulting in a degenerative process where the models converge to a state that poorly represents the original data.

The authors identify three primary sources of error contributing to model collapse: statistical approximation error, functional expressivity error, and functional approximation error. These errors compound over generations, causing models to misinterpret reality and ultimately converge to a distribution with significantly reduced variance. The research emphasizes the importance of maintaining access to genuine human-generated data, as the increasing prevalence of LLM-generated content on the internet will pollute future training datasets. The findings underscore the necessity of addressing model collapse to preserve the benefits of training on large-scale data, as the value of authentic human interactions becomes increasingly critical in the evolving landscape of AI.
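
As a rough illustration of how the statistical approximation error alone can erase the tails, here is a minimal toy sketch in Python (not the paper's experimental setup): a categorical distribution is repeatedly refit to samples drawn from the previous fit. Once a rare token fails to appear in a sample, it gets probability zero and can never return, so the support shrinks generation after generation.

```python
# Minimal toy sketch (not the paper's setup): fit a categorical "model" to
# samples, then repeatedly retrain on data sampled from the previous fit.
# Rare tokens drop out of the empirical counts, so the tails of the original
# distribution disappear over generations.
import numpy as np

rng = np.random.default_rng(0)
vocab = 1000
true_dist = rng.dirichlet(np.full(vocab, 0.1))   # heavy-tailed "true" distribution

dist = true_dist
for generation in range(10):
    samples = rng.choice(vocab, size=5000, p=dist)   # generate data from the current model
    counts = np.bincount(samples, minlength=vocab)
    dist = counts / counts.sum()                     # "retrain" by refitting on that data
    support = (dist > 0).sum()
    print(f"gen {generation}: tokens with nonzero probability = {support}")
```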

36 comments
By @simonw - 3 months
> We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models

The key word there is "indiscriminate". All of the big AI labs have been training on synthetic data for at least a year at this point, but they're doing so deliberately.

I don't think the "model collapse" problem is particularly important these days. The people training models seem to have that well under control.

By @nostrademons - 3 months
This has happened with much simpler models than LLMs, e.g. Google Suggest became noticeably worse when everybody started using Google Suggest to input their queries, because it was trained on real query logs and those query logs started to simply reproduce the output of the Suggest model. SEO and Webspam have similar problems within Google Search.

More broadly, this is a reflection of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The issue is that any model's purpose is to capture novel, useful data about real human behavior. Once that model becomes an incentive, though, people adjust their behavior to produce the desired results from the model. Authentic behavior disappears, which means there's no useful information content for the model to capture, and future generations of the model instead just reproduce behaviors of the previous generation they were trained on, including quirks. Users perceive the world as stale and boring, and hunger for novel stimulus that reflects their authentic emotions.

You could look at this as a full-employment theorem for entrepreneurs and artists.

By @Kuinox - 3 months
Meanwhile, OpenAI and Anthropic train on AI-generated data to improve their models, and it works.

https://openai.com/index/prover-verifier-games-improve-legib...

https://www.anthropic.com/research/claude-character

By @TJM2084 - 2 months
Here's Gretel's response to the Nature paper on model collapse. It covers the methodology, and how it's flawed, in detail, and highlights a lot of other great synthetic-data research.

https://towardsdatascience.com/addressing-concerns-of-model-...

By @mcswell - 3 months
I must be missing something. Training on the output of your system as if it were validated input seems like an obvious no-no. I'm not talking about using synthetic data (however that might be created in this situation), but rather using anything and everything found on the web as if it were "real", i.e. as if it were human-generated texts rather than the output of the LLM.

In this case of course there are multiple LLMs that are creating text which finds its way to the web, but to the extent that the output of the different LLMs have commonalities, this still seems problematic.

And afaik, there are no metrics or algorithms that reliably distinguish between human-generated and LLM-generated text, at least not for the current generations of LLMs.

What am I missing?

By @simonw - 3 months
The source code that accompanies the paper is available in a zip file here: https://zenodo.org/records/10866595

I copied that into a Gist to make it easier to browse here: https://gist.github.com/simonw/b3ab1588a681dda821da9fb57290d...

By @hiddencost - 3 months
A lot of these papers are wrong. They do something wrong in their setup and then claim their conclusion shows a general truth.

Publishing in Nature in ML can actually be a red flag, because they're really not well equipped to evaluate a lot of claims.

The latest Llama model got a lot of its data using labels from Llama 2, and every frontier lab is talking about self-training as the future.

By @vzaliva - 3 months
I call this "LLM inbreeding." It's a vicious loop where new models are trained on AI-generated content, resulting in the quality degenerating with each generation.
By @bjourne - 3 months
The article contains no proof of theorem 3.1 and finding counterexamples seems trivial. Adult male weight can be modeled by N(85, 20). You can recursively "train" the model on data it generates without having it collapse. It will stay stationary as long as the samples are large enough.
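
For anyone who wants to check this numerically, here is a quick sketch (a toy of my own, not the paper's code): refit a Gaussian to samples drawn from the previous fit and track the standard deviation. With a finite sample the fitted sigma performs a multiplicative random walk, and by Jensen's inequality its expected value shrinks a little each generation, so over enough generations the variance tends to drift toward zero even when each sample is fairly large, which is essentially the paper's Gaussian argument.

```python
# Quick numerical check: recursively refit a Gaussian to samples drawn from
# the previous fit and watch what happens to the standard deviation.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 85.0, 20.0        # N(85, 20) for adult male weight, as in the comment

for generation in range(10001):
    samples = rng.normal(mu, sigma, size=1000)   # data generated by the current model
    mu, sigma = samples.mean(), samples.std()    # "retrain" by refitting mean and std
    if generation % 2000 == 0:
        print(f"gen {generation}: mu = {mu:.2f}, sigma = {sigma:.2f}")
```
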
By @igorkraw - 3 months
It should be noted that

1. this is nothing that should surprise anyone who has an intuition on control theory and the evolution of unconstrained Markov chains

2. there appear to be relatively easy mitigations https://news.ycombinator.com/item?id=41061085 (made a separate post because it might be of independent interest to discuss)

3. you still won't get beyond the imitation game boundary without exploration & feedback, i.e. the recursive improvement doomers are, as of now, still wrong

By @mcguire - 3 months
Nature published a computer science paper???!

"Given that training a single moderately large model produces twice the American lifetime’s worth of CO2 (ref. 15), we opted to not run such an experiment and instead focus on a more realistic setting for a proof of concept."

By @megaman821 - 3 months
There are other ways AI can help train other AI that aren't generating data. AI could remove low quality data from a training set. It could assist humans in structuring video, 3D and physics simulation datasets for the best learning results.
By @throwthrowuknow - 3 months
So they fine-tuned an existing model using its own completions to produce the training set for the next run, which uses the fine-tuned model as the base. They mention catastrophic forgetting, so they are aware of it. I suppose they wanted to get results as quickly as possible, but this isn't an accurate model of reality (pun not intended). They've only succeeded in demonstrating something that is well known. If they had made the effort to simulate mitigation of bad data and a growing corpus that included proportionally more synthetic data over time, it would have been interesting.
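
For readers skimming, the protocol being described looks roughly like the sketch below. This is a schematic only: finetune, generate_completions, and prompts_from are hypothetical placeholders rather than a real training API. It shows where the paper's main setting trains on purely synthetic completions and where mixing original human data back in (a mitigation the paper also examines) would slot in.

```python
# Schematic only: finetune(), generate_completions(), and prompts_from() are
# hypothetical placeholders, not a real training API.
def recursive_finetune(base_model, human_data, generations, real_fraction=0.0):
    model = finetune(base_model, human_data)              # generation 0 trains on real data
    for _ in range(generations):
        synthetic = generate_completions(model, prompts_from(human_data))
        # The paper's main setting is real_fraction = 0 (train on completions only);
        # preserving a share of the original human data is one of the mitigations examined.
        n_real = int(len(synthetic) * real_fraction)
        model = finetune(model, synthetic + human_data[:n_real])  # next base = previous fine-tune
    return model
```
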
By @nfca - 3 months
I thought this was fairly obvious. Imperfections would only compound over time. Does anyone remember recursively inter-translating between two languages?
By @Strumpfli - 3 months
If I'm correct, we generally perceive AI-generated data to be indistinguishable from human-sourced data, and we don't have a tool to reliably assess whether a text is AI-generated.

However, could it be that texts generated by AI models possess some kind of statistical property which causes training to collapse? Then, would it allow us to use it to detect AI texts?

By @exabrial - 3 months
Maybe this is the true test of intelligence instead of "emulating intelligence"?

I can learn from Pythagoras' work, extend it, combine it, apply it, and produce works that are more valuable than the original. Perhaps that gets recognized as important, and others then take that, learn, and repeat the process, adding their own experience and increasing the general intelligence.

By @koliber - 3 months
Conceptually, an LLM is a lossy compression of all of the data it saw during training. If you feed it lossy data, at each iteration you will get a poorer and poorer signal and more noise.

Prior generations learned this by copying VHS tapes over and over and making photocopies of photocopies. You can see it today by opening and saving a JPG over and over again.
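
The JPEG version of that experiment is easy to try. Here is a tiny sketch, assuming Pillow is installed and using a placeholder "input.jpg": it re-encodes the image each generation and prints how far the pixels have drifted from the original (most of the visible loss typically happens in the first few re-encodes).

```python
# Tiny sketch of the "photocopy of a photocopy" analogy: re-encode a JPEG
# repeatedly and measure drift from the original. "input.jpg" is a placeholder
# path for any photo you have handy.
import numpy as np
from PIL import Image

original = np.asarray(Image.open("input.jpg").convert("RGB"), dtype=float)
path = "input.jpg"
for generation in range(1, 21):
    img = Image.open(path).convert("RGB")
    path = f"gen_{generation}.jpg"
    img.save(path, quality=75)                     # lossy re-encode
    current = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    drift = np.abs(current - original).mean()      # how far we've drifted from the source
    print(f"generation {generation}: mean pixel drift = {drift:.2f}")
```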

By @dang - 3 months
Related ongoing thread:

The problem of 'model collapse': how a lack of human data limits AI progress - https://news.ycombinator.com/item?id=41058867 - July 2024 (6 comments)

By @tyingq - 3 months
Which is good background to this story about Reddit locking down robots.txt and trying to get money from the AI teams scraping their content.

https://news.ycombinator.com/item?id=41057033

By @jksmith - 3 months
Given a time snapshot and enough computing power, isn't recursion inevitable? It's like running out of the known universe given time x. So then we're back to creating data without a prior dataset, which is still a human domain.
By @anon291 - 3 months
Is this an artifact of floating-point precision or a fundamental mathematical truth?
By @nurettin - 3 months
I don't see how this hurts training unless you hurl all hallucinations back at the model.

AlphaZero used a similar approach where it trained against itself, and that only made it better. I don't think collapse is real.

By @zby - 3 months
If model collapse means that the text produced by it is not statistically identical to the garbage that fills the Internet, then I guess a collapse is the goal.
By @m3kw9 - 3 months
Of course it will collapse if you don't verify it. I remember OpenAI talking about its research into having a different model verify that data somehow.
By @betenoire - 3 months
Seems analogous to the effect of echo chambers on humans
By @ziofill - 3 months
Very interesting. But wouldn't human preferences still find their way into the datasets of the future?
By @daft_pink - 3 months
It's like the AI-generated version of index funds.
By @swayvil - 3 months
There's a complexity missing there. It's like the effects of incest upon DNA. Or an echo chamber upon conversation.
By @jlos - 3 months
As far as I understand Douglas Hofstadter's Gödel, Escher, Bach, self-referential recursive structures (strange loops) are the foundation of consciousness (among other interesting things). I've been watching to see if LLMs becoming self-referential actually improves them as opposed to degrading them.
By @FredPret - 3 months
Data in --> slop out.

Slop in --> yikes

By @asadm - 3 months
"Breathing in your own exhaust can be fatal"
By @DaoVeles - 3 months
No say it ain't so /s

I wrote this over a year ago. Don't build a city on rock and roll. Don't build a business on a fractal.

https://theluddite.org/#!post/the-snake-eats-itself

By @roughly - 3 months
Back when I was getting my econ degree, we were taught about the Ultimatum game, which goes like this: You get two participants who don't know each other and will (ostensibly) never see each other again. You give one of them $100, and they make an offer of some portion of it to the other. If the other accepts, both parties keep their portion; so, if A offers B $20 and B accepts, A keeps $80 and B keeps $20, and if B rejects, both parties get nothing. Standard economic theory suggests A can offer $1 and B will accept, because otherwise B gets nothing. Spoiler for those of you who haven't seen how standard economic theory plays out in real life: that's not how the game went. Typically, offers below ~$30 or so got rejected, because B was a real, feeling person who felt like they were getting screwed and opted to punish A for doing so. The exception to this - the people who would take the $1 offer - were people who had been taught economic theory. It turns out you _could_ screw them over and they'd pat themselves on the backs for being very wise.

The "tragedy of the commons" is another one of those parts of standard economic theory that never actually played out in reality - we've got examples from all over the world of communities implementing practices and often entire belief systems that led them to be responsible stewards of shared resources without requiring unilateral ownership of that resource and singular acquisition of the benefits of that stewardship, and yet first on the lips of every modern capitalist when describing why they're at a disadvantage if they're not the ones polluting the water supply is the tragedy of the commons.

By @padraicmahoney - 3 months
This seems extremely interesting, but I don't have the time right now to read this in depth (given I would also need to teach myself a bunch of technical concepts too).

Anyone willing to weigh in with a theoretical intuition? The one in the paper is just a little inaccessible to me right now.