AI models collapse when trained on recursively generated data
Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.
Recent research published in Nature highlights a phenomenon termed "model collapse" in artificial intelligence (AI) models, particularly large language models (LLMs) and generative models like variational autoencoders (VAEs) and Gaussian mixture models (GMMs). The study investigates the implications of training these models on data generated by their predecessors, revealing that indiscriminate use of such recursively generated data leads to irreversible defects. As models are trained on content produced by earlier versions, they begin to lose information about the true underlying data distribution, particularly the tails of this distribution, resulting in a degenerative process where the models converge to a state that poorly represents the original data.
The authors identify three primary sources of error contributing to model collapse: statistical approximation error, functional expressivity error, and functional approximation error. These errors compound over generations, causing models to misinterpret reality and ultimately converge to a distribution with significantly reduced variance. The research emphasizes the importance of maintaining access to genuine human-generated data, as the increasing prevalence of LLM-generated content on the internet will pollute future training datasets. The findings underscore the necessity of addressing model collapse to preserve the benefits of training on large-scale data, as the value of authentic human interactions becomes increasingly critical in the evolving landscape of AI.
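The mechanism can be illustrated with a toy simulation (a minimal sketch under simplifying assumptions, not the paper's actual experiment): repeatedly fit a Gaussian to a finite sample drawn from the previous generation's fit, and the estimated variance decays as the statistical approximation error described above compounds across generations.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0      # generation 0: the "true" data distribution N(0, 1)
    n_samples = 100           # finite training set per generation

    for generation in range(1, 1001):
        data = rng.normal(mu, sigma, n_samples)  # "train" on the previous model's output
        mu, sigma = data.mean(), data.std()      # refit; estimation error accumulates
        if generation % 200 == 0:
            print(f"gen {generation:4d}: mu={mu:+.4f}  sigma={sigma:.3e}")

The tails go first: once the fitted variance shrinks, rare events are no longer sampled, so no later generation can recover them.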
Related
Large Language Models are not a search engine
Large Language Models (LLMs) from Google and Meta generate algorithmic content, causing nonsensical "hallucinations." Companies struggle to manage errors post-generation due to factors like training data and temperature settings. LLMs aim to improve user interactions but raise skepticism about delivering factual information.
AI Scaling Myths
The article challenges myths about scaling AI models, emphasizing limitations in data availability and cost. It discusses shifts towards smaller, efficient models and warns against overestimating scaling's role in advancing AGI.
Google Researchers Publish Paper About How AI Is Ruining the Internet
Google researchers warn about generative AI's negative impact on the internet, creating fake content blurring authenticity. Misuse includes manipulating human likeness, falsifying evidence, and influencing public opinion for profit. AI integration raises concerns.
Overcoming the Limits of Large Language Models
Large language models (LLMs) like chatbots face challenges such as hallucinations, lack of confidence estimates, and citations. MIT researchers suggest strategies like curated training data and diverse worldviews to enhance LLM performance.
The key word there is "indiscriminate". All of the big AI labs have been training on synthetic data for at least a year at this point, but they're doing so deliberately.
I don't think the "model collapse" problem is particularly important these days. The people training models seem to have that well under control.
More broadly, this is a reflection of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The issue is that any model's purpose is to capture novel, useful data about real human behavior. Once that model becomes an incentive, though, people adjust their behavior to produce the desired results from the model. Authentic behavior disappears, which means there's no useful information content for the model to capture, and future generations of the model instead just reproduce behaviors of the previous generation they were trained on, including quirks. Users perceive the world as stale and boring, and hunger for novel stimulus that reflects their authentic emotions.
You could look at this as a full-employment theorem for entrepreneurs and artists.
https://openai.com/index/prover-verifier-games-improve-legib...
https://towardsdatascience.com/addressing-concerns-of-model-...
In this case, of course, there are multiple LLMs creating text that finds its way to the web, but to the extent that the outputs of the different LLMs share commonalities, this still seems problematic.
And afaik, there are no metrics or algorithms that reliably distinguish between human-generated and LLM-generated text, at least not for the current generations of LLMs.
What am I missing?
I copied that into a Gist to make it easier to browse here: https://gist.github.com/simonw/b3ab1588a681dda821da9fb57290d...
Publishing in Nature can actually be a red flag for ML work, because its reviewers are really not well equipped to evaluate a lot of the claims.
The latest Llama model got a lot of its data using labels from Llama 2, and every frontier lab is talking about self-training as the future.
1. this is nothing that should surprise anyone who has an intuition for control theory and the evolution of unconstrained Markov chains (see the sketch after this list)
2. there appear to be relatively easy mitigations https://news.ycombinator.com/item?id=41061085 (made a separate post because it might be of independent interest to discuss)
3. you still won't get beyond the imitation game boundary without exploration & feedback, i.e. the recursive-improvement doomers are, as of now, still wrong
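A rough sketch of the intuition in (1), using toy assumptions rather than anything from the paper: treat each training generation as one step of a Markov chain over categorical distributions. Re-estimating the distribution from a finite sample of the previous estimate is an unconstrained chain whose absorbing states are the degenerate distributions, so rare categories drift to zero and can never come back.

    import numpy as np

    rng = np.random.default_rng(1)
    k, n = 20, 100                      # 20 categories, 100 samples per generation
    p = np.full(k, 1.0 / k)             # generation 0: the uniform "true" distribution

    for generation in range(1, 501):
        counts = rng.multinomial(n, p)  # sample from the previous generation's model
        p = counts / n                  # refit; a category that hits zero stays zero
        if generation % 100 == 0:
            print(f"gen {generation:3d}: surviving categories = {(p > 0).sum()}")

The mitigations mentioned in (2) can be read as constraining this chain, for example by keeping some of the original human data in every generation's training mix.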
"Given that training a single moderately large model produces twice the American lifetime’s worth of CO2 (ref. 15), we opted to not run such an experiment and instead focus on a more realistic setting for a proof of concept."
However, could it be that texts generated by AI models possess some kind of statistical property that causes training to collapse? If so, could that property be used to detect AI-generated text?
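A toy example of the kind of statistic one might compute here (purely illustrative; as the earlier comment notes, nothing like this is known to be a reliable detector): compare the word-frequency entropy of two samples. Text sampled at low temperature tends to be more repetitive, which shows up as lower entropy, but the signal is weak and easy to defeat.

    import math
    from collections import Counter

    def word_entropy(text: str) -> float:
        """Shannon entropy (in bits) of the word-frequency distribution."""
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    sample_a = "the cat sat on the mat and then the dog chased it around the garden"
    sample_b = "the cat sat on the mat the cat sat on the mat the cat sat on the mat"
    print(word_entropy(sample_a), word_entropy(sample_b))  # more varied text scores higher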
I can learn from Pythagoras' work, extend it, combine it, apply it, and produce works that are more valuable than the original. Perhaps that gets recognized as important, and others then take that, learn, and repeat the process, adding their own experience, increasing the general intelligence.
Prior generations learned this by copying VHS tapes over and over and making photocopies of photocopies. You can see it today by opening and saving a JPG over and over again.
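A small illustration of that generational-loss analogy (assumes Pillow is installed and that input.jpg is any JPEG you have on hand; both are stand-ins for the example): decode and re-encode the image repeatedly, jittering the quality a little the way different tools in a real copy chain would, so the compression artefacts of each generation become part of the next generation's input.

    from PIL import Image

    img = Image.open("input.jpg").convert("RGB")
    for generation in range(100):
        quality = 70 + (generation % 7)        # emulate a chain of different encoders
        img.save("generation.jpg", quality=quality)
        img = Image.open("generation.jpg").convert("RGB")
    img.save("after_100_generations.jpg", quality=95)  # inspect the accumulated damage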
The problem of 'model collapse': how a lack of human data limits AI progress - https://news.ycombinator.com/item?id=41058867 - July 2024 (6 comments)
AlphaZero used a similar approach, where it trained against itself, and that only made it better. I don't think collapse is real.
Slop in --> yikes
I wrote about this over a year ago. Don't build a city on rock and roll. Don't build a business on a fractal.
The "tragedy of the commons" is another one of those parts of standard economic theory that never actually played out in reality - we've got examples from all over the world of communities implementing practices and often entire belief systems that led them to be responsible stewards of shared resources without requiring unilateral ownership of that resource and singular acquisition of the benefits of that stewardship, and yet first on the lips of every modern capitalist when describing why they're at a disadvantage if they're not the ones polluting the water supply is the tragedy of the commons.
Anyone willing to weigh in with a theoretical intuition? The one in the paper is just a little inaccessible to me right now.