'Model collapse'? An expert explains the rumours about an impending AI doom
Model collapse in AI refers to reduced effectiveness from reliance on AI-generated data. Concerns include diminished quality and diversity of outputs, prompting calls for better regulation and competition in the sector.
Model collapse is a term gaining traction in discussions about artificial intelligence (AI), referring to a scenario in which AI systems become less effective because they are trained on too much AI-generated data. As generative AI systems proliferate, they depend on high-quality human data to learn and improve, but the rise of AI-generated content raises the concern that future models will be trained predominantly on synthetic data, degrading the quality and diversity of their outputs. This phenomenon, likened to digital inbreeding, could produce AI systems that are less helpful, honest, and diverse. Tech companies are attempting to filter out AI-generated content, but the challenge is significant, since distinguishing human from AI content is becoming increasingly difficult. Although some experts warn of an impending catastrophe, the likelihood of complete model collapse may be overstated, as human and AI data are expected to coexist, and the future may hold a variety of generative AI platforms rather than a single dominant model, which would mitigate the risk. Nonetheless, the proliferation of AI content poses risks to the digital landscape, including reduced human interaction and cultural homogenization. To address these challenges, there is a call for better regulation, competition in the AI sector, and research into the social implications of AI.
- Model collapse refers to AI systems becoming less effective due to reliance on AI-generated data.
- High-quality human data is essential for training effective AI models.
- The challenge of filtering AI-generated content is increasing as it becomes harder to distinguish from human content.
- The risk of model collapse may be overstated, with a future of diverse AI platforms likely.
- Proliferation of AI content threatens human interaction and cultural diversity online.
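The "digital inbreeding" dynamic is easy to reproduce in miniature. The toy simulation below is a sketch, not any lab's actual experiment: each "generation" fits a Gaussian to samples drawn from the previous generation's fitted model, and finite-sample estimation error compounds until the distribution's diversity collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gaussian(samples):
    # "Train" a model: estimate mean and std (MLE) from the data.
    return samples.mean(), samples.std()

# Generation 0: "human" data drawn from the true distribution N(0, 1).
mu, sigma = fit_gaussian(rng.normal(0.0, 1.0, size=50))

for gen in range(1, 101):
    # Each generation trains only on the previous generation's output.
    synthetic = rng.normal(mu, sigma, size=50)
    mu, sigma = fit_gaussian(synthetic)
    if gen % 10 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}  std={sigma:.3f}")

# The expected variance shrinks by a factor of (n-1)/n per generation,
# so the std drifts toward 0: rare "tail" content disappears first,
# which is the statistical heart of the loss of diversity described above.
```

This Gaussian toy captures only the variance-loss mechanism; the Nature paper listed under Related reports the analogous degradation in actual language models.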
Related
AI models collapse when trained on recursively generated data
Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.
The problem of 'model collapse': how a lack of human data limits AI progress
Research shows that using synthetic data for AI training can lead to significant risks, including model collapse and nonsensical outputs, highlighting the importance of diverse training data for accuracy.
Google Researchers Publish Paper About How AI Is Ruining the Internet
Google researchers warn that generative AI contributes to the spread of fake content, complicating the distinction between truth and deception, and potentially undermining public understanding and accountability in digital information.
AI trained on AI garbage spits out AI garbage
Research from the University of Oxford reveals that AI models risk degradation due to "model collapse," where reliance on AI-generated content leads to incoherent outputs and declining performance.
There's Just One Problem: AI Isn't Intelligent
AI mimics human intelligence without true understanding, posing systemic risks and undermining critical thinking. Economic benefits may lead to job quality reduction and increased inequality, failing to address global challenges.
I don't get why this is bad news. It means that, most likely, these questions would have been duplicates and/or very easy, the kind ChatGPT could answer.
My issue with SO is on the opposite side of the spectrum. By the time I am ready to post a question after having done my homework, I know it is either an unfixable bug or out of reach for most if not all users, and the probability of getting an answer is very low.
Most of the time I then answer my own questions a few months later.
Steel manufactured before 1945[0] can be incredibly valuable[1] because it hasn't been tainted by fallout from nuclear tests and bombs, and perhaps fairly soon any archived internet material written pre-2010 will be considered equally valuable.
[0] https://en.wikipedia.org/wiki/Low-background_steel
[1] https://interestingengineering.com/science/what-is-pre-war-s...
Since, I assume, humans don't need this much training, this area seems ripe to explore: can you achieve similar results with a fraction of the data GPT-3 needed?
The answers soon converge to some boring, bland repetition that isn't even logical.
No intelligence here, which means that code-emitting LLMs are just stealing human IP that they happened to have read.
This is why the scientific method was developed.
If you want an AI that truly learns, then it has to get up off the sofa and test its ideas in the real world, rather than learning to parrot hearsay.
In the oral phase of human communication "AI models" were residing within human brains: the origin of any stream of messages was literally in front of you, in flesh and blood.
In the print phase we have the first major decoupling. Whether in the form of a cuneiform tablet or a modern paperback, the provenance was no longer assured. We had to rely on "controlled" seals, trust the publishers and their distribution chains, etc. Ultimately this worked because the relative difficulty of producing printed artifacts helped develop a legal/political apparatus to control the spread of the "fake" stuff.
Enter the digital communications era and we have the second major decoupling. The effort of generating human-oriented messaging (and, increasingly, of making it appear truthful) is no longer a limiting factor. This has no precedent. You can now plausibly create a fake Wikipedia [1] just by running a model. The signal-to-noise ratio of digitally exchanged messages can experience catastrophic collapse.
> A flood of synthetic content might not pose an existential threat to the progress of AI development, but it does threaten the digital public good of the (human) internet.
Indeed, this is the real risk. I don't care about idiot "AI" feeding on itself but I do care about destroying any basis of sane digital communication between humans.
Will we develop the legal/political apparatus to control the flood of algorithmic junk? Will it be remotely democratic? The best precedent for digital automation destroying communication channels, and for our apparent inability and/or unwillingness to do anything about it, is email spam [2]. Decades after it first appeared, spam is still degrading our infosphere. The writing is on the wall.
[1] https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wi...
I've posted about it here: https://news.ycombinator.com/item?id=41125309
I did note how the author said that training models on other models doesn't have the ethical implications of training on stolen human data - except it does, because where did the first model get its training set from? This is why we make it illegal not just to steal but also to handle stolen goods.
I mean, apart from wow! 98% of the internet is shit (that's higher than even I assumed) - how did they differentiate between good and shit? PageRank? Length? I can tell the difference between good writing and bad writing (and porn and erotica), but I have to read it using my brain-based LLM - how did they do it?
This leads to the whole open-source "show us your training data" debate.
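For what it's worth, the published data-curation pipelines we know about (C4, Gopher, and similar) lean on cheap surface heuristics plus learned quality classifiers rather than anyone actually reading the text. Below is a minimal sketch of the heuristic half; the rules are in the spirit of those papers, but every threshold is an illustrative guess, not a value taken from any specific pipeline.

```python
import re

def looks_like_quality_text(doc: str) -> bool:
    # Heuristic filter in the spirit of C4/Gopher-style rules; the
    # thresholds below are illustrative guesses, not any paper's values.
    words = doc.split()
    if len(words) < 50:                        # too short to judge
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:                # gibberish or token soup
        return False
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if lines:
        dup_fraction = 1 - len(set(lines)) / len(lines)
        if dup_fraction > 0.3:                 # boilerplate repetition
            return False
    symbol_ratio = len(re.findall(r"[#{}<>|\\]", doc)) / len(words)
    if symbol_ratio > 0.1:                     # markup/code debris
        return False
    return True

spam = "\n".join(["CLICK HERE TO WIN"] * 20)
print(looks_like_quality_text(spam))   # False: near-total line repetition

prose = ("The report describes how researchers filtered web text before "
         "training, relying on simple surface statistics rather than any "
         "deep notion of quality. " * 3)
print(looks_like_quality_text(prose))  # True: passes the cheap checks
```

Real pipelines stack dozens of such rules before any model-based scoring, which is precisely why "show us your training data" matters: the filters encode editorial judgments about what counts as good writing.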