August 20th, 2024

'Model collapse'? An expert explains the rumours about an impending AI doom

Model collapse in AI refers to reduced effectiveness from reliance on AI-generated data. Concerns include diminished quality and diversity of outputs, prompting calls for better regulation and competition in the sector.

Model collapse is a term gaining traction in discussions about artificial intelligence (AI), referring to a potential scenario where AI systems become less effective due to an over-reliance on AI-generated data for training. As generative AI systems proliferate, they increasingly depend on high-quality human data to learn and improve. However, the rise of AI-generated content raises concerns that future models may be trained predominantly on this synthetic data, leading to a decline in quality and diversity of AI outputs. This phenomenon, likened to digital inbreeding, could result in AI systems that are less helpful, honest, and diverse.

While tech companies are attempting to filter out AI-generated content, the challenge is significant, as distinguishing between human and AI content becomes increasingly difficult. Although some experts warn of an impending catastrophe, the likelihood of complete model collapse may be overstated, as human and AI data are expected to coexist. The future may see a variety of generative AI platforms rather than a single dominant model, which could mitigate risks.

Nonetheless, the proliferation of AI content poses risks to the digital landscape, including reduced human interaction and cultural homogenization. To address these challenges, there is a call for better regulation, competition in the AI sector, and research into the social implications of AI.

- Model collapse refers to AI systems becoming less effective due to reliance on AI-generated data.

- High-quality human data is essential for training effective AI models.

- The challenge of filtering AI-generated content is increasing as it becomes harder to distinguish from human content.

- The risk of model collapse may be overstated, with a future of diverse AI platforms likely.

- Proliferation of AI content threatens human interaction and cultural diversity online.

12 comments
By @KolmogorovComp - about 2 months
> For instance, researchers found a 16% drop in activity on the coding website StackOverflow one year after the release of ChatGPT. This suggests AI assistance may already be reducing person-to-person interactions in some online communities.

I don't get why this is bad news. It most likely means these questions would have been duplicates and/or very easy, since ChatGPT could answer them.

My issue with SO is on the opposite end of the spectrum. By the time I am ready to post a question after having done my homework, I know it is either an unfixable bug or out of reach for most, if not all, users, and the probability of getting an answer is very low.

Most of the time I then answer my own questions a few months later.

By @ColinWright - about 2 months
People flooding the 'net with LLM-generated crap are both eating their seed corn and poisoning the well.

Steel manufactured before 1945[0] can be incredibly valuable[1] because it hasn't been tainted by fallout from nuclear tests and bombs, and maybe fairly soon any archived internet material written pre-2010 will be considered equally valuable.

[0] https://en.wikipedia.org/wiki/Low-background_steel

[1] https://interestingengineering.com/science/what-is-pre-war-s...

By @JKCalhoun - about 2 months
> To train GPT-3, OpenAI needed over 650 billion English words of text – about 200x more than the entire English Wikipedia.

Since, I assume, humans don't need this much training (?), this area seems ripe to explore: can you achieve similar training with a fraction of the data needed for GPT-3?

By @antklan - about 2 months
This can be replicated on a small scale by using two LLMs (which can be two instances of the same LLM). Start with a human prompt, then feed the answer of LLM-1 as the prompt to LLM-2, then feed that answer of LLM-2 to LLM-1 and so forth.

The answers soon converge to some boring, bland repetition that isn't even logical.
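
A minimal sketch of that feedback loop, assuming the `openai` Python client (v1.x) and a single chat model standing in for both LLM-1 and LLM-2; the model name and seed prompt are illustrative:

```python
# Sketch of the ping-pong setup described above: one human seed prompt,
# then each answer becomes the next prompt. Two instances of the same
# model alternate, so a single helper suffices. Assumes the `openai`
# package and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # stands in for both LLM-1 and LLM-2

def reply(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

message = "Explain why the sky is blue."  # the only human-written input
for turn in range(10):
    message = reply(message)  # LLM-1 and LLM-2 take alternating turns
    print(f"--- turn {turn + 1} ---\n{message[:300]}\n")
```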

No intelligence here, which means that code-emitting LLMs are just stealing human IP that they happened to have read.

By @DrScientist - about 2 months
In the human space there is also 'model collapse', as demonstrated by groupthink, or by the dark arts of marketing/PR used to sway opinion at large.

This is why the scientific method was developed.

If you want an AI that truly learns then it has to get up off the sofa and test its ideas in the real world, rather than learning to parrot hearsay.

By @openrisk - about 2 months
We can speculate all day long how the current "AI" phenomenon will evolve but, alas, there isn't much solid on which to ground arguments.

In the oral phase of human communication "AI models" were residing within human brains: the origin of any stream of messages was literally in front of you, in flesh and blood.

In the print phase we have the first major decoupling. Whether in the form of a cuneiform tablet or a modern paperback, the provenance was no longer assured. We had to rely on "controlled" seals, trust the publishers and their distribution chains, etc. Ultimately this worked because the relative difficulty of producing printed artifacts helped develop a legal/political apparatus to control the spread of the "fake" stuff.

Enter the digital communications era and we have the second major decoupling. The amount (and, increasingly, the apparent veracity) of generated human-oriented messaging is no longer a limiting factor. This has no precedent. You can now plausibly create a fake Wikipedia [1] just by running a model. The signal-to-noise ratio of digitally exchanged messages can collapse catastrophically.

> A flood of synthetic content might not pose an existential threat to the progress of AI development, but it does threaten the digital public good of the (human) internet.

Indeed, this is the real risk. I don't care about idiot "AI" feeding on itself but I do care about destroying any basis of sane digital communication between humans.

Will we develop the legal/political apparatus to control the flood of algorithmic junk? Will it be remotely democratic? The best precedent of digital automation destroying communication channels and our apparent inability and/or unwillingness to do something about it is email spam [2]. Decades after it first appeared spam is still degrading our infosphere. The writing is on the wall.

[1] https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wi...

[2] https://en.wikipedia.org/wiki/Email_spam

By @botro - about 2 months
In the model testing I've conducted, I've seen that LLMs from competing companies including GPT-4o, Gemini Flash 1.5, Llama 3.1 and Phi-3 all converge on the exact same joke. For a test of creativity this was alarming. They all tell slight variations of the same joke about ladders.

I've posted about it here: https://news.ycombinator.com/item?id=41125309
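
The comparison described above is easy to script. A hedged sketch, assuming an OpenAI-compatible aggregator endpoint (OpenRouter here) so one client can reach all four model families; the base URL and model IDs are assumptions and may need adjusting:

```python
# Sketch of the comparison described above: one creative prompt, several
# models, answers printed side by side. Assumes an OpenAI-compatible
# aggregator endpoint and a key in OPENAI_API_KEY; the base URL and
# model IDs are assumptions, not verified identifiers.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1")
MODELS = [
    "openai/gpt-4o",
    "google/gemini-flash-1.5",
    "meta-llama/llama-3.1-70b-instruct",
    "microsoft/phi-3-medium-128k-instruct",
]
PROMPT = "Tell me an original joke."

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"### {model}\n{resp.choices[0].message.content}\n")
```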

By @viraptor - about 2 months
This could use a bit more nuance around training on AI output. While the naive approaches do produce worse replies, there are many documented cases where the quality improves instead. STaR, self-reflection, groups of agents, and likely other techniques I don't know about all improve the results using only the same model's output.
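
As a concrete illustration of the self-reflection idea mentioned above, a minimal sketch, assuming the `openai` Python client (v1.x) and an API key in the environment; the model name, prompts, and single revision pass are illustrative choices, not any particular published method:

```python
# Sketch of a simple self-reflection loop: draft, self-critique, revise,
# all with the same model. Assumes the `openai` package and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

task = "Write a Python function that merges two sorted lists."
draft = ask(task)
critique = ask(f"Critique this answer for bugs and omissions:\n\n{draft}")
revised = ask(
    f"Task: {task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
    "Rewrite the draft, fixing every issue raised in the critique."
)
print(revised)
```
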
By @imtringued - about 2 months
We don't need "more" data. We already have all the data we need in terms of quantity. We don't need more "synthetic" data for supervised learning. We need better training algorithms that go beyond minimizing token-level loss.
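
For readers unfamiliar with the phrase, "token-level loss" presumably refers to the standard next-token cross-entropy objective that language models are trained to minimize; a minimal PyTorch-style sketch, with tensor shapes and names chosen purely for illustration:

```python
# Sketch of the standard next-token objective: cross-entropy between the
# model's predicted distribution and the actual next token, averaged over
# every position. The tensors below are random toy values.
import torch
import torch.nn.functional as F

def token_level_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) token ids."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab_size)
        targets.reshape(-1),                  # flatten to (batch*seq_len,)
    )

logits = torch.randn(2, 8, 1000)          # stand-in for model output
targets = torch.randint(0, 1000, (2, 8))  # stand-in for the actual next tokens
print(token_level_loss(logits, targets))
```
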
By @mathw - about 2 months
Surely the only way you overcome this is to build a system which can actually generate something new, rather than regurgitating an incredibly complicated statistical reworking of the things it was trained on.

I did note that the author said training models on other models doesn't have the ethical implications of training on stolen human data - except it does, because where did the first model get its training set from? This is why we make it illegal not just to steal, but also to handle stolen goods.

By @lifeisstillgood - about 2 months
>>> 98% of collected data was rejected.

I mean, apart from wow! 98% of the internet is shit (that's higher than even I assumed) - how did they differentiate between good and shit? PageRank? Length? I can tell the difference between good writing and bad writing (and porn and erotica), but I have to read it using my brain-based LLM - how did they do it?

This leads to the whole open source, "show us your training data" thing.
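
On the "how did they do it" question above: published descriptions of web-scale corpus cleaning generally combine cheap rule-based filters with deduplication and a learned quality classifier, rather than human reading. A minimal sketch of the rule-based part, with thresholds that are illustrative guesses and not any lab's actual pipeline:

```python
# Sketch of the kind of cheap heuristic filters commonly described for
# web-corpus cleaning. Thresholds and rules are illustrative guesses;
# real pipelines typically add deduplication and a learned quality
# classifier on top of rules like these.
import re

def passes_heuristics(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                          # too short to be useful prose
        return False
    if len(set(words)) / len(words) < 0.3:       # highly repetitive boilerplate
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.6:                        # mostly markup, tables, or junk
        return False
    if re.search(r"lorem ipsum|click here to subscribe", doc, re.IGNORECASE):
        return False
    return True

raw_docs = [
    "Buy now! Buy now! Buy now!",  # toy junk page
    "A real crawled page would be a few hundred words of coherent prose ...",
]
kept = [d for d in raw_docs if passes_heuristics(d)]  # both toy strings fail the length check
print(kept)
```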