September 9th, 2024

"Model collapse" threatens to kill progress on generative AIs

Generative AI faces a challenge called "model collapse," where training on synthetic data leads to nonsensical outputs. Researchers emphasize the need for high-quality training data to prevent biases and inaccuracies.

Generative AI has gained significant attention since the release of OpenAI's ChatGPT in 2022, but a new challenge known as "model collapse" threatens its advancement. This phenomenon occurs when generative AIs, which rely on vast amounts of training data, begin to produce nonsensical outputs after being trained on synthetic data generated by other AIs. Researchers have demonstrated that when a model is repeatedly trained on its own outputs, it accumulates errors, leading to a decline in quality. This recursive training process can result in outputs that are increasingly disconnected from reality, akin to repeatedly scanning and printing a flawed image.

The issue is exacerbated by the growing prevalence of AI-generated content on the internet, which may inadvertently be incorporated into training datasets. Experts are concerned that even a small amount of synthetic data can skew the model's outputs, leading to biases and inaccuracies.

As developers seek solutions, they face the challenge of ensuring high-quality training data while navigating the complexities of AI-generated content. Potential strategies include implementing advanced detection tools and human evaluation of synthetic data, but scalability remains a concern. The urgency to address these issues is heightened by the rapid proliferation of AI-generated material online.
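
The recursive-training failure mode is easy to reproduce with a toy model. The sketch below is a minimal illustration of the idea, not the researchers' actual experiment: it fits a one-dimensional Gaussian to data, samples a fully synthetic dataset from the fit, refits on that, and repeats. Because each generation's estimates carry finite-sample error, the fitted distribution tends to drift and narrow over successive generations.

```python
# Toy sketch of "model collapse": repeatedly fit a simple model (a Gaussian)
# to samples drawn from the previous generation's fit. This is an assumed,
# simplified analogue of recursive training on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(20):
    # "Train" the model: estimate the mean and standard deviation of the data.
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation:2d}: mu={mu:+.3f} sigma={sigma:.3f}")

    # Replace the training set with purely synthetic samples from the fit.
    # Estimation errors compound across generations, so the fitted
    # distribution drifts away from the original and loses spread.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

Running this prints the fitted parameters generation by generation; the gradual loss of variance is the toy analogue of a generative model forgetting the tails of the real data distribution.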

- "Model collapse" occurs when generative AIs produce nonsensical outputs after training on synthetic data.

- Recursive training on flawed outputs leads to a decline in quality and accuracy.

- The prevalence of AI-generated content on the internet complicates the training data landscape.

- Even small amounts of synthetic data can introduce biases in AI outputs.

- Developers are exploring solutions like detection tools and human evaluation to maintain data quality.
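
One way to read the "detection tools" strategy is as a filtering pass over the training corpus before each training run. The sketch below is a hypothetical outline only: `detect_synthetic_score` is a stand-in for whatever detector a team actually has (a classifier, watermark check, or provenance metadata), and no specific tool is implied by the article.

```python
# Hypothetical pre-training filter: drop documents a detector flags as
# likely AI-generated. The detector itself is assumed, not a real library.
from typing import Callable, Iterable


def filter_corpus(
    documents: Iterable[str],
    detect_synthetic_score: Callable[[str], float],
    threshold: float = 0.5,
) -> list[str]:
    """Keep only documents scored below the suspected-synthetic threshold."""
    return [doc for doc in documents if detect_synthetic_score(doc) < threshold]
```

Human evaluation would slot in after a pass like this, spot-checking the kept documents, which is exactly where the scalability concern the article raises shows up.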

2 comments
By @sim7c00 - 7 months
I feel this is kind of obvious. I'd think research would go into exploring models other than LLMs, since LLMs have these obvious drawbacks. Another is the sheer computing power needed to train them, even if you had infinite good-quality data. There is not enough power to train these things; it's hit a wall.

The article only goes into results based on current models, though. I'd hope there will be different kinds of models that might produce more accurate results with less data, with research optimizing in that direction instead. For instance, all the information on how to write code is available, yet training a current model on all of it does not yield a model that can program all things. There are different types of information involved, as well as a different type of 'inspiration' or 'creativity' that a model might possess in order to use the training data optimally.

That being said, I know next to nothing about how these things are built or where research is going now. It just seems this article is overly focused on LLMs being the ultimate thing, and on having more data as the only option for improving generative AI. I don't think that's true. We just need to invent new approaches rather than trying to scale up the old ones.

By @thebruce87m - 7 months
Ever heard of pre-war steel? https://en.m.wikipedia.org/wiki/Low-background_steel

Same deal with AI: pre-AI content will have more value.