"Model collapse" threatens to kill progress on generative AIs
Generative AI faces a challenge called "model collapse," where training on synthetic data leads to nonsensical outputs. Researchers emphasize the need for high-quality training data to prevent biases and inaccuracies.
Generative AI has gained significant attention since the release of OpenAI's ChatGPT in 2022, but a new challenge known as "model collapse" threatens its advancement. The phenomenon occurs when generative AIs, which rely on vast amounts of training data, begin to produce nonsensical outputs after being trained on synthetic data generated by other AIs. Researchers have demonstrated that a model repeatedly trained on its own outputs accumulates errors, and the quality decline compounds with each round: outputs grow increasingly disconnected from reality, akin to repeatedly scanning and printing a flawed image.

The issue is exacerbated by the growing prevalence of AI-generated content on the internet, which may inadvertently be incorporated into training datasets. Experts warn that even a small proportion of synthetic data can skew a model's outputs, introducing biases and inaccuracies. As developers seek solutions, they face the challenge of ensuring high-quality training data while navigating the complexities of AI-generated content. Potential strategies include advanced detection tools and human evaluation of synthetic data, but neither scales easily. The urgency is heightened by the rapid proliferation of AI-generated material online.
- "Model collapse" occurs when generative AIs produce nonsensical outputs after training on synthetic data.
- Recursive training on flawed outputs leads to a decline in quality and accuracy.
- The prevalence of AI-generated content on the internet complicates the training data landscape.
- Even small amounts of synthetic data can introduce biases in AI outputs.
- Developers are exploring solutions like detection tools and human evaluation to maintain data quality.
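To make the recursive-degradation mechanism concrete, here is a minimal toy sketch (not from the article, and a deliberate caricature): the "model" is just a Gaussian fitted to its training data, and each generation is trained only on samples drawn from the previous generation's fit. Because every fit carries estimation noise, the errors compound across generations, and the estimated distribution drifts away from the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" human data, drawn from a standard normal.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for gen in range(1, 21):
    # "Train" a model: estimate the distribution from the current data.
    mu, sigma = data.mean(), data.std()
    print(f"gen {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation trains only on samples from the fitted model,
    # i.e. synthetic data; estimation noise compounds each round.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

Real generative models are vastly more complex, but the Nature study cited below reports the same qualitative effect, with the low-probability "tails" of the original data distribution disappearing first.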
Related
AI models collapse when trained on recursively generated data
Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.
The problem of 'model collapse': how a lack of human data limits AI progress
Research shows that using synthetic data for AI training can lead to significant risks, including model collapse and nonsensical outputs, highlighting the importance of diverse training data for accuracy.
'Model collapse'? An expert explains the rumours about an impending AI doom
Model collapse in AI refers to reduced effectiveness from reliance on AI-generated data. Concerns include diminished quality and diversity of outputs, prompting calls for better regulation and competition in the sector.
When A.I.'s Output Is a Threat to A.I. Itself
A.I. systems face quality degradation from training on their own outputs, risking "model collapse." Ensuring diverse, high-quality real-world data is essential to maintain effectiveness and reliability in A.I. applications.
Is AI Killing Itself–and the Internet?
Recent research reveals "model collapse" in generative AI, where reliance on AI-generated content degrades output quality. With 57% of web text AI-generated, concerns grow about disinformation and content integrity.
The article only covers results based on current models, though. I'd hope there will be different kinds of models that produce more accurate results with less data, and that we optimize in that direction instead. For instance, all the information on how to write code is available, yet training a current model on all of it does not yield a model that can program everything. There are different types of information involved, as well as a different kind of 'inspiration' or 'creativity' a model would need in order to use the training data optimally.
That being said, I know next to nothing about how these things are built or where the research is going now. It just seems this article is overly focused on LLMs as the ultimate thing, with more data as the only way to improve generative AI. I don't think that's true. We just need to invent new approaches rather than trying to scale up the old ones.
Same deal with AI: pre-AI content will have more value.