July 25th, 2024

AI trained on AI garbage spits out AI garbage

Research from the University of Oxford reveals that AI models risk degradation due to "model collapse," where reliance on AI-generated content leads to incoherent outputs and declining performance.


As AI-generated content proliferates online, the quality of AI models trained on this data is at risk of degradation, according to new research published in Nature. The study, led by Ilia Shumailov from the University of Oxford, highlights a phenomenon known as "model collapse," where models produce incoherent outputs due to training on data generated by other AI models. This process is likened to repeatedly photographing a photograph, leading to a loss of clarity and detail. The research indicates that as AI models increasingly rely on junk content from the internet, their performance may decline, with slower improvements and higher perplexity scores indicating less accurate predictions.
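Perplexity, the metric the researchers tracked, is simply the exponential of the average per-token negative log-probability a model assigns to text; higher values mean the model is more "surprised" by what it reads. A minimal sketch (not code from the study) makes the definition concrete:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability per token.

    Lower is better: the model is less surprised by the text it sees.
    """
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n
    return math.exp(avg_nll)

# A model assigning probability 0.5 to every token is exactly as
# uncertain as a fair coin flip, so its perplexity is 2.
print(round(perplexity([math.log(0.5)] * 4), 6))  # → 2.0
```

A perplexity of 1 would mean perfect prediction; rising perplexity across generations is how the study quantified the loss of predictive accuracy.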

The study involved fine-tuning a large language model (LLM) on Wikipedia data and then on its own outputs over nine generations, revealing that the quality of generated text deteriorated significantly. Experts emphasize the importance of high-quality, diverse training data, noting that reliance on synthetic data could exacerbate issues, particularly for underrepresented languages. To mitigate degradation, one suggestion is to ensure models retain access to original human-generated data. However, distinguishing between human and AI-generated content remains a challenge. The findings underscore the need for careful consideration of data provenance to maintain the integrity and reliability of AI models in the face of increasing AI-generated noise on the internet.
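The generational setup can be mimicked with a toy statistical analogue (an illustration only, not the paper's actual experiment): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Estimation error compounds, and the fitted distribution tends to lose its tails, echoing the photograph-of-a-photograph effect. The parameters here are deliberately exaggerated (5 samples, 200 generations) so the drift is visible:

```python
import random
import statistics

def collapse_demo(generations=200, n_samples=5, seed=0):
    """Toy model collapse: each generation fits a normal distribution to
    samples drawn from the previous generation's fit, then becomes the
    new 'training data' source. Sampling error compounds, and the fitted
    spread shrinks, discarding the tails of the original distribution.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original "human" data
    history = [(mu, sigma)]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history

history = collapse_demo()
print("generation 0 spread:", history[0][1])
print("final spread:", history[-1][1])  # much smaller than 1.0
```

The shrinking spread is the statistical analogue of the study's finding: rare, tail-of-the-distribution content disappears first, which is why the experts quoted above worry especially about underrepresented languages.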

Related

Google Researchers Publish Paper About How AI Is Ruining the Internet

Google researchers warn about generative AI's negative impact on the internet, creating fake content blurring authenticity. Misuse includes manipulating human likeness, falsifying evidence, and influencing public opinion for profit. AI integration raises concerns.

The Data That Powers A.I. Is Disappearing Fast

A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and companies like OpenAI, Google, and Meta. Challenges prompt exploration of new data access tools and alternative training methods.

AI models collapse when trained on recursively generated data

Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.

The problem of 'model collapse': how a lack of human data limits AI progress

Research shows that using synthetic data for AI training can lead to significant risks, including model collapse and nonsensical outputs, highlighting the importance of diverse training data for accuracy.

Google Researchers Publish Paper About How AI Is Ruining the Internet

Google researchers warn that generative AI contributes to the spread of fake content, complicating the distinction between truth and deception, and potentially undermining public understanding and accountability in digital information.

5 comments
By @gnabgib - 4 months
Discussion (248 points, 1 day ago, 175 comments) https://news.ycombinator.com/item?id=41058194
By @bdjsiqoocwk - 4 months
I would like to hear what kind of mental model of the world you have if you thought otherwise.
By @pm2222 - 4 months
Well what else could it be? After all, AI is just polynomial approximation in essence.