July 24th, 2024

The problem of 'model collapse': how a lack of human data limits AI progress

Research shows that training AI on synthetic data carries significant risks, including model collapse and nonsensical outputs, underscoring the importance of diverse human-generated training data for accuracy.

Research indicates that the reliance on synthetic data to train artificial intelligence (AI) models poses significant risks, potentially leading to nonsensical outcomes. Major AI companies, including OpenAI and Microsoft, have explored using synthetic data as they exhaust available human-generated data. A study published in Nature highlights that synthetic data can cause rapid degradation of AI models, with one experiment showing a model's output devolving into irrelevant topics after just a few generations of training. The study emphasizes that the accumulation of errors in successive training cycles can lead to a phenomenon known as "model collapse," where the model loses its utility due to overwhelming inaccuracies.
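To make the mechanism concrete, here is a minimal, illustrative sketch in Python — not the Nature paper's actual experiment, just a toy analogue. A "model" that merely estimates a Gaussian's mean and spread is repeatedly retrained on samples drawn from its own previous fit; over generations the estimated spread collapses toward zero, mirroring the error accumulation the study describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a standard normal distribution.
# A tiny sample size is used so the degradation becomes visible quickly.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 101):
    # "Train" a toy model: estimate the mean and spread of the current data.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on the previous model's own samples.
    data = rng.normal(loc=mu, scale=sigma, size=20)
    if generation % 10 == 0:
        print(f"gen {generation:3d}: mean={mu:+.3f} std={sigma:.3f}")

# The std shrinks toward zero across generations: the tails of the original
# distribution are lost first, a toy analogue of the "loss of variance"
# the study identifies as the early stage of model collapse.
```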

The research found that early signs of collapse involve a loss of variance, resulting in overrepresentation of majority subpopulations while minority groups are neglected. As the collapse progresses, the output may become increasingly nonsensical. The study also noted that models trained on their own outputs tend to produce repetitive phrases and distorted representations of data. To mitigate these issues, some companies are implementing techniques like watermarking AI-generated content to prevent it from being used in training datasets. However, effective coordination among tech companies remains a challenge. The findings suggest that companies that have sourced diverse training data from the pre-AI internet may have a competitive advantage in developing more accurate generative AI models.
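In principle, watermark-based curation amounts to filtering flagged documents out of a training corpus, as in the hedged sketch below. The detector `is_ai_watermarked` is a hypothetical stand-in — no universal, cross-vendor detector exists today, which is exactly why the coordination problem matters.

```python
from typing import Callable, Iterable

def filter_training_corpus(
    documents: Iterable[str],
    is_ai_watermarked: Callable[[str], bool],
) -> list[str]:
    # Keep only documents the watermark detector does not flag as AI-made.
    return [doc for doc in documents if not is_ai_watermarked(doc)]

# Toy usage with a fake detector that looks for an explicit marker string;
# a real detector would run a statistical test over token choices instead.
corpus = ["a human-written essay ...", "[AI-WATERMARK] generated text ..."]
clean = filter_training_corpus(corpus, lambda d: d.startswith("[AI-WATERMARK]"))
print(clean)  # -> ['a human-written essay ...']
```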

Related

Google Researchers Publish Paper About How AI Is Ruining the Internet

Google researchers warn that generative AI is harming the internet by flooding it with fake content that blurs the line between authentic and synthetic material. Documented misuse includes manipulating human likeness, falsifying evidence, and swaying public opinion for profit. The paper raises broader concerns about AI integration.

Synthetic User Research Is a Terrible Idea

Matthew Smith criticizes synthetic user research for the bias inherent in AI outputs: using AI stand-ins can produce over-generalized results and fabricated specifics, obscuring the insights that make software user research valuable. Smith advocates controlled, unbiased research methods instead.

The Data That Powers A.I. Is Disappearing Fast

A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and companies like OpenAI, Google, and Meta. Challenges prompt exploration of new data access tools and alternative training methods.

NYT: The Data That Powers AI Is Disappearing Fast

A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and researchers. Companies explore partnerships and new tools amid data challenges.

AI models collapse when trained on recursively generated data

Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.

10 comments
By @joe_the_user - 3 months
Even if the models don't collapse, it seems intuitively obvious that you can't get more "knowledge" out of synthetically generated data than went into its generation.
By @dspillett - 3 months
We aren't getting the results predicted by the Dead Internet Theory, we are instead getting the Hapsburg Internet.
By @mort96 - 3 months
Archive without paywall: https://archive.is/xh2fn
By @ThrowawayTestr - 3 months
Why don't these companies spend a bunch of money digitizing old texts? There's bound to be tons of good training data that isn't online.
By @miohtama - 3 months
A summary of what will happen:

> The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish.

By @nanomonkey - 3 months
If that is the case, why don't AI models have built-in feedback loops, where the humans that use them continuously work with the model and provide further human data? The edge models can then merge their modifications into a public model, or remain islands, refined around a particular human's needs.
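One way to read this proposal is as federated averaging. The following is a speculative sketch under that assumption — the function name `merge_edge_models` and the dict-of-arrays weight format are invented for illustration, not any existing API.

```python
import numpy as np

def merge_edge_models(public, edge_models, alpha=0.1):
    """Blend averaged edge-model weights into a shared public model.

    `public` and each entry of `edge_models` map parameter names to
    weight arrays of matching shape; `alpha` sets how far the public
    model moves toward the edge average (alpha=0 keeps it unchanged).
    """
    merged = {}
    for name, weights in public.items():
        edge_avg = np.mean([m[name] for m in edge_models], axis=0)
        merged[name] = (1 - alpha) * weights + alpha * edge_avg
    return merged

# Toy usage: two edge copies that diverged symmetrically from the public model.
public = {"w": np.array([1.0, 1.0])}
edges = [{"w": np.array([1.2, 0.8])}, {"w": np.array([0.8, 1.2])}]
print(merge_edge_models(public, edges))  # {'w': array([1., 1.])}
```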
By @dang - 3 months
Related ongoing thread:

AI models collapse when trained on recursively generated data - https://news.ycombinator.com/item?id=41058194 - July 2024 (64 comments)