When A.I.'s Output Is a Threat to A.I. Itself
A.I. systems face quality degradation from training on their own outputs, risking "model collapse." Ensuring diverse, high-quality real-world data is essential to maintain effectiveness and reliability in A.I. applications.
As artificial intelligence (A.I.) continues to generate vast amounts of content, the risk of A.I.-produced data being used to train future A.I. systems is increasing. This creates a feedback loop in which A.I. models ingest their own outputs, leading to a decline in quality and diversity, a phenomenon termed "model collapse." Research indicates that when A.I. is repeatedly trained on its own output, it produces increasingly blurred and less varied results, as seen in experiments with handwritten digits and language models. This degradation can surface in many applications, from medical chatbots giving less accurate answers to image generators producing distorted visuals. The problem is compounded by the prevalence of A.I.-generated content on the internet, which can contaminate training datasets.

To mitigate these issues, A.I. companies must ensure their models are trained on high-quality, diverse real-world data rather than solely on synthetic outputs. Strategies include acquiring data from reliable sources and developing better detection methods for A.I.-generated content. Without intervention, the erosion of diversity and quality in A.I. outputs could have significant implications for the technology's future effectiveness and reliability.
- A.I. systems risk degrading in quality when trained on their own outputs, leading to "model collapse."
- The prevalence of A.I.-generated content on the internet complicates the training of future A.I. models.
- Ensuring A.I. is trained on high-quality, diverse real-world data is crucial to maintaining output quality.
- Companies are exploring methods like watermarking to better detect A.I.-generated content.
- The decline in diversity of A.I. outputs could amplify biases and reduce the effectiveness of A.I. applications.
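To make the feedback loop concrete, here is a minimal, purely illustrative sketch (not from the article; every number in it is an assumption): each "generation" fits a one-dimensional Gaussian "model" to samples produced by the previous generation, with no fresh real-world data ever re-entering the loop.

  import random
  import statistics

  random.seed(0)

  # Generation 0 trains on "real" data: 50 draws from a standard normal.
  data = [random.gauss(0.0, 1.0) for _ in range(50)]

  for gen in range(15):
      mu = statistics.fmean(data)     # the "model" is just (mean, std)
      sigma = statistics.stdev(data)
      print(f"gen {gen:2d}: mean={mu:+.3f}  std={sigma:.3f}")
      # The next generation trains only on the current model's output;
      # no fresh real-world data ever re-enters the loop.
      data = [random.gauss(mu, sigma) for _ in range(50)]

With only 50 samples per generation, the fitted spread drifts with every refit and the tails of the original distribution are the first thing to disappear, a toy analogue of the blurring reported in the handwritten-digit experiments.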
Related
NYT: The Data That Powers AI Is Disappearing Fast
A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and researchers. Companies explore partnerships and new tools amid data challenges.
AI models collapse when trained on recursively generated data
Recent research in Nature reveals "model collapse" in AI, where training on data from previous models leads to irreversible defects and misrepresentation of original data, emphasizing the need for genuine human-generated data.
The problem of 'model collapse': how a lack of human data limits AI progress
Research shows that using synthetic data for AI training can lead to significant risks, including model collapse and nonsensical outputs, highlighting the importance of diverse training data for accuracy.
AI trained on AI garbage spits out AI garbage
Research from the University of Oxford reveals that AI models risk degradation due to "model collapse," where reliance on AI-generated content leads to incoherent outputs and declining performance.
'Model collapse'? An expert explains the rumours about an impending AI doom
Model collapse in AI refers to reduced effectiveness from reliance on AI-generated data. Concerns include diminished quality and diversity of outputs, prompting calls for better regulation and competition in the sector.
This...is already the case. Nothing about LLM architecture is biased towards facts. If they're factual at all, it's only because most things written by people are true (to the best of their knowledge, anyway).
There are certainly attempts to constrain LLMs to be more factual, but we still have a ways to go on those.
Data that you could guarantee was not generated by AI. Stuff like scans of old microfiche archives, scans of rare but unpopular books found in old boxes and attics. Things like that.
Free startup idea! Go to garage sales and look for boxes of books that maybe have never been digitized.
Just be careful that the entire training set doesn't have the moral values of 100 years ago!
(I think this is in "The Mind's I" (1981).)
  +-------------+
  |             |
  |             |
  +--->LLM----->+
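A toy version of that loop in code (the vocabulary and sample sizes are invented for illustration): fit a unigram "language model" to a corpus, sample a new corpus from the model, refit, and repeat. A word whose count ever hits zero can never be sampled again, so output diversity only ratchets downward.

  import math
  import random
  from collections import Counter

  def entropy_bits(probs):
      # Shannon entropy of the model's output distribution, in bits.
      return -sum(p * math.log2(p) for p in probs if p > 0)

  random.seed(0)
  vocab = [f"w{i:02d}" for i in range(50)]
  # "Human" corpus: long-tailed word frequencies (rank i appears 50 - i times).
  counts = Counter({w: 50 - i for i, w in enumerate(vocab)})

  for gen in range(12):
      total = sum(counts.values())
      probs = [counts[w] / total for w in vocab]
      alive = sum(1 for p in probs if p > 0)
      print(f"gen {gen:2d}: {alive:2d}/{len(vocab)} word types, "
            f"{entropy_bits(probs):.2f} bits")
      # Retrain the next "model" on a finite sample of this one's output.
      counts = Counter(random.choices(vocab, weights=probs, k=300))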
This reads like an extended sales pitch rather than an article. It's true that if you train the same model on nothing but a limited set of its own output for 30 epochs, you're gonna get a shitty model (although 30 epochs of finetuning will result in a pretty bad model on most datasets smaller than the original one). But paying the New York Times isn't going to change that: you can already mark articles pulled from its domain as being high-quality, human-sourced data, even if you don't pay them. If they win their lawsuits, they might be able to force model trainers to pay them for it, but that doesn't have anything to do with model collapse.
“‘Early in the Reticulum—thousands of years ago—it became almost useless because it was cluttered with faulty, obsolete, or downright misleading information,’ Sammann said.
“‘Crap, you once called it,’ I reminded him.
“‘Yes—a technical term. So crap filtering became important. Businesses were built around it. Some of those businesses came up with a clever plan to make more money: they poisoned the well. They began to put crap on the Reticulum deliberately, forcing people to use their products to filter that crap back out. They created syndevs whose sole purpose was to spew crap into the Reticulum. But it had to be good crap.’
“‘What is good crap?’ Arsibalt asked in a politely incredulous tone.
“‘Well, bad crap would be an unformatted document consisting of random letters. Good crap would be a beautifully typeset, well-written document that contained a hundred correct, verifiable sentences and one that was subtly false. It’s a lot harder to generate good crap. At first they had to hire humans to churn it out. They mostly did it by taking legitimate documents and inserting errors—swapping one name for another, say. But it didn’t really take off until the military got interested.’
“‘As a tactic for planting misinformation in the enemy’s reticules, you mean,’ Osa said. ‘This I know about. You are referring to the Artificial Inanity programs of the mid-First Millennium…’”
Badly-designed boats just don't return.
Ill-designed protection of cities means they'll be conquered.
Scientific ideas that are not corroborated will be discarded.
etc.
Our current approach to AI doesn't have this mechanism. In the past, humanity just implemented ideas: a city was built according to some weird idea and lasted for centuries. So the original idea would spread and be refined by later generations. I guess we need to bring such a mechanism into the loop.
You've got a fever, and the only prescription is MORE COWBELL
Can't wait to see what fixed points will emerge in this dynamical enshittification system.