August 26th, 2024

When A.I.'s Output Is a Threat to A.I. Itself

A.I. systems face quality degradation from training on their own outputs, risking "model collapse." Ensuring diverse, high-quality real-world data is essential to maintain effectiveness and reliability in A.I. applications.

As artificial intelligence (A.I.) continues to generate vast amounts of content, the risk of A.I.-produced data being used to train future A.I. systems is increasing. This creates a feedback loop where A.I. models may ingest their own outputs, leading to a decline in quality and diversity, a phenomenon termed "model collapse." Research indicates that when A.I. is repeatedly trained on its own output, it can produce increasingly blurred and less varied results, as seen in experiments with handwritten digits and language models. This degradation can manifest in various applications, such as medical chatbots providing less accurate information or image generators producing distorted visuals. The challenge is compounded by the prevalence of A.I.-generated content on the internet, which can contaminate training datasets. To mitigate these issues, A.I. companies must ensure their models are trained on high-quality, diverse real-world data rather than solely on synthetic outputs. Strategies include acquiring data from reliable sources and developing better detection methods for A.I. content. Without intervention, the erosion of diversity and quality in A.I. outputs could have significant implications for the technology's future effectiveness and reliability.
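The dynamic described above is easy to reproduce in miniature. Below is a rough sketch (a toy Python example of my own, not the experiments the article cites): a categorical "language model" is repeatedly re-fit on text sampled from its previous self, and because a rare token that fails to appear in one generation can never reappear, diversity only shrinks.

  # Toy sketch of model collapse: re-fit a categorical "language model"
  # on samples drawn from its previous self. Rare tokens drop out and can
  # never return, so the surviving vocabulary shrinks each generation.
  # Illustrative assumptions only (Zipf prior, 200-token vocabulary).
  import numpy as np

  rng = np.random.default_rng(0)
  VOCAB = 200        # toy vocabulary size
  SAMPLES = 1_000    # "training corpus" size per generation

  # Generation 0: a Zipf-like "real" token distribution.
  probs = 1.0 / np.arange(1, VOCAB + 1)
  probs /= probs.sum()

  for generation in range(15):
      corpus = rng.choice(VOCAB, size=SAMPLES, p=probs)   # sample from the current model
      counts = np.bincount(corpus, minlength=VOCAB)
      probs = counts / counts.sum()                       # re-train on own output
      print(f"gen {generation:2d}: distinct tokens left = {np.count_nonzero(counts)}")

Run it and the count of distinct tokens drops generation after generation, a crude analogue of the blurring digits and narrowing language-model outputs the article describes.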

- A.I. systems risk degrading in quality when trained on their own outputs, leading to "model collapse."

- The prevalence of A.I.-generated content on the internet complicates the training of future A.I. models.

- Ensuring A.I. is trained on high-quality, diverse real-world data is crucial to maintaining output quality.

- Companies are exploring methods like watermarking to better detect A.I.-generated content (a toy sketch of one such scheme follows this list).

- The decline in diversity of A.I. outputs could amplify biases and reduce the effectiveness of A.I. applications.
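
On the watermarking point above, here is a minimal sketch of one well-known statistical approach, a hash-seeded "green list" in the spirit of Kirchenbauer et al. (2023). The key, vocabulary, and bias below are invented for illustration and are not any company's deployed scheme: the generator prefers a pseudo-random half of the vocabulary derived from the previous token, and the detector checks whether that half is over-represented.

  # Toy "green list" watermark sketch with an invented key and a toy
  # integer vocabulary. Not a production scheme.
  import hashlib
  import math
  import random

  VOCAB = list(range(1000))     # toy vocabulary of integer "tokens"
  GAMMA = 0.5                   # fraction of the vocabulary that is "green"
  KEY = "secret-watermark-key"  # hypothetical shared key

  def green_list(prev_token):
      """Seed an RNG from (key, previous token) and pick the green tokens."""
      seed = int(hashlib.sha256(f"{KEY}:{prev_token}".encode()).hexdigest(), 16)
      return set(random.Random(seed).sample(VOCAB, int(GAMMA * len(VOCAB))))

  def generate(length, bias=0.9):
      """'Generate' text, picking a green token with probability `bias`."""
      r = random.Random(42)
      tokens = [r.choice(VOCAB)]
      for _ in range(length - 1):
          greens = green_list(tokens[-1])
          pool = list(greens) if r.random() < bias else [t for t in VOCAB if t not in greens]
          tokens.append(r.choice(pool))
      return tokens

  def z_score(tokens):
      """Deviation of the green-token count from the unwatermarked expectation."""
      hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
      n = len(tokens) - 1
      return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

  plain_rng = random.Random(7)
  print("watermarked z:", round(z_score(generate(200)), 2))                                    # large positive
  print("unwatermarked z:", round(z_score([plain_rng.choice(VOCAB) for _ in range(200)]), 2))  # near zero

A detector like this only flags text from a generator that cooperated, which is one reason detection is discussed alongside sourcing reliable human data rather than as a complete fix.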

13 comments
By @KingMob - about 2 months
> Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction.

This...is already the case. Nothing about LLM architecture is biased towards facts. If they're factual at all, it's only because most things written by people are true (to the best of their knowledge, anyway).

There are certainly attempts to constrain LLMs to be more factual, but we still have a ways to go on those.

By @jedberg - about 2 months
I had an idea a couple of weeks ago -- vintage data.

Data that you could guarantee was not generated by AI. Stuff like scans of old microfiche archives, scans of rare but unpopular books found in old boxes and attics. Things like that.

Free startup idea! Go to garage sales and look for boxes of books that maybe have never been digitized.

Just be careful that the entire training set doesn't have the moral values of 100 years ago!

By @ggm - about 2 months
Either Daniel Dennett or one of his friends wrote a thought-experiment story about a worldwide brain-science drive in which every participating scientist gets one neuron from the dissected brain to keep alive in a tank, observing its firing patterns. One day one of them notices his neuron has died, so he quietly wires the input to the output through a circuit that fires the same way...

(I think this is in "The Mind's I", 1981.)

By @brudgers - about 2 months
Because GOGI is cheaper than Mechanical Turk.

  +-------------+
  |             |
  |             |
  +--->LLM----->+
By @nyc111 - about 2 months
By @reissbaker - about 2 months
Color me unsurprised that the New York Times is writing an article drumming up fear around synthetic data and training on the internet, and concludes that the solution is for "A.I. companies to pay for this data instead of scooping it up from the internet, ensuring both human origin and high quality . . . there’s no replacement for the real thing."

This reads like an extended sales pitch rather than an article. It's true that if you train the same model on nothing but a limited set of its own output for 30 epochs, you're gonna get a shitty model (although 30 epochs of finetuning will result in a pretty bad model on most datasets smaller than the original one). But paying the New York Times isn't going to change that: you can already mark articles pulled from its domain as being high-quality, human-sourced data, even if you don't pay them. If they win their lawsuits, they might be able to force model trainers to pay them for it, but that doesn't have anything to do with model collapse.

By @fallous - about 2 months
Neal Stephenson's Anathem:

“‘Early in the Reticulum—thousands of years ago—it became almost useless because it was cluttered with faulty, obsolete, or downright misleading information,’ Sammann said.

“‘Crap, you once called it,’ I reminded him.

“‘Yes—a technical term. So crap filtering became important. Businesses were built around it. Some of those businesses came up with a clever plan to make more money: they poisoned the well. They began to put crap on the Reticulum deliberately, forcing people to use their products to filter that crap back out. They created syndevs whose sole purpose was to spew crap into the Reticulum. But it had to be good crap.’

“‘What is good crap?’ Arsibalt asked in a politely incredulous tone.

“‘Well, bad crap would be an unformatted document consisting of random letters. Good crap would be a beautifully typeset, well-written document that contained a hundred correct, verifiable sentences and one that was subtly false. It’s a lot harder to generate good crap. At first they had to hire humans to churn it out. They mostly did it by taking legitimate documents and inserting errors—swapping one name for another, say. But it didn’t really take off until the military got interested.’

“‘As a tactic for planting misinformation in the enemy’s reticules, you mean,’ Osa said. ‘This I know about. You are referring to the Artificial Inanity programs of the mid-First Millennium…’”

By @renewiltord - about 2 months
It's going to be fine. The tool is fantastic and still getting better. We have enough content already for the base models and then we can take care of the new stuff. It's going to be non-trivial but it'll work.
By @Sparkyte - about 2 months
AI's novelty is wearing off as it fails to produce the immediate growth investors expect. Today's investors get caught up exploiting the latest thing, then burn out because they released something half-baked.
By @surfingdino - about 2 months
Yup. It's VIGO--the new business model of Value In, Garbage Out.
By @mo_42 - about 2 months
Humanity has bootstrapped itself out of a lot of BS over the centuries. There's a mechanism for discarding bad ideas. For example:

Badly-designed boats just don't return.

Ill-designed city defenses mean the city will be conquered.

Scientific ideas that are not corroborated will be discarded.

etc.

Our current approach to AI doesn't have this mechanism. In the past, humanity just implemented ideas: a city built according to some weird idea either failed or lasted centuries, and the ideas behind the survivors spread and were refined by later generations. I guess we need to bring such a mechanism into the loop.

By @romwell - about 2 months
Tomorrow's LLM-based doctor:

You've got a fever, and the only prescription is MORE COWBELL

Can't wait to see what fixed points will emerge in this dynamical enshittification system.

By @llamataboot - about 2 months
Disappointed they didn't even mention intentional data poisoning - we simply have no idea what trapdoors are being laid in training data yet to be ingested...