July 21st, 2024

NYT: The Data That Powers AI Is Disappearing Fast

A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and researchers. Companies explore partnerships and new tools amid data challenges.

Tags: Data, Consent, Regulation

A recent study by the Data Provenance Initiative reveals a significant decline in the availability of data crucial for training artificial intelligence (A.I.) models. The study found that key web sources used for A.I. training have imposed restrictions on data usage, leading to an "emerging crisis in consent." Approximately 5% of data in commonly used A.I. training sets has been restricted, with up to 45% of data in some sets limited by websites' terms of service. This trend poses challenges for A.I. developers, researchers, and noncommercial entities reliant on public data sets. Companies like OpenAI, Google, and Meta have faced obstacles in gathering high-quality data, prompting some to seek partnerships with publishers for ongoing data access. The study underscores the need for new tools to enable website owners to control data usage more precisely. As A.I. companies navigate data restrictions and seek alternative training methods like synthetic data, the industry faces uncertainties regarding the future availability and quality of training data.

AI: What people are saying
The comments on the article highlight several key points:
  • Concerns about monopolies: Some fear that licensing regimes could lead to monopolies by large companies, restricting individual use of AI tools.
  • Data consent and ethics: Many discuss the lack of consent for using data from platforms like YouTube and the ethical implications of scraping data without permission.
  • Alternative data sources: There is debate over the potential of synthetic data to replace human-created data, with some optimistic about its future use.
  • Technical challenges: Issues with aggressive bots ignoring robots.txt and causing server problems are mentioned as practical reasons for data restrictions.
  • Future of AI development: Some see the data scarcity as a push towards new AI architectures that mimic human learning, while others are skeptical about the sustainability of current AI progress.
35 comments
By @hedora - 4 months
I think the worst possible outcome is a licensing regime that means that Disney or Paramount or Elsevier or whoever all get to have a monopoly on training large models within their niche. My guess is that any successful calls for regulation will have this outcome, which means that individuals won't be able to legally use AI-based tools except when creating works for hire, etc.

Currently, I think most of the training use cases can be covered by the existing "you can't copyright a fact" carve out in the law. That's probably better for society and creators than my licensing regime scenario.

Anyway, I'm rooting for "no regulation" for now. The whole industry is still being screwed over by market distortions created by the DMCA, and this could easily be 10x worse.

By @jsheard - 4 months
> Those restrictions are set up through the Robots Exclusion Protocol

If anyone would like to join in, there's an actively maintained robots.txt here:

https://github.com/ai-robots-txt/ai.robots.txt

Yes, I know this isn't legally binding and scrapers can ignore it if they want to.
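
In the spirit of that list, a minimal robots.txt that opts out of two real AI crawlers (GPTBot is OpenAI's, CCBot is Common Crawl's; the repo tracks many more) looks like this:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /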

By @thatxliner - 4 months
I would be fine if you use my data to train your AI models if you let me use your models for free. If you can’t do that, you can’t have my data.
By @lionelholt - 4 months
The decision to block bots is not always about protecting intellectual property. A practical consideration I haven't seen mentioned is that some of these AI bots are stupidly aggressive with their requests, even ignoring robots.txt. I had to activate Cloudflare WAF and block a variety of bots to prevent my web app servers from crashing. At least they're reasonable enough to identify themselves!
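
A rough sketch of that kind of filtering, written as hypothetical Python WSGI middleware rather than the commenter's actual Cloudflare WAF rules (the user-agent substrings are a small real sample of self-identifying AI crawlers):

    # Hypothetical middleware: refuse requests from self-identified AI
    # crawlers before they reach the app. Real deployments usually do
    # this at the CDN/WAF layer, as the commenter describes.
    BLOCKED_UA = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

    def block_ai_bots(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(bot in ua for bot in BLOCKED_UA):
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"Crawling not permitted.\n"]
            return app(environ, start_response)
        return middleware

Wrapping any WSGI app (app = block_ai_bots(app)) is enough; bots that lie about their user agent, of course, sail straight through.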
By @skybrian - 4 months
> Some companies believe they can scale the data wall by using synthetic data — that is, data that is itself generated by A.I. systems — to train their models. But many researchers doubt that today’s A.I. systems are capable of generating enough high-quality synthetic data to replace the human-created data they’re losing.

I'm thinking synthetic datasets are how it's going to go. Of course you can't get information from nothing, but they might not need nearly as much seed data to generate lots of examples that are specifically designed to train reasoning skills.

Maybe it will take a while to get over the hump, but they'll be motivated to make it work.
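
As a toy illustration of the seed-data point, a few lines of Python can mint unlimited (prompt, answer) pairs from almost no seed material; real synthetic-data pipelines use model-generated and filtered examples, but the economics are the same:

    import random

    # Toy synthetic-data generator: programmatic arithmetic problems,
    # one of the simplest forms of "reasoning" training data. A tiny
    # seed (the template and operator table) yields unlimited pairs.
    def make_example(rng):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        op, result = rng.choice([("+", a + b), ("-", a - b), ("*", a * b)])
        return f"What is {a} {op} {b}?", str(result)

    rng = random.Random(0)
    for prompt, answer in (make_example(rng) for _ in range(3)):
        print(prompt, "->", answer)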

By @buildbot - 4 months
Well, unless you exclude Common Crawl and block all robots… it’s still going to end up in a dataset someday. Or deleted and gone forever!
By @rectang - 4 months
The NYT had another article a couple of days ago about Getty leveraging the images it owns to go into AI.

https://www.nytimes.com/2024/07/19/technology/generative-ai-...

By @thriftwy - 4 months
I look forward to an AI trained entirely on Wikipedia and classical literature, with no Twitter and no contemporary art in sight. It would be sublime. Let's face it, the creators of the 21st century way overestimate the importance of their stuff. It's mostly deleterious to the culture.
By @1vuio0pswjnm7 - 4 months
Works where archive.ph is blocked:

https://web.archive.org/web/20240720013944if_/https://archiv...

This is how I read NYT now:

   tnftp -4o"|yy093" https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html > 1.htm
   links 1.htm
Using HTTP/1.1 pipelining, i.e., yy025 and tcpclient instead of tnftp, I retrieve many articles over a single TCP connection.
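
For readers without yy025 or tcpclient, a rough Python equivalent of the single-connection idea (http.client reuses one TCP connection via keep-alive; true pipelining, sending every request before reading any response, would need a raw socket):

    import http.client

    # Fetch several articles over one TCP connection using HTTP/1.1
    # keep-alive. The path below is the one from the command above;
    # add more paths to the list to batch further requests.
    paths = ["/2024/07/19/technology/ai-data-restrictions.html"]

    conn = http.client.HTTPSConnection("www.nytimes.com")
    for path in paths:
        conn.request("GET", path, headers={"User-Agent": "text-reader"})
        resp = conn.getresponse()
        body = resp.read()  # drain fully before the next request
        print(path, resp.status, len(body), "bytes")
    conn.close()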

yy093 outputs a single HTML page with all the fulltext articles, the way I like it. 100% Javascript-free.

The web keeps getting better for text-only. More JSON, less HTML.

By @janice1999 - 4 months
> We’re seeing a rapid decline in consent to use data across the web

That is such a weird and misleading way to put it. There was no consent in the first place. Take YouTube for example. Google did not consent to the videos it hosts being used by OpenAI. The uploader certainly did not ever consent to their face, voice and content being used to train models either.

By @skeledrew - 4 months
On one hand this really sucks, particularly for newcomers starting from scratch: now only the larger companies that have already scraped the web can carry on in this vein, finding ways to improve their architectures to better utilize the data they already have. On the other hand, I see this as a forcing function to move away from generative AI sooner (which I've always considered a dead end for AGI) and toward future architectures that train on data the way humans learn, i.e. raw audio, video, and other streams from the environment.
By @6510 - 4 months
The only interesting part is that it will become possible to remove a book without leaving an empty spot on the shelf.

Eager to create "safety", the Wikipedia-ish discussions where no two truths can exist simultaneously will happen in private, and the official narrative of everything, however preposterous, will be the only one.

We should also create LLM driven content moderation so that wrong think can be silenced immediately.

The masses of brainwashed slaves will no doubt learn to like it.

By @amelius - 4 months
Let's not fool ourselves and think that these big AI companies care about licenses.

It's easier to ask for forgiveness than consent,

the cost of doing business,

etcetera, etcetera.

By @GaggiX - 4 months
>Those restrictions are set up through the Robots Exclusion Protocol

Well, then it's not really disappearing at all.

By @rldjbpin - 4 months
while "data is the new oil" might be so late 2000s and early 10s, it is finally coming to fruition in public discourse after people realized that one can dump them into a model and get something you can work with.

having worked in this topic, creating synthetic data is not always easy, and you still need real data to get the best results. if you look beyond basic internet media, there are tons of fields where "real data" does not exist in quantities that could help create effective models. these might be good case studies on how to proceed further in small endeavours.

By @lofaszvanitt - 4 months
Yes, this is the key point:

"Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers."

And also this:

"Mr. Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data."

This will happen once everything is looted/not needed anymore.

By @worstspotgain - 4 months
Outside of tech, AI isn't making any friends. It's an unexpected threat on the horizon for the bearers of accumulated privilege, such as monopolies and nation-state autocrats. They hear echoes of the computer and internet revolutions, which were key in upending many prior "garden patches" of power. Their cronies have been stirring up resistance among workers on the front lines of disruption, such as some content creators.

The truth is that no one knows where the dust will settle. All covered wagons are venturing west of the Rockies at the same time. Content creators are not really facing a worse outlook than everyone else. They are, however, in a better position to hinder the advance.

In principle, an AI learning from a scientific textbook is no different than a human student doing the same. Neither will violate copyright law when they're done learning, except perhaps accidentally - paraphrasing and facts are not violations. Unfortunately, legal and ethical principles can differ from legal reality. We're left hoping that some altruistic legal minds will open up a Northwest Passage for us, like Thomas Penfield Jackson in US v. Microsoft.

The worst possible outcome is that we end up with an all-powerful AI cartel which negotiates massive deals with IP conglomerates, locking out competition and open and free alternatives.

By @jll29 - 4 months
"disapparing" = people getting aware that their data has value, and setting their robots.txt permissions acordingly?
By @adultSwim - 4 months
The data is already incorporated into existing models. Those models will generate derivative data used to train the next models. They only needed to rip it all off once; after that, it's actually preferable not to let anyone do it again.
By @globalnode - 4 months
There will be a new law which states that copyrights don't have effect if the data is being used to train machine learning models (except in the EU perhaps? heh).
By @Havoc - 4 months
> consent

I mean the entire industry is based on doing stuff without seeking consent.

All the major players seem to have used the books dataset at some point, so there's definitely no regard for consent or copyright.

The bigger issue is mechanical limits like Twitter and Reddit: walled gardens of info. That could entrench existing players, whether via money (pay me for data), ethics (oh, now suddenly consent matters), or just timing (more stuff mechanically restricted).

By @masherm - 4 months
What if everyone resorts to training on synthetic data ...
By @wkat4242 - 4 months
In the near future I see less real, quality human data, as it goes behind paywalls, and much more AI-generated data feeding into the next generation of AI, because more and more people are using it to publish stuff online, which then gets scooped up by AI training crawlers. And if it made sense to train an LLM on its own output, it would be done already :)

I doubt they can keep up the progress at the speed they have managed so far, because the game is set to become more difficult.

By @cozzyd - 4 months
Is it easy to recognize the AI content stealers by user agent? Can we just feed them garbage when detected?
By @1024core - 4 months
> The data that powers AI is disappearing fast

... says the article behind a paywall.

By @greatpostman - 4 months
AI negativity clickbait
By @doctorpangloss - 4 months
“Tell HN: Stop Reading The New York Times”
By @to11mtm - 4 months
It honestly makes me hesitant to publish my special personal code on gists or in private repos, versus at least keeping 3-2 backups of the really good stuff...

TBH I would not feel any sadness if LLM models plateaued, greatly decelerated in development, or even regressed as they start to ingest all the garbage already being spewed by LLMs.

By @blackeyeblitzar - 4 months
We should make the data on large platforms like YouTube and social media in general accessible to all companies for AI use (with the actual creator’s positive consent).
By @elAhmo - 4 months
NYT assumes that LLMs = AI, which is far from the truth.

This is just the latest hype, which relies on ingesting insane amounts of data for training, but we have had, and will continue to have, AI models that do not rely on training with data taken without consent.