NYT: The Data That Powers AI Is Disappearing Fast
A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and researchers. Companies explore partnerships and new tools amid data challenges.
A recent study by the Data Provenance Initiative reveals a significant decline in the availability of data crucial for training artificial intelligence (A.I.) models. The study found that key web sources used for A.I. training have imposed restrictions on data usage, leading to an "emerging crisis in consent." Approximately 5% of data in commonly used A.I. training sets has been restricted, with up to 45% of data in some sets limited by websites' terms of service. This trend poses challenges for A.I. developers, researchers, and noncommercial entities reliant on public data sets. Companies like OpenAI, Google, and Meta have faced obstacles in gathering high-quality data, prompting some to seek partnerships with publishers for ongoing data access. The study underscores the need for new tools to enable website owners to control data usage more precisely. As A.I. companies navigate data restrictions and seek alternative training methods like synthetic data, the industry faces uncertainties regarding the future availability and quality of training data.
- Concerns about monopolies: Some fear that licensing regimes could lead to monopolies by large companies, restricting individual use of AI tools.
- Data consent and ethics: Many discuss the lack of consent for using data from platforms like YouTube and the ethical implications of scraping data without permission.
- Alternative data sources: There is debate over the potential of synthetic data to replace human-created data, with some optimistic about its future use.
- Technical challenges: Aggressive bots that ignore robots.txt and overload servers are cited as practical reasons for data restrictions (see the polite-crawler sketch after this list).
- Future of AI development: Some see the data scarcity as a push towards new AI architectures that mimic human learning, while others are skeptical about the sustainability of current AI progress.
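As a minimal sketch of the flip side of that robots.txt point (a crawler that honors the rules rather than ignoring them), Python's standard urllib.robotparser can check a site's robots.txt before fetching; the host and the "GPTBot" token below are placeholders:

import urllib.robotparser

# Fetch and parse the target site's robots.txt (example.com is a placeholder).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler checks each URL against its own user-agent token
# before requesting it; "GPTBot" here stands in for any AI crawler name.
url = "https://example.com/articles/some-page.html"
if rp.can_fetch("GPTBot", url):
    print("allowed to fetch", url)
else:
    print("robots.txt disallows", url)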
Currently, I think most of the training use cases can be covered by the existing "you can't copyright a fact" carve-out in the law. That's probably better for society and creators than my licensing-regime scenario.
Anyway, I'm rooting for "no regulation" for now. The whole industry is still being screwed over by market distortions created by the DMCA, and this could easily be 10x worse.
If anyone would like to join in, there's an actively maintained robots.txt here:
https://github.com/ai-robots-txt/ai.robots.txt
Yes, I know this isn't legally binding and scrapers can ignore it if they want to.
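For the curious, the entries in that file look roughly like the snippet below; GPTBot, CCBot, and Google-Extended are a few of the AI-crawler user agents the repo tracks, and a blanket Disallow turns each of them away:

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /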
I'm thinking synthetic datasets are how it's going to go. Of course you can't get information from nothing, but they might not need nearly as much seed data to generate lots of examples that are specifically designed to train reasoning skills.
Maybe it will take a while to get over the hump, but they'll be motivated to make it work.
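As a toy illustration of what "specifically designed to train reasoning skills" could mean (an assumption about the approach, not how any lab actually does it), procedurally generated problems come with verifiable step-by-step answers and need essentially no seed data:

import json
import random

def make_example() -> dict:
    # Build a two-step arithmetic problem together with a worked solution.
    a, b, c = (random.randint(2, 99) for _ in range(3))
    question = f"Start with {a}, add {b}, then multiply the result by {c}."
    steps = f"{a} + {b} = {a + b}; {a + b} * {c} = {(a + b) * c}"
    return {"prompt": question, "completion": steps}

# Emit as many examples as we like, one JSON object per line.
with open("synthetic_reasoning.jsonl", "w") as f:
    for _ in range(100_000):
        f.write(json.dumps(make_example()) + "\n")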
https://www.nytimes.com/2024/07/19/technology/generative-ai-...
https://web.archive.org/web/20240720013944if_/https://archiv...
This is how I read NYT now:
tnftp -4o"|yy093" https://www.nytimes.com/2024/07/19/technology/ai-data-restrictions.html > 1.htm
links 1.htm
Using HTTP/1.1 pipelining, i.e., yy025 and tcpclient instead of tnftp, I retrieve many articles over a single connection. yy093 outputs a single HTML page with all the fulltext articles, the way I like it. 100% Javascript-free.
The web keeps getting better for text-only. More JSON, less HTML.
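For anyone curious what pipelining looks like without those custom tools (yy025 and yy093 are the commenter's own utilities), here is a rough Python sketch that writes several GET requests back-to-back on one TLS connection before reading anything; example.com is a placeholder, and many real servers ignore or reject pipelining:

import socket
import ssl

HOST = "example.com"      # placeholder host
PATHS = ["/", "/", "/"]   # several requests, pipelined on one connection

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as raw:
    with ctx.wrap_socket(raw, server_hostname=HOST) as s:
        # Send every request before reading anything; mark the last one
        # "Connection: close" so the server hangs up when it is finished.
        for i, path in enumerate(PATHS):
            conn = "close" if i == len(PATHS) - 1 else "keep-alive"
            s.sendall((f"GET {path} HTTP/1.1\r\n"
                       f"Host: {HOST}\r\n"
                       f"Connection: {conn}\r\n\r\n").encode())
        # Responses arrive in order; read until the server closes the socket.
        body = b""
        while chunk := s.recv(65536):
            body += chunk

print(len(body), "bytes across", len(PATHS), "pipelined responses")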
That is such a weird and misleading way to put it. There was no consent in the first place. Take YouTube for example. Google did not consent to the videos it hosts being used by OpenAI. The uploader certainly did not ever consent to their face, voice and content being used to train models either.
Eager to create "safety" the wikipedia-ish discussion where no two truths can exist simultaneously will happen in private and the official narrative of everything however preposterous will be the only one.
We should also create LLM driven content moderation so that wrong think can be silenced immediately.
The masses of brainwashed slaves will no doubt learn to like it.
It's easier to ask for forgiveness than consent,
the cost of doing business,
etcetera, etcetera.
Well, so it's not really disappearing at all.
Having worked on this topic, I can say that creating synthetic data is not always easy, and you still need real data to get the best results. If you look beyond basic internet media, there are tons of fields where "real data" does not exist in quantities that could help create effective models. These might be good case studies on how to proceed further in small endeavours.
"Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers."
And also this:
"Mr. Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data."
This will happen once everything is looted/not needed anymore.
The truth is that no one knows where the dust will settle. All covered wagons are venturing west of the Rockies at the same time. Content creators are not really facing a worse outlook than everyone else. They are, however, in a better position to hinder the advance.
In principle, an AI learning from a scientific textbook is no different than a human student doing the same. Neither will violate copyright law when they're done learning, except perhaps accidentally - paraphrasing and facts are not violations. Unfortunately, legal and ethical principles can differ from legal reality. We're left hoping that some altruistic legal minds will open up a Northwest Passage for us, like Thomas Penfield Jackson in US v. Microsoft.
The worst possible outcome is that we end up with an all-powerful AI cartel which negotiates massive deals with IP conglomerates, locking out competition and open and free alternatives.
I mean the entire industry is based on doing stuff without seeking consent.
All the major players seem to have used at least the books dataset at some point, so there's definitely no regard for consent or copyright.
The bigger issue is mechanical limits like Twitter and Reddit: walled gardens of info. That could entrench existing players, whether via money (pay me for data), ethics (oh, now suddenly consent matters), or just timing (more stuff mechanically restricted).
I doubt they can continue making progress at the speed they have so far, because the game is set to become more difficult.
... says the article behind a paywall.
TBH I would not feel any sadness if LLM models plateaued, greatly decelerated in development, or even -regressed- as they start to ingest all the garbage already being spewed by LLMs.
This is just recent hype that relies on ingesting insane amounts of data for training, but we had, and will continue to have, AI models that do not rely on training on data taken without consent.