September 1st, 2024

AI Has Created a Battle over Web Crawling

The rise of generative AI has prompted websites to restrict data access via robots.txt, leading to concerns over declining training data quality and potential impacts on AI model performance.

The rise of generative AI has led to a significant shift in how websites manage their data, particularly through the use of the robots.txt protocol, which restricts web crawlers from accessing certain content. As AI companies rely heavily on vast datasets sourced from the web, many organizations are increasingly concerned about their data being used without consent, prompting them to implement these restrictions.

A report from the Data Provenance Initiative highlights that a growing number of high-quality websites are blocking crawlers, which could lead to a decline in the quality of training data available for AI models. The report indicates a notable increase in domains restricting access from 2023 to 2024, with 25% of data from the top 2,000 websites in a popular dataset being revoked. This trend raises concerns about the future performance of AI models, as they may come to rely more on lower-quality data from personal blogs and e-commerce sites.

The situation is further complicated by the legal ambiguities surrounding robots.txt, which is not legally enforceable, and by ongoing debates about data usage rights. As a result, AI companies may need to explore alternative data acquisition strategies, including licensing data directly or investing in synthetic data, to maintain the quality and relevance of their training datasets.

- Websites are increasingly using robots.txt to restrict access to their data for AI crawlers.

- A significant portion of high-quality data is being revoked, impacting the training of generative AI models.

- Legal ambiguities exist around the enforceability of robots.txt restrictions.

- AI companies may need to license data or rely on synthetic data to continue model training.

- The shift in data availability could lead to a decline in the performance of future AI models.
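
As the summary notes, robots.txt is not legally enforceable; in practice it only works because a crawler chooses to check it before fetching. A minimal sketch of that voluntary client-side check, using Python's standard-library robotparser (the site URL and the "HypotheticalBot" user agent are placeholders, not real crawlers):

```python
from urllib import robotparser

# robots.txt compliance is voluntary: the crawler itself decides to run this check.
# "example.org" and "HypotheticalBot" are placeholders, not a real site or crawler.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://example.org/articles/some-page.html"
if rp.can_fetch("HypotheticalBot", url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt disallows fetching", url, "- a polite crawler skips it")
```

Nothing stops a crawler from skipping this check entirely, which is why the enforcement debate described in the article exists.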

Related

OpenAI and Anthropic are ignoring robots.txt

AI companies OpenAI and Anthropic are reported to be disregarding robots.txt rules, scraping web content despite claiming to respect such directives. Analytics from TollBit revealed this behavior, raising concerns about data misuse.

We need an evolved robots.txt and regulations to enforce it

In the era of AI, the robots.txt file faces limitations in guiding web crawlers. Proposals advocate for enhanced standards to regulate content indexing, caching, and language model training. Stricter enforcement, including penalties for violators like Perplexity AI, is urged to protect content creators and uphold ethical AI practices.

The Data That Powers A.I. Is Disappearing Fast

A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and companies like OpenAI, Google, and Meta. Challenges prompt exploration of new data access tools and alternative training methods.

NYT: The Data That Powers AI Is Disappearing Fast

A study highlights a decline in available data for training A.I. models due to restrictions from web sources, affecting A.I. developers and researchers. Companies explore partnerships and new tools amid data challenges.

AI crawlers need to be more respectful

Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.

12 comments
By @kelsey98765431 - 5 months
Just about two years ago, a long-dormant project surged back to life, becoming one of the best crawlers out there. Zimit was originally made to scrape MediaWiki sites, but it is now able to crawl practically anything into an offline archive file. I have been trying to grab the data I am guessing will soon be put under much stricter anti-scraping controls, and I am not the only one hoarding data. The winds are blowing towards a closed internet faster than I have ever seen in my life. Get it while you can.
By @0cf8612b2e1e - 5 months
Curious what people think is an appropriate request rate for crawling a website. I have seen many examples where the author will spin up N machines with M threads and just hammer a server until it starts returning more than a certain failure rate.

I have never done anything serious, but have always tried to keep my hit rate fairly modest.
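
For what it's worth, a common conservative pattern is a single worker, a fixed pause between requests to the same host, and backoff on failures, rather than N machines with M threads. A rough sketch under those assumptions; the five-second delay and the example URLs are arbitrary placeholders, since there is no agreed-upon standard rate:

```python
import time
import urllib.request

# Arbitrary, conservative choices -- there is no agreed-upon standard rate.
DELAY_SECONDS = 5      # pause between requests to the same host
MAX_RETRIES = 3        # give up on a URL after this many failures
BACKOFF_FACTOR = 2     # double the wait after each failure

def polite_fetch(url: str):
    """Fetch one URL, backing off on errors instead of hammering the server."""
    wait = DELAY_SECONDS
    for attempt in range(MAX_RETRIES):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except Exception:
            time.sleep(wait)
            wait *= BACKOFF_FACTOR
    return None  # skip the URL rather than keep retrying

urls = ["https://example.org/page1", "https://example.org/page2"]  # placeholders
for url in urls:
    polite_fetch(url)
    time.sleep(DELAY_SECONDS)  # fixed pause between requests to one host
```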

By @mfiro - 5 months
I think one other side effect of this is the increasing restriction of VPN usage for accessing big websites, pushing users towards logging in or using (mobile) apps instead. Recent examples include X, Reddit, and more recently YouTube, which has started blocking VPNs.

I'm also concerned that free and open APIs might become a thing of the past as more "AI-driven" web scrapers/crawlers begin to overwhelm them.

By @zzo38computer - 5 months
I think it is helpful for public mirrors to be made, in case the original files are lost or the original server is temporarily down, or if you want to access the content without hitting the same servers all the time (including when you have no internet connection but do have a local copy); you can then make your own copies from mirrors, and others can copy from those, and so on. Cryptographic hashing can be used to check that a copy matches the original using a code much shorter than the entire file. However, mirrors should not make an excessive number of access attempts, so I block everything in robots.txt (though it is fine if someone uses curl to download files, or clones a repository, and then mirrors them; I also mirror some of my own stuff on GitHub). What I do not want is for others to claim that I wrote something I did not write, or to claim copyright on something I wrote in a way that restricts its use by others.
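
A minimal sketch of the hash check described above, assuming the mirror publishes a SHA-256 digest alongside the file (the script name and arguments here are hypothetical):

```python
import hashlib
import sys

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Usage: python verify_mirror.py <local_copy> <expected_sha256>
    path, expected = sys.argv[1], sys.argv[2]
    actual = sha256_of_file(path)
    if actual == expected.lower():
        print("OK: mirror copy matches the published digest")
    else:
        print(f"MISMATCH: expected {expected}, got {actual}")
```
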
By @pton_xd - 5 months
Is there an agreed upon best effort robots.txt that I can reference to cut out crawlers?

Or a list of IP addresses, user agents, or other identifying info known to be used by AI crawlers?

Ultimately pointless but just curious.
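
There is no single canonical list, but a common starting point is to disallow the user agents that the major AI crawlers document publicly. A sketch of such a robots.txt is below; the agent names are the commonly cited ones, the list goes stale quickly, and compliance is entirely voluntary:

```
# Block commonly documented AI training crawlers (only deters honest bots)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Bytespider
User-agent: Amazonbot
Disallow: /

# Allow everything else
User-agent: *
Allow: /
```

Community-maintained lists of AI crawler user agents and IP ranges also exist and are easier to keep current than a hand-written file; blocking by user agent only works against crawlers that identify themselves honestly.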

By @superkuh - 5 months
Another way of stating it: now that worthless text and images can be made to be worth money, people no longer want to have public websites and are trying to change the longstanding culture of an open, remixable web to fit their new paranoia over not getting their cut.

I welcome all user agents and have been delighted to see OpenAI, Anthropic, Microsoft, and other AI-related crawlers mirroring my public website(s), just the same as when I see Googlebot, Bingbot, or Firefox-using humans. For information that I don't want to be public, I simply don't put it on public websites.

By @Venn1 - 5 months
I activated the Block AI Scrapers and Crawlers feature on Cloudflare months ago. When I checked the event logs, I was surprised by the persistent attempts from GPTBot, PetalBot, Amazonbot, and PerplexityBot to crawl my site. They were making multiple requests per hour.

Considering my blog's niche focus on A/V production with Linux, I can only imagine how much more frequent their crawling would be on more popular websites.

By @hn_throwaway_99 - 5 months
I feel like talking about robots.txt in this context is kind of a pointless enterprise given that there's no guarantee it will be followed by crawlers (and TFA fully acknowledges this). Before AI, there was a mutually beneficial (not necessarily equal, but mutual) economic arrangement where websites published open content freely, and search engines indexed that content. That arrangement fundamentally no longer exists, and we can't pretend it's coming back. The end game of this is more and stronger paywalls (and not ones easily bypassed by incognito mode), and I think that's inevitable.
By @philipwhiuk - 5 months
> It’s also the case that preferences shouldn’t be respected in all cases. For instance, I don’t think that academics or journalists doing prosocial research should necessarily be foreclosed from accessing data with machines that is already public, on websites that anyone could go visit themselves.

Screw that. Research is about ethical consent. Also, this is very much a case of "well, everyone should let me (a researcher) access whatever I like."

By @cactusplant7374 - 5 months
Is there a way to see if images on my website are in a training data set?
By @padthai - 5 months
Getting Covid toilet paper crisis vibes here.