July 25th, 2024

AI crawlers need to be more respectful

Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.

Read the Docs has reported a significant increase in abusive site crawling by AI products, leading to substantial bandwidth costs and operational challenges. The organization, which typically welcomes bots, has had to block several sources of abusive traffic due to aggressive crawling behavior. For instance, one crawler downloaded 73 TB of data in May 2024, costing over $5,000 in bandwidth charges, while another used Facebook's content downloader to pull 10 TB of data in June 2024. These crawlers often lack basic checks, such as rate limiting and support for caching mechanisms, resulting in repeated downloads of unchanged files.

In response to this abuse, Read the Docs has implemented temporary blocks on identified AI crawlers and is enhancing monitoring and rate limiting measures. The organization emphasizes the need for AI companies to adopt more respectful crawling practices to avoid backlash and potential blocking from various sites. They propose collaboration with AI companies to develop integrations that would allow for more efficient and respectful data retrieval, such as alerts for content changes. Read the Docs calls for the implementation of basic checks in crawlers to prevent further issues, highlighting the importance of maintaining good relationships within the community while addressing the challenges posed by current crawling practices.
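
None of these basic checks is exotic. As a rough illustration only (hypothetical crawler name, documentation URL, and one-request-per-second pacing; Python standard library), a respectful fetch might look like this:

    # Illustrative polite-crawler fetch: check robots.txt, identify yourself,
    # and pace requests. All names, URLs, and limits here are made-up examples.
    import time
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleDocsCrawler/1.0 (+mailto:crawler-ops@example.com)"

    robots = urllib.robotparser.RobotFileParser("https://docs.example.org/robots.txt")
    robots.read()

    def polite_fetch(url):
        if not robots.can_fetch(USER_AGENT, url):
            return None  # the site asked not to be crawled here; honor it
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request) as response:
            body = response.read()
        time.sleep(1)  # crude rate limit: at most about one request per second
        return body

Layering conditional requests on top of this removes most of the remaining waste when re-crawling unchanged pages.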

Related

OpenAI and Anthropic are ignoring robots.txt

Two AI companies, OpenAI and Anthropic, are reported to be disregarding robots.txt rules and scraping web content despite claiming to respect the convention. TollBit's analytics revealed this behavior, raising concerns about data misuse.

We need an evolved robots.txt and regulations to enforce it

In the era of AI, the robots.txt file faces limitations in guiding web crawlers. Proposals advocate for enhanced standards to regulate content indexing, caching, and language model training. Stricter enforcement, including penalties for violators like Perplexity AI, is urged to protect content creators and uphold ethical AI practices.

Block AI bots, scrapers and crawlers with a single click

Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.

Cloudflare rolls out feature for blocking AI companies' web scrapers

Cloudflare introduces a new feature to block AI web scrapers, available in free and paid tiers. It detects and combats automated extraction attempts, enhancing website security against unauthorized scraping by AI companies.

iFixit takes shots at Anthropic for hitting servers a million times in 24 hours

iFixit CEO Kyle Wiens criticized Anthropic for making excessive requests to their servers, violating terms of service. He expressed frustration over unauthorized scraping, highlighting concerns about AI companies accessing content without permission.

28 comments

By @everforward - 4 months
I'm curious if there's a point where a crawler is misconfigured so badly that it becomes a violation of the CFAA by nature of recklessness.

They say a single crawler downloaded 73 TB of zipped HTML files in a month. That averages out to roughly 29 MB/s of sustained traffic for the entire month.
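
For reference, the back-of-the-envelope math behind that figure (assuming decimal terabytes and a 30-day month):

    # 73 TB spread evenly over a 30-day month, in decimal units
    total_bytes = 73e12
    seconds_in_month = 30 * 24 * 3600
    print(total_bytes / seconds_in_month / 1e6)  # ~28 MB/s, sustained day and night

A 31-day May brings it down to about 27 MB/s; either way the order of magnitude stands.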

Averaging 30 megabytes a second of traffic for a month is crossing into reckless territory. I don't think any sane engineer would call that normal or healthy for scraping a site like ReadTheDocs; Twitter/Facebook/LinkedIn/etc, sure, but not ReadTheDocs.

To me, this crosses into "recklessly negligent" territory, and I think should come with government fines for the company that did it. Scraping is totally fine to me, but it needs to be done either a) at a pace that will not impact the provider (read: slowly), or b) with some kind of prior agreement that the provider is accepting responsibility to provide enough capacity.

While I agree that putting content out into the public means it can be scraped, I don't think that necessarily implies scrapers can do their thing at whatever rate they want. As a provider, there's very little difference to me between getting DDoSed and getting scraped to death; both ruin the experience for users.

By @simonw - 4 months
"One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."

Wow. That's some seriously disrespectful crawling.

By @DamonHD - 4 months
Not just AI: here is my current side-quest: https://www.earth.org.uk/RSS-efficiency.html

Over 99% of the bandwidth (and CPU) consumed by the biggest podcast / music services simply polling feeds is completely unnecessary. But of course, pointing this out to them gets some sort of "oh, this is normal, we don't care" response, because they are big enough to know that, e.g., podcasters need them.
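
A conditional GET removes most of that waste: the client replays the ETag (or Last-Modified value) from its previous poll, and the server can answer 304 Not Modified with no body at all. A minimal sketch, assuming the third-party requests library and a made-up feed URL:

    # Hypothetical conditional feed poll; a 304 response means no feed body is sent.
    import requests

    FEED_URL = "https://podcasts.example.com/feed.xml"  # example feed, not a real one
    etag = None  # persist this between polls (file, database, etc.)

    headers = {"If-None-Match": etag} if etag else {}
    response = requests.get(FEED_URL, headers=headers, timeout=30)
    if response.status_code == 304:
        print("feed unchanged; effectively zero bytes transferred")
    else:
        etag = response.headers.get("ETag")  # remember for the next poll
        print(f"feed changed: fetched {len(response.content)} bytes")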

By @mikae1 - 4 months
HellPot: https://github.com/yunginnanet/HellPot

> Clients (hopefully bots) that disregard robots.txt and connect to your instance of HellPot will suffer eternal consequences. HellPot will send an infinite stream of data that is just close enough to being a real website that they might just stick around until their soul is ripped apart and they cease to exist.
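
Conceptually (this is not HellPot's actual implementation, which is written in Go), a tarpit can be as simple as a handler that never stops writing. A rough standard-library sketch with made-up filler and pacing:

    # Toy tarpit: stream filler HTML forever, slowly, to whoever hits the trap path.
    import time
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    FILLER = b"<p>placeholder prose to keep an impolite bot busy</p>\n"

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            try:
                while True:                  # never finishes, by design
                    self.wfile.write(FILLER)
                    self.wfile.flush()       # push each chunk so the client keeps reading
                    time.sleep(1)            # trickle to waste the bot's time, not yours
            except (BrokenPipeError, ConnectionResetError):
                pass                         # the client finally gave up

    if __name__ == "__main__":
        ThreadingHTTPServer(("", 8080), TarpitHandler).serve_forever()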

By @Venn1 - 4 months
I blocked Microsoft/OpenAI a few weeks ago for (semi) childish reasons. Seven months later, Bing still refuses to index my blog, despite scraping it daily. The AI scrapers and crawlers toggle on Cloudflare did the trick.

By @winddude - 4 months
Not only that, even commoncrawl had issues (about a year ago) where AWS couldn't keep up with the demand for downloading the WARCs.

As someone who has written a lot of crawling infrastructure and managed large-scale crawling operations, respectful crawling is important.

That being said, it always seems like Google has had a massively unfair advantage in crawling, not only in budget but also in brand name and perceived value. It sometimes felt hard to reach out to websites and ask them to allow our crawlers, and grey tactics were often used. And I'm always for a more open internet.

I think regular releases of content in a compressed format would go a long way, but there would always be a race for the freshest content. What might be better is offering the content in a machine-readable format, XML or JSON or even SOAP, which is usually better for what the crawling parties want to achieve, cheaper for you to serve, and less resource-intensive than crawling. (Have them "cache" locally by enforcing rate limiting and signup.)

By @pants2 - 4 months
While the crawling is disrespectful, it seems RTD could find a cheaper host for their files. At my work we have a 10G business fiber line and serve >1PB per month for around $1,500. Takes 90% of the load off our cloud services. Took me just a couple weeks to set up everything.

By @exhaze - 4 months
Having built an AI crawler myself for first party data collection:

1. I intentionally made sure my crawler was slow (I prefer batch processing workflows in general, and this also has the effect of not needing a machine gun crawler rate)

2. For data updates, I made sure to first do a HEAD request and only access the page if it has actually been changed. This is good for me (lower cost), the site owner, and the internet as a whole (minimizes redundant data transfer volume)
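
One way to implement that second check, as a sketch only (assumes the requests library, a made-up URL, and a server that sends Last-Modified; sites that don't would need a fallback such as ETag or a content hash):

    # Hypothetical HEAD-before-GET: only re-download when the page reports a change.
    import requests

    URL = "https://shop.example.com/pricing"      # example page, not a real target
    last_seen = "Tue, 02 Jul 2024 10:00:00 GMT"   # stored from the previous crawl

    head = requests.head(URL, timeout=30)
    if head.headers.get("Last-Modified") == last_seen:
        print("unchanged; skipping the full download")
    else:
        page = requests.get(URL, timeout=30)
        last_seen = page.headers.get("Last-Modified")
        print(f"re-fetched {len(page.content)} bytes")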

Regarding individual site policies, I feel there’s often a “tragedy of the commons” dilemma for any market segment subject to aggregator dominance:

- individual sites often aggressively hide things like pricing information and explicitly disallow crawlers from accessing them

- humans end up having to access them by hand: this results in a given site either not being included at all, or being accessed once and never re-accessed, causing the aggregator's data to go stale

- aggregators often outrank individual sites due to better SEO and likely human preference of aggregators, because it saves them research time

- this results in the original site being put at a competitive disadvantage in SEO, since their product ends up not being listed, or listed with outdated/incorrect information

- that sequence of events leads to negative business outcomes, especially for smaller businesses who often already have a higher chance of failure

Therefore, I believe it’s important to have some sort of standard policy that is implemented and enforced at various levels: CDNs, ISPs, etc.

The policy should be carefully balanced to consider all these factors as well as having a baked in mechanism for low friction amendment based on future emergent effects.

This would result in a much better internet, one that has the property of GINI regulation, ensuring well-distributed outcomes that are optimized for global socioeconomic prosperity as a whole.

Curious to hear others’ perspectives about this idea and how one would even kick off such an ambitious effort.

By @int3 - 4 months
Shouldn't all sites have some kind of bandwidth / cost limiting in place? Not to say that AI crawlers shouldn't be more careful, but there are always malicious actors on the internet; it seems foolish not to have some kind of defense in place.

By @mateozaratefw - 4 months
TikTok's crawler fucked us up by taking a product name (e-commerce) and recursively inserting it into the search bar from the results page. Respect the game, but not respecting the robots.txt crawl delay is awful.
By @troupo - 4 months
What amazes me is that none of this is surprising: all this behavior (not just what's described in the post) is on par with what these companies are doing, and have been doing, for decades... And yet there will be many people, including here on HN, who will just cheer these companies on because they spit out an "open source model" or a 10-dollar-a-month subscription.
By @johneth - 4 months
> "One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."

Invoice the abusers.

They're rolling in investor hype money, and they're obviously not spending it on competent developers if their bots behave like this, so there should be plenty left to cover costs.

By @bo1024 - 4 months
It would be nice to have more of a peer-to-peer infrastructure (torrent inspired) for serving big resources.
By @RecycledEle - 4 months
I suspect the AI companies are as careless with their training as they are with their web scraping.
By @Joel_Mckay - 4 months
Had a conversation with a firm that wanted a distributed scraper built, and they really did not care about site usage policies.

You would be fooling yourselves if you think such a firm cared about robots.txt or page tags.

We warned them they would be sued eventually, told them to contact the site owners for legal access to the data, and issued a hard pass on the project. They probably assumed that if the indexing process ran out of another jurisdiction, their domestic firm wouldn't be liable for theft of service or copyright infringement.

It was my understanding AI/ML does not change legal obligations in business, but the firm probably found someone to build that dubious project eventually...

Spider traps and rate-limiting are good options too. =3

By @bakugo - 4 months
The words "AI" and "respectful" don't belong in the same sentence. The mere concept of generative AI is disrespectful.
By @asdasdsddd - 4 months
Why can't you just rate limit non-browser user agents very aggressively, if your primary audience is human?
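
One crude way to do that at the application layer, as a sketch only (naive user-agent sniffing, made-up thresholds, in-memory counters):

    # Toy per-class throttle: clients whose User-Agent doesn't look like a browser
    # get a much smaller request budget. Every value here is an illustrative default.
    import time
    from collections import defaultdict

    BROWSER_HINTS = ("Mozilla/", "Chrome/", "Safari/")
    LIMITS = {"browser": 600, "other": 10}   # requests allowed per minute

    window_start = defaultdict(float)
    counts = defaultdict(int)

    def allow(user_agent: str) -> bool:
        kind = "browser" if user_agent.startswith(BROWSER_HINTS) else "other"
        now = time.time()
        if now - window_start[kind] > 60:    # start a fresh one-minute window
            window_start[kind] = now
            counts[kind] = 0
        counts[kind] += 1
        return counts[kind] <= LIMITS[kind]

In practice this usually lives in the CDN or reverse proxy rather than application code, and determined bots spoof browser user agents anyway, so it is only a first line of defense.
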
By @guhcampos - 4 months
We all know where this is headed. In 5 years, all useful content on the Web will be behind a paywall.
By @andrei_says_ - 4 months
They will be once we legislate that respect. Not a second earlier.
By @surfingdino - 4 months
Sites need to start suing crawler operators for bandwidth costs.
By @joshu - 4 months
everything old is new again. i remember when someone at google started aggressively crawling del.icio.us from a desktop machine and i ended up blocking all employees...
By @Pesthuf - 4 months
Have capitalists ever stopped just because their actions (that make them money) hurt others? Because the consequences of the damage they cause might in the end hurt them, too?

Hoping they just stop seems futile…

By @xyst - 4 months
So much waste. Even worse than the digital currency rush.
By @croemer - 4 months
Just two buggy crawlers doesn't seem like that many. Sure, they each had a large impact, but given that there are likely hundreds if not thousands of such crawlers out there, it's a rather small number. It seems that most crawlers are actually respectful.