AI crawlers need to be more respectful
Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.
Read the Docs has reported a significant increase in abusive site crawling by AI products, leading to substantial bandwidth costs and operational challenges. The organization, which typically welcomes bots, has had to block several sources of abusive traffic due to aggressive crawling behavior. For instance, one crawler downloaded 73 TB of data in May 2024, costing over $5,000 in bandwidth charges, while another used Facebook's content downloader to pull 10 TB of data in June 2024. These crawlers often lack basic checks, such as rate limiting and support for caching mechanisms, resulting in repeated downloads of unchanged files.
In response to this abuse, Read the Docs has implemented temporary blocks on identified AI crawlers and is enhancing monitoring and rate limiting measures. The organization emphasizes the need for AI companies to adopt more respectful crawling practices to avoid backlash and potential blocking from various sites. They propose collaboration with AI companies to develop integrations that would allow for more efficient and respectful data retrieval, such as alerts for content changes. Read the Docs calls for the implementation of basic checks in crawlers to prevent further issues, highlighting the importance of maintaining good relationships within the community while addressing the challenges posed by current crawling practices.
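None of the "basic checks" the post asks for are exotic. As a rough illustration only (not any particular crawler's code; the user agent and URLs below are made-up placeholders), a polite fetch loop that honors robots.txt and a crawl delay might look like this:

```python
# A minimal sketch of the basic checks described above: honor robots.txt,
# honor a crawl delay, and never hammer a host. The user agent and URLs
# are hypothetical placeholders, not any real crawler's values.
import time
import urllib.robotparser

import requests

USER_AGENT = "ExampleBot/0.1 (+https://example.com/bot)"  # hypothetical
BASE = "https://docs.example.com"                          # hypothetical

robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()
delay = robots.crawl_delay(USER_AGENT) or 10  # fall back to a conservative 10s

for path in ["/en/latest/index.html", "/en/latest/install.html"]:
    url = BASE + path
    if not robots.can_fetch(USER_AGENT, url):
        continue  # the site asked not to have this fetched; respect that
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    resp.raise_for_status()
    # ... parse/store resp.text ...
    time.sleep(delay)  # rate limit: at most one request per crawl-delay interval
```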
Related
OpenAI and Anthropic are ignoring robots.txt
Two AI startups, OpenAI and Anthropic, are reported to be disregarding robots.txt rules, allowing them to scrape web content despite claiming to respect such regulations. TollBit analytics revealed this behavior, raising concerns about data misuse.
We need an evolved robots.txt and regulations to enforce it
In the era of AI, the robots.txt file faces limitations in guiding web crawlers. Proposals advocate for enhanced standards to regulate content indexing, caching, and language model training. Stricter enforcement, including penalties for violators like Perplexity AI, is urged to protect content creators and uphold ethical AI practices.
Block AI bots, scrapers and crawlers with a single click
Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.
Cloudflare rolls out feature for blocking AI companies' web scrapers
Cloudflare introduces a new feature to block AI web scrapers, available in free and paid tiers. It detects and combats automated extraction attempts, enhancing website security against unauthorized scraping by AI companies.
iFixit takes shots at Anthropic for hitting servers a million times in 24 hours
iFixit CEO Kyle Wiens criticized Anthropic for making excessive requests to their servers, violating terms of service. He expressed frustration over unauthorized scraping, highlighting concerns about AI companies accessing content without permission.
They say a single crawler downloaded 73TB of zipped HTML files in a month. That averages out to ~29 MB/s of traffic, every second, for an entire month.
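A quick back-of-the-envelope check of that figure, assuming a 31-day month (the quoted ~29 MB/s sits between the decimal-TB and binary-TiB readings):

```python
# Sanity check of the figure above: 73 TB served over one month.
seconds_per_month = 31 * 24 * 3600            # May 2024 has 31 days
terabyte = 10**12                             # decimal TB; ~10% higher if TiB was meant
rate = 73 * terabyte / seconds_per_month      # bytes per second
print(f"{rate / 10**6:.1f} MB/s sustained")   # ~27 MB/s decimal, ~30 MB/s if TiB
```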
Averaging 30 megabytes a second of traffic for a month is crossing into reckless territory. I don't think any sane engineer would call that normal or healthy for scraping a site like ReadTheDocs; Twitter/Facebook/LinkedIn/etc, sure, but not ReadTheDocs.
To me, this crosses into "recklessly negligent" territory, and I think it should come with government fines for the company that did it. Scraping is totally fine to me, but it needs to be done either a) at a pace that will not impact the provider (read: slowly), or b) with some kind of prior agreement in which the provider accepts responsibility for providing enough capacity.
While I agree that putting content out into the public means it can be scraped, I don't think that necessarily implies scrapers can do their thing at whatever rate they want. As a provider, there's very little difference to me between getting DDoSed and getting scraped to death; both ruin the experience for users.
Wow. That's some seriously disrespectful crawling.
Over 99% of the bandwidth (and CPU) spent by the biggest podcast / music services simply on polling feeds is completely unnecessary. But of course pointing this out to them gets some sort of "oh this is normal, we don't care" response, because they are big enough to know that e.g. podcasters need them.
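Most of that waste is exactly what HTTP conditional requests were designed to eliminate. A rough sketch of a poller that only re-downloads a feed when it has actually changed (the feed URL is a made-up placeholder):

```python
# Sketch of conditional feed polling: send the ETag/Last-Modified values from
# the previous fetch and let the server answer 304 Not Modified with an empty
# body when nothing changed. The feed URL is a hypothetical placeholder.
import requests

FEED_URL = "https://podcast.example.com/feed.xml"  # hypothetical
state = {"etag": None, "last_modified": None}      # persist this between polls

def poll(session: requests.Session) -> bytes | None:
    headers = {}
    if state["etag"]:
        headers["If-None-Match"] = state["etag"]
    if state["last_modified"]:
        headers["If-Modified-Since"] = state["last_modified"]
    resp = session.get(FEED_URL, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged: headers only, no feed body transferred
    resp.raise_for_status()
    state["etag"] = resp.headers.get("ETag")
    state["last_modified"] = resp.headers.get("Last-Modified")
    return resp.content

# Usage: body = poll(requests.Session()); None means nothing new since last poll.
```

Feeds that set ETag or Last-Modified then cost a few hundred bytes per unchanged poll instead of a full re-download.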
> Clients (hopefully bots) that disregard robots.txt and connect to your instance of HellPot will suffer eternal consequences. HellPot will send an infinite stream of data that is just close enough to being a real website that they might just stick around until their soul is ripped apart and they cease to exist.
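For the curious, the tarpit idea is simple to sketch. The snippet below is not HellPot itself (HellPot is a Go project); it is only a minimal illustration, using the standard library plus a placeholder word list, of streaming an endless, vaguely page-shaped response to clients that ignore robots.txt:

```python
# Minimal illustration of a crawler tarpit: an endpoint that streams an
# unbounded, HTML-shaped response. This is NOT HellPot's implementation,
# just a sketch of the concept; the word list and port are arbitrary.
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["docs", "build", "sphinx", "release", "index", "theme", "config"]

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        # No Content-Length: the response is intentionally unbounded.
        self.end_headers()
        try:
            while True:
                chunk = "<p>{}</p>\n".format(" ".join(random.choices(WORDS, k=12)))
                self.wfile.write(chunk.encode("utf-8"))
                time.sleep(0.5)  # trickle data slowly to tie the client up cheaply
        except (BrokenPipeError, ConnectionResetError):
            pass  # the client finally gave up

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```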
As someone who has written a lot of crawling infrastructure and managed large-scale crawling operations, I know respectful crawling is important.
That being said, it has always seemed like Google has had a massively unfair advantage in crawling, not only in budget but in brand name and perceived value. It sometimes felt hard to reach out to websites and ask them to allow our crawlers, and grey tactics were often used. And I'm always for a more open internet.
I think regular releases of content in a compressed format would go a long way, but there would always be a race for having the freshest content. What might be better is offering the content in a machine-readable format: XML or JSON or even SOAP. That is usually better for what the sites doing the crawling want to achieve, cheaper for you to serve, and cheaper and less resource-intensive than crawling. (Have them "cache" locally by enforcing rate limiting and signup.)
1. I intentionally made sure my crawler was slow (I prefer batch processing workflows in general, and this also has the effect of not needing a machine gun crawler rate)
2. For data updates, I made sure to first do a HEAD request and only access the page if it has actually been changed. This is good for me (lower cost), the site owner, and the internet as a whole (minimizes redundant data transfer volume)
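A rough sketch of that second check, with the URL and cached timestamp as made-up placeholders:

```python
# Sketch of the HEAD-before-GET check described above: only download the page
# body if its Last-Modified header differs from what was seen on the last crawl.
# The URL and the previously cached value are hypothetical placeholders.
import requests

url = "https://docs.example.com/en/latest/changelog.html"  # hypothetical
cached_last_modified = "Tue, 02 Jul 2024 10:00:00 GMT"     # from the previous crawl

head = requests.head(url, allow_redirects=True, timeout=30)
head.raise_for_status()

if head.headers.get("Last-Modified") == cached_last_modified:
    print("unchanged, skipping full download")
else:
    page = requests.get(url, timeout=30)  # only now transfer the full body
    page.raise_for_status()
    cached_last_modified = page.headers.get("Last-Modified")
    # ... process page.text and persist the new timestamp ...
```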
Regarding individual site policies, I feel there’s often a “tragedy of the commons” dilemma for any market segment subject to aggregator dominance:
- individual sites often aggressively hide things like pricing information and explicitly disallow crawlers from accessing them
- humans end up having to access them manually: this results in a given site either not being included at all, or being accessed once but never re-accessed, causing aggregator data to go stale
- aggregators often outrank individual sites due to better SEO and likely human preference of aggregators, because it saves them research time
- this results in the original site being put at a competitive disadvantage in SEO, since their product ends up not being listed, or listed with outdated/incorrect information
- that sequence of events leads to negative business outcomes, especially for smaller businesses who often already have a higher chance of failure
Therefore, I believe it’s important to have some sort of standard policy that is implemented and enforced at various levels: CDNs, ISPs, etc.
The policy should be carefully balanced to consider all these factors as well as having a baked in mechanism for low friction amendment based on future emergent effects.
This would result in a much better internet, one with a kind of built-in Gini regulation, ensuring well-distributed outcomes that are optimized for global socioeconomic prosperity as a whole.
Curious to hear others’ perspectives about this idea and how one would even kick off such an ambitious effort.
Invoice the abusers.
They're rolling in investor hype money, and they're obviously not spending it on competent developers if their bots behave like this, so there should be plenty left to cover costs.
You would be fooling yourselves if you think such a firm cared about robots.txt or page tags.
We warned them they would be sued eventually, told them to contact the site owners for legal access to the data, and issued a hard pass on the project. Presumably they assumed that if the indexing process ran out of another jurisdiction, their domestic firm wouldn't be liable for theft of service or copyright infringement.
It was my understanding that AI/ML does not change a business's legal obligations, but the firm probably found someone to build that dubious project eventually...
Spider traps and rate-limiting are good options too. =3
Hoping they just stop seems futile…