July 3rd, 2024

Block AI bots, scrapers and crawlers with a single click

Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.

Cloudflare has introduced a feature that blocks AI bots with a single click, aiming to protect content creators from AI companies that scrape websites without transparency or permission. The tool is available to all customers, including those on the free tier, and can be enabled in the Security > Bots section of the Cloudflare dashboard. The company identified the most active AI bots, including Bytespider, Amazonbot, ClaudeBot, and GPTBot, and documented their behavior to make the case for blocking them. Cloudflare's machine learning models can also detect AI bots that pretend to be real web browsers, improving the accuracy of bot identification. Website operators are encouraged to report misbehaving AI crawlers to Cloudflare for investigation. The company says it will keep improving its bot detection so that content creators retain control over whether their content is used to train AI models.
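
For operators who prefer an explicit rule to the one-click toggle, a similar effect can be approximated with a custom WAF rule. A minimal sketch in Cloudflare's Rules language, matching the bots named above (note this only catches self-announced user agents, unlike the ML-based detection described in the post):

```
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Amazonbot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "GPTBot")
```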

Related

OpenAI and Anthropic are ignoring robots.txt

Two AI startups, OpenAI and Anthropic, are reported to be disregarding robots.txt rules, scraping web content despite claiming to respect the convention. Analytics from TollBit revealed this behavior, raising concerns about data misuse.
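
For reference, robots.txt is a plain text file served at a site's root; compliance is voluntary, which is the whole problem. A minimal file opting out of the two companies' published crawler tokens looks like this:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```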

We need an evolved robots.txt and regulations to enforce it

In the era of AI, the robots.txt file faces limitations in guiding web crawlers. Proposals advocate for enhanced standards to regulate content indexing, caching, and language model training. Stricter enforcement, including penalties for violators like Perplexity AI, is urged to protect content creators and uphold ethical AI practices.
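
The proposals generally boil down to purpose-specific permissions instead of a single allow/deny per crawler. A purely hypothetical sketch of what such directives could look like (none of these exist in any standard today):

```
# Hypothetical syntax for illustration only -- not part of any standard.
User-agent: *
Allow-Purpose: search-indexing
Disallow-Purpose: ai-training
Disallow-Purpose: caching
```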

Bots Compose 42% of Overall Web Traffic; Nearly Two-Thirds Are Malicious

Akamai Technologies reports 42% of web traffic is bots, 65% malicious. Ecommerce faces challenges like data theft, fraud due to web scraper bots. Mitigation strategies and compliance considerations are advised.

Amazon Is Investigating Perplexity over Claims of Scraping Abuse

Amazon's cloud division investigates Perplexity AI for potential scraping abuse, examining violations of AWS rules by using content from blocked websites. Concerns raised over copyright violations and compliance with AWS terms.

'Skeleton Key' attack unlocks the worst of AI, says Microsoft

Microsoft warns of "Skeleton Key" attack exploiting AI models to generate harmful content. Mark Russinovich stresses the need for model-makers to address vulnerabilities. Advanced attacks like BEAST pose significant risks. Microsoft introduces AI security tools.

14 comments
By @iknownothow - 5 months
I've been inadvertently working on this topic and I'd like to share some findings.

* Do not confuse bots with DDoS. While bot traffic may end up overwhelming your server, your DDoS SaaS will not stop that traffic unless you have some kind of bot protection enabled, for example the product described in the post.

* A lot of bots announce themselves via their user agents; some don't (see the sketch after this list).

* If you're running an ecommerce shop with a lot of product pages, expect a large portion of traffic to be bots and scrapers. In our case it was up to 50%, which was surprising.

* Some bots accept cookies and these skew your product analytics.

* We enabled automatic bot protection and a lot of our third-party integrations ended up being marked as bots and their traffic was blocked. We eventually turned that off.

* (EDIT) Any sophisticated self-implemented bot protection isn't worth the effort for most companies out there. But I have to admit, it's very exciting to think about all the ways to block bots.
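
A minimal sketch of the user-agent filtering mentioned above, assuming an illustrative token list and event shape (bots that spoof browser user agents will still slip through, which is where ML-based detection comes in):

```python
# Sketch only: drop analytics events from self-announced bots by User-Agent.
# The token list is illustrative; real lists are much longer.
AI_BOT_TOKENS = ("bytespider", "amazonbot", "claudebot", "gptbot")

def is_self_announced_bot(user_agent: str) -> bool:
    """True if the User-Agent contains a known bot token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token in ua for token in AI_BOT_TOKENS)

def filter_events(events: list[dict]) -> list[dict]:
    """Keep only events that did not come from a self-announced bot."""
    return [e for e in events if not is_self_announced_bot(e.get("user_agent", ""))]

# Example: the second event is dropped before it can skew product analytics.
events = [
    {"page": "/product/42", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
    {"page": "/product/42", "user_agent": "Bytespider (bytedance.com)"},
]
print(filter_events(events))
```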

What's our current status? We've enabled monitoring to keep a lookout for DDoS attempts, but we're taking the hit on bot traffic. The data on our website isn't really private info, except maybe pricing, and we're really unsure how to think about the new AI bots scraping this information. ChatGPT already gives a summary of what our company does. We don't know if that's a good thing or not. Would be happy to hear anyone's thoughts on how to think about this topic.

By @OutOfHere - 5 months
It says "Declare your independence", but your independence is exactly what you stand to lose if you channel your traffic through Cloudflare. You already have your independence; don't give it up to those who appeal to desperation to fool you into believing the opposite of what's true.
By @reustle - 5 months
We are witnessing the last dying breaths of the open internet. Cloudflare in the middle of all traffic, web assembly, etc.
By @Xeamek - 5 months
Does Google effectively get a pass, because it (can) use the same bot both to index websites for search and to scrape data for AI model training?
By @tehryanx - 5 months
I find it slightly ironic that they're only able to do this effectively because they've been able to train their own detection model on traffic, mostly from users who have never agreed to anything.

I don't have strong opinions on this either way really, I just found that a bit funny.

By @acheong08 - 5 months
There are so many things sites need to protect against these days that it's making independent self-hosting quite annoying. As bots get better at hiding, only companies with huge scale like Cloudflare would be able to identify and block them. DDoS/bot providers are unintentionally creating a monopoly.
By @anthonyhn - 5 months
For those not using Cloudflare but who have access to their web server's config files and want to block AI bots, I put together a set of prebuilt configs[0] (for Apache, Nginx, Lighttpd, and Caddy) that will block most AI bots from scraping content. The configs are built on top of public data sources[1] with various adjustments.

[0] https://github.com/anthmn/ai-bot-blocker

[1] https://darkvisitors.com/
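
To give a sense of the general shape such configs take, here is a hand-written nginx sketch (not copied from the repo above; the token list and server_name are placeholders):

```nginx
# Sketch only: classify requests by User-Agent in the http {} context,
# then refuse matches with a 403 inside the server block.
map $http_user_agent $is_ai_bot {
    default 0;
    "~*(bytespider|amazonbot|claudebot|gptbot)" 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder

    if ($is_ai_bot) {
        return 403;
    }
}
```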

By @notachatbot1234 - 5 months
*by channeling all your traffic through Cloudflare.
By @nubinetwork - 5 months
Surprise surprise... bytespider is at the top of the list.
By @speckx - 5 months
I don't see the option to enable this on my Pro sites; however, I see it on my free sites.
By @skeledrew - 5 months
It'll be so interesting to see what sorts of "biases" future AI models will manifest when they're only trained on a fraction of the web. All any group with an agenda has to do is make their content available for training, with the knowledge/hope that many of those with balancing content will have it blocked. And then there will be increased complaints re said "biases" by the same ones who endorse blocking, without a thought that the issue was amplified by said blocking. And of course use cases for AI will continue to broaden, in most cases without a care for those spouting about "biases". It'll be a wonderful world.