July 4th, 2024

Cloudflare rolls out feature for blocking AI companies' web scrapers

Cloudflare introduces a new feature to block AI web scrapers, available in free and paid tiers. It detects and combats automated extraction attempts, enhancing website security against unauthorized scraping by AI companies.

Cloudflare has introduced a feature that blocks the web scrapers artificial intelligence (AI) companies use to extract website content. The feature is part of Cloudflare's content delivery network (CDN) and is available in both free and paid tiers. Many AI companies rely on web content to train their large language models (LLMs), and some LLM developers give website operators no way to opt out; Cloudflare's tool is aimed at that gap. The feature uses AI to detect automated content extraction attempts, including those that try to evade detection by mimicking real browsers. Cloudflare says it will keep updating the feature as AI scraping bots change, and it is also launching a tool that lets website operators report new bots they encounter. The company's system assigns each website visit a score to identify potential bot activity; requests from a bot collecting content for Perplexity AI consistently received low scores. The initiative aims to improve website security and prevent unauthorized scraping by AI companies.
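As a rough illustration of how a per-visit score can be turned into a decision, here is a minimal sketch. It assumes a 1 to 99 scale on which lower scores mean a request is more likely automated; the cutoff of 30 and the function name `classify_request` are assumptions made for this example, not details taken from the article.

```python
# Hypothetical cutoff: scores below this are treated as likely automated.
LIKELY_BOT_THRESHOLD = 30

def classify_request(score: int) -> str:
    """Label a visit from its bot score (assumed range: 1 to 99, low = bot-like)."""
    if not 1 <= score <= 99:
        raise ValueError(f"score out of assumed range 1..99: {score}")
    if score < LIKELY_BOT_THRESHOLD:
        return "likely automated"
    return "likely human"

# Example: a request that consistently scores low is flagged as automated.
for score in (5, 29, 30, 80):
    print(score, classify_request(score))
```

A real deployment would act on the label (block, challenge, or allow) rather than just print it, and the right cutoff depends on the site's tolerance for false positives.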

OpenAI and Anthropic are ignoring robots.txt

Two AI startups, OpenAI and Anthropic, are reported to be disregarding robots.txt rules and scraping web content despite claiming to respect those rules. Analytics from TollBit revealed the behavior, raising concerns about data misuse.

Bots Compose 42% of Overall Web Traffic; Nearly Two-Thirds Are Malicious

Akamai Technologies reports that bots make up 42% of web traffic and that 65% of those bots are malicious. Ecommerce sites face problems such as data theft and fraud caused by web scraper bots. Mitigation strategies and compliance considerations are advised.

Amazon Is Investigating Perplexity over Claims of Scraping Abuse

Amazon's cloud division is investigating Perplexity AI for potential scraping abuse, examining whether it violated AWS rules by using content from websites that had blocked scraping. Concerns have been raised over copyright violations and compliance with AWS terms.

Block AI bots, scrapers and crawlers with a single click

Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.

Cloudflare debuts one-click nuke of web-scraping AI

Cloudflare launches a one-click solution to block AI bots that scrape websites without permission. Aimed at combating dishonest AI bot activity, the feature complements the robots.txt method by detecting and blocking bots disguised as browsers.
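For readers unfamiliar with the robots.txt method these stories refer to, the sketch below builds a robots.txt that disallows the four bots named above and checks it with Python's standard `urllib.robotparser`. The helper names and the `example.com` URL are illustrative assumptions. Note that robots.txt is advisory: it only works if the crawler chooses to honor it, which is the gap Cloudflare's blocking feature is meant to cover.

```python
from urllib.robotparser import RobotFileParser

# User agents named in the summary above.
AI_BOTS = ["Bytespider", "Amazonbot", "ClaudeBot", "GPTBot"]

def build_robots_txt(bots):
    """Return robots.txt text that disallows every path for each listed bot."""
    blocks = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "\n\n".join(blocks) + "\n"

def is_allowed(robots_txt, user_agent, url):
    """Check whether a well-behaved crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots_txt = build_robots_txt(AI_BOTS)
print(robots_txt)

# Hypothetical page URL, used only for the check.
page = "https://example.com/article"
print(is_allowed(robots_txt, "GPTBot", page))       # listed bot
print(is_allowed(robots_txt, "Mozilla/5.0", page))  # ordinary browser agent
```

Because no wildcard `User-agent: *` rule is included, agents that are not listed remain free to fetch any path.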

4 comments

By @unyttigfjelltol - 10 months

I use Perplexity and it says its process is to break down a request into a series of web searches that it conducts at central servers in real time basically at my request. It then reviews the pages for relevant information and provides a summary of sorts.

I can imagine this might result in Perplexity having a bigger visibility to web site owners, because it's constantly doing bursts of searches from a central server that aren't labeled like a web scraper. But, that's exactly how it responds in real time to my user inputs and requests.

Cloudflare's service appears to target actual web scraping instead of real time searches, so I'm not sure they actually hit the nail on the head in mentioning Perplexity in relation to their service.

By @iruoy - 10 months

See https://news.ycombinator.com/item?id=40865627 for cloudflare's blogpost.

By @bob1029 - 10 months

What does the end game look like?

Browser automation tools imply that we can battle this forever. Much like with copyright and other forms of DRM/anti-cheating technology.

I don't believe there exists purpose in chasing this rabbit beyond making customers and investors believe they have a chance of actually catching it.

By @zx8080 - 10 months

It's probably going to be like a firewall or antivirus or "endpoint security" market. Protection from AI intelligence as-a-service.
