July 31st, 2024

Websites Are Blocking the Wrong AI Scrapers

Outdated robots.txt instructions are causing confusion, blocking Anthropic's retired scrapers while leaving its current crawler, CLAUDEBOT, free to scrape. Many sites haven't updated their blocklists, and keeping up with the shifting roster of bots is a growing burden for website owners.


Many websites are mistakenly blocking Anthropic's retired AI scrapers while leaving its current crawler, CLAUDEBOT, unblocked. The confusion stems from website owners copying and pasting old robots.txt instructions that no longer reflect which bots are actually in use. Anthropic has confirmed that its older agents, ANTHROPIC-AI and CLAUDE-WEB, have been retired and that CLAUDEBOT is now its active crawler. Even so, many popular sites, including Reuters and Condé Nast, have not updated their blocklists, leaving CLAUDEBOT free to scrape their content.

The operator of Dark Visitors, a site that tracks web crawlers, noted that the rapid churn of AI scrapers makes it difficult for website owners to keep blocklists current. Some sites respond by blocking all crawlers or allowing only a handful, which can inadvertently shut out legitimate services such as search engines and academic research tools. The Data Provenance Initiative likewise highlighted the burden of tracking these evolving agents: many owners simply do not know which bots are active or who operates them.

The situation has prompted calls for AI companies to be more respectful of website owners' preferences and for creators to consider paywalls to protect their content from unregulated scraping. Overall, the landscape of AI scrapers is complex and constantly changing, leaving content creators and website operators with significant confusion.
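To make the failure mode concrete, here is a sketch of the kind of stale robots.txt the article describes, followed by the entry many sites are missing. The user-agent tokens are the ones named in the article; exact casing in a live file may vary, and this is an illustration rather than Anthropic's published guidance.

    # Stale entries: these agents are retired, so the rules are dead weight
    User-agent: ANTHROPIC-AI
    Disallow: /

    User-agent: CLAUDE-WEB
    Disallow: /

    # Missing from many blocklists: the crawler Anthropic says is active
    User-agent: CLAUDEBOT
    Disallow: /

A robots.txt rule only matters if the named agent still exists and chooses to honor it; blocking a retired name has no effect on the bot that replaced it.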

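For owners who want to verify what their current robots.txt actually permits, a minimal sketch using Python's standard-library urllib.robotparser can check a given user agent against a live file. The example.com URL is a placeholder to swap for your own domain; the agent names are the ones from the article.

    from urllib.robotparser import RobotFileParser

    # Hypothetical target site; replace with your own domain.
    ROBOTS_URL = "https://example.com/robots.txt"

    # Agent names from the article: two retired, one active.
    AGENTS = ["ANTHROPIC-AI", "CLAUDE-WEB", "CLAUDEBOT"]

    rp = RobotFileParser()
    rp.set_url(ROBOTS_URL)
    rp.read()  # fetch and parse the live robots.txt

    for agent in AGENTS:
        allowed = rp.can_fetch(agent, "https://example.com/")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")

If the first two names come back blocked while CLAUDEBOT comes back allowed, the site has exactly the stale blocklist the article describes.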