Anthropic is scraping websites so fast it's causing problems
Anthropic faces criticism for aggressive web scraping while training its Claude model; the crawling has disrupted websites such as iFixit.com and Freelancer.com and raised ethical concerns about data usage and content creators' rights.
Anthropic has been criticized for aggressive web scraping while training its Claude language model. Reports indicate that its crawler, ClaudeBot, has made excessive requests to various websites: iFixit.com logged a million hits in a single day, and Freelancer.com reported an even more extreme case, 3.5 million hits in just four hours, leading it to block the bot because of the disruption to its operations. These incidents have raised concerns about the ethics of data scraping, especially since Anthropic reportedly ignores directives set in robots.txt files, which are intended to tell web crawlers what data they may access. Despite being founded by former OpenAI researchers with a stated commitment to developing responsible AI systems, Anthropic's current practices have drawn parallels to the broader issue of plagiarism in AI model training. Other organizations, such as Read the Docs, have also called for more respectful behavior from AI crawlers. The situation highlights the ongoing tension between AI development and the rights of content creators, as companies try to balance their need for data against ethical considerations.
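For context, robots.txt is a plain-text file served at a site's root that asks crawlers to stay away from certain paths; compliance is entirely voluntary. A minimal sketch (the "ClaudeBot" token matches the crawler name reported above; the /private/ path is purely illustrative):

# Ask Anthropic's crawler to stay out entirely.
User-agent: ClaudeBot
Disallow: /

# Ask all other crawlers to avoid one directory.
User-agent: *
Disallow: /private/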
Related
OpenAI and Anthropic are ignoring robots.txt
Two AI startups, OpenAI and Anthropic, are reported to be disregarding robots.txt rules and scraping web content despite claiming to respect such directives. Analytics from TollBit revealed the behavior, raising concerns about data misuse.
iFixit takes shots at Anthropic for hitting servers a million times in 24 hours
iFixit CEO Kyle Wiens criticized Anthropic for making excessive requests to their servers, violating terms of service. He expressed frustration over unauthorized scraping, highlighting concerns about AI companies accessing content without permission.
AI crawlers need to be more respectful
Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.
iFixit CEO takes shots at Anthropic for hitting servers a million times in 24h
iFixit CEO Kyle Wiens criticized Anthropic for making excessive requests to their servers, violating terms of service. This incident highlights concerns about AI companies ignoring website policies and ethical data scraping issues.
Anthropic accused of 'egregious' data scraping
AI startup Anthropic faces accusations of aggressive data scraping that disrupted web publishers' services. Despite its claims of compliance, concerns are growing over its ethical practices and potential violations of terms of service.
ByteDance's crawler (Bytespider) is another one that disregards robots.txt but still identifies itself, and you should probably block it because it's very aggressive (a sample server rule is sketched below).
It's going to get annoying fast when they inevitably go full blackhat and start masquerading as normal browser traffic.
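Because a crawler like Bytespider reportedly ignores robots.txt, a Disallow rule alone won't stop it; the block has to happen at the server, keyed on the self-reported User-Agent string, which of course only works for as long as the crawler keeps identifying itself. A minimal, hypothetical nginx sketch:

# Inside a server { } block: refuse any request whose
# User-Agent header mentions Bytespider (case-insensitive).
if ($http_user_agent ~* "bytespider") {
    return 403;
}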
Doesn't seem supported by the citation: https://www.404media.co/websites-are-blocking-the-wrong-ai-s...
Some more discussion: https://news.ycombinator.com/item?id=41060559
(disclaimer: I wrote this blog post)
They even respect extended robots.txt features, like:
User-agent: *
Disallow: /library/*.pdf$
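For reference, the * wildcard and the $ end-of-URL anchor in that rule began as nonstandard extensions popularized by Google's crawler and were later codified in RFC 9309, so support still varies between crawlers.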
I make my websites for other people to see. They are not secrets I hoard whose value goes away when copied; the more copies and derivations, the better. I guess ideas like Creative Commons and sharing go away when the smell of money enters the water. Better lock all your text behind paywalls so the evil corporations won't get it. Just be aware: for every incorporated entity you block, you're blocking just as many humans with false positives, if not more. This anti-"scraping" hysteria is mostly profit motivated.
Funny thing: with WASM, the web won't be scrapable.