The Rise of the AI Crawler
AI crawlers like GPTBot and Claude generate significant web traffic but cannot render JavaScript and crawl inefficiently, with high rates of 404s and redirects. Recommendations include server-side rendering for critical content and efficient URL management for better accessibility.
AI crawlers have emerged as a significant force on the web, with OpenAI's GPTBot and Anthropic's Claude generating substantial traffic across Vercel's network. In the past month, GPTBot made 569 million requests and Claude 370 million, together amounting to about 28% of Googlebot's total request volume. Despite their growing presence, AI crawlers face challenges, particularly with JavaScript rendering: none of the major AI crawlers, including ChatGPT and Claude, currently executes JavaScript, which limits their access to client-side rendered content. Googlebot, by contrast, renders JavaScript effectively. The analysis also revealed inefficiencies such as high rates of 404 errors and redirects, pointing to a need for better URL management. The crawlers also prioritize different content types, with ChatGPT focusing on HTML and Claude on images. Recommendations for web developers include prioritizing server-side rendering for critical content and maintaining efficient URL management to improve crawler accessibility. For those wishing to restrict crawler access, robots.txt rules and Vercel's firewall options are advised; a minimal robots.txt example follows. Overall, while AI crawlers are scaling rapidly, they still require optimization to navigate and index modern web applications effectively.
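As a concrete illustration, here is a minimal robots.txt that opts a site out of both crawlers discussed above. GPTBot and ClaudeBot are the published user-agent tokens for OpenAI's and Anthropic's crawlers; note that robots.txt is advisory, so a firewall rule remains the stronger enforcement mechanism.

```
# robots.txt, served from the site root
# Block OpenAI's and Anthropic's crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other crawlers remain allowed
User-agent: *
Allow: /
```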
- AI crawlers are generating significant web traffic, with GPTBot and Claude leading in requests.
- Major AI crawlers do not execute JavaScript, limiting their access to dynamic content.
- High rates of 404 errors and redirects indicate inefficiencies in AI crawler behavior.
- Recommendations include server-side rendering for critical content and efficient URL management (see the sketch after this list).
- Web developers can use robots.txt and firewalls to control crawler access.
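Because none of the major AI crawlers executes JavaScript, the server-side rendering recommendation is worth making concrete. The sketch below uses Next.js (a natural fit given the Vercel context, though any SSR framework works); the route, API URL, and field names are hypothetical placeholders.

```tsx
// pages/article/[slug].tsx -- hypothetical route, for illustration only
import type { GetServerSideProps } from "next";

type Props = { title: string; body: string };

// Fetch and render on the server so crawlers that never run JavaScript
// (GPTBot, ClaudeBot, etc.) still receive the full content as HTML.
export const getServerSideProps: GetServerSideProps<Props> = async ({ params }) => {
  // Hypothetical data source -- swap in your real CMS or database call.
  const res = await fetch(`https://api.example.com/articles/${params?.slug}`);
  if (!res.ok) {
    // Return a proper 404 for missing content instead of rendering
    // an empty shell page.
    return { notFound: true };
  }
  const article = await res.json();
  return { props: { title: article.title, body: article.body } };
};

// The component emits plain HTML; no client-side fetch is needed
// for the critical content.
export default function ArticlePage({ title, body }: Props) {
  return (
    <main>
      <h1>{title}</h1>
      <p>{body}</p>
    </main>
  );
}
```

The design point is simply that the HTML a crawler receives already contains the content; a client-rendered version of the same page would look empty to GPTBot or ClaudeBot.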
Related
AI crawlers need to be more respectful
Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.
Anthropic is scraping websites so fast it's causing problems
Anthropic faces criticism for aggressive web scraping to train its Claude model, disrupting websites like iFixit.com and Freelancer.com and raising ethical concerns about data usage and content creators' rights.
Websites Are Blocking the Wrong AI Scrapers
Outdated robots.txt instructions are causing confusion, blocking old AI scrapers while allowing ClaudeBot to scrape freely. Many sites haven't updated their blocklists, complicating management for website owners.
AI Has Created a Battle over Web Crawling
The rise of generative AI has prompted websites to restrict data access via robots.txt, leading to concerns over declining training data quality and potential impacts on AI model performance.
Nearly 90% of our AI crawler traffic is from ByteDance
Nearly 90% of HAProxy's AI crawler traffic is from ByteDance's Bytespider, highlighting the need for businesses to balance increased visibility with the risks of content scraping and misrepresentation.