The Rise of the AI Crawler
AI crawlers like GPTBot and Claude generate significant web traffic but cannot render JavaScript and crawl inefficiently, with high rates of 404s and redirects. Recommendations include server-side rendering for critical content and efficient URL management for better accessibility.
AI crawlers have emerged as a significant force on the web, with OpenAI's GPTBot and Anthropic's Claude generating substantial traffic across Vercel's network. In the past month, GPTBot made 569 million requests and Claude 370 million, together amounting to about 28% of Googlebot's total request volume. Despite their growing presence, AI crawlers face challenges, particularly with JavaScript rendering: none of the major AI crawlers, including ChatGPT and Claude, currently executes JavaScript, which limits their access to client-side rendered content. Googlebot, by contrast, renders JavaScript effectively. The analysis also revealed inefficiencies such as high rates of 404 errors and redirects, pointing to a need for better URL management. The crawlers also prioritize different content types, with ChatGPT focusing on HTML and Claude on images. Recommendations for web developers include prioritizing server-side rendering for critical content and maintaining efficient URL management to improve crawler accessibility. For those wishing to restrict crawler access, robots.txt rules and Vercel's firewall options are advised; a minimal robots.txt example follows. Overall, while AI crawlers are scaling rapidly, they still require optimization to navigate and index modern web applications effectively.
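As a concrete illustration, here is a minimal robots.txt that opts a site out of both crawlers discussed above. GPTBot and ClaudeBot are the published user-agent tokens for OpenAI's and Anthropic's crawlers; note that robots.txt is advisory, so a firewall rule remains the stronger enforcement mechanism.

```
# robots.txt, served from the site root
# Block OpenAI's and Anthropic's crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other crawlers remain allowed
User-agent: *
Allow: /
```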
- AI crawlers are generating significant web traffic, with GPTBot and Claude leading in requests.
- Major AI crawlers do not execute JavaScript, limiting their access to dynamic content.
- High rates of 404 errors and redirects indicate inefficiencies in AI crawler behavior.
- Recommendations include server-side rendering for critical content and efficient URL management (see the sketch after this list).
- Web developers can use robots.txt and firewalls to control crawler access.
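Because none of the major AI crawlers executes JavaScript, the server-side rendering recommendation is worth making concrete. The sketch below uses Next.js (a natural fit given the Vercel context, though any SSR framework works); the route, API URL, and field names are hypothetical placeholders.

```tsx
// pages/article/[slug].tsx -- hypothetical route, for illustration only
import type { GetServerSideProps } from "next";

type Props = { title: string; body: string };

// Fetch and render on the server so crawlers that never run JavaScript
// (GPTBot, ClaudeBot, etc.) still receive the full content as HTML.
export const getServerSideProps: GetServerSideProps<Props> = async ({ params }) => {
  // Hypothetical data source -- swap in your real CMS or database call.
  const res = await fetch(`https://api.example.com/articles/${params?.slug}`);
  if (!res.ok) {
    // Return a proper 404 for missing content instead of rendering
    // an empty shell page.
    return { notFound: true };
  }
  const article = await res.json();
  return { props: { title: article.title, body: article.body } };
};

// The component emits plain HTML; no client-side fetch is needed
// for the critical content.
export default function ArticlePage({ title, body }: Props) {
  return (
    <main>
      <h1>{title}</h1>
      <p>{body}</p>
    </main>
  );
}
```

The design point is simply that the HTML a crawler receives already contains the content; a client-rendered version of the same page would look empty to GPTBot or ClaudeBot.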
Related
AI crawlers need to be more respectful
Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.
Anthropic is scraping websites so fast it's causing problems
Anthropic faces criticism for aggressive web scraping to train its Claude model, disrupting websites like iFixit.com and Freelancer.com and raising ethical concerns about data usage and content creators' rights.
Websites Are Blocking the Wrong AI Scrapers
Outdated robots.txt instructions are causing confusion, blocking old AI scrapers while allowing ClaudeBot to scrape freely. Many sites haven't updated their blocklists, complicating management for website owners.
AI Has Created a Battle over Web Crawling
The rise of generative AI has prompted websites to restrict data access via robots.txt, leading to concerns over declining training data quality and potential impacts on AI model performance.
Nearly 90% of our AI crawler traffic is from ByteDance
Nearly 90% of HAProxy's AI crawler traffic is from ByteDance's Bytespider, highlighting the need for businesses to balance increased visibility with the risks of content scraping and misrepresentation.