December 19th, 2024

The Rise of the AI Crawler

AI crawlers such as OpenAI's GPTBot and Anthropic's Claude generate significant web traffic but do not execute JavaScript and crawl inefficiently, prompting recommendations for server-side rendering and better URL management.


AI crawlers have emerged as a significant force on the web, with OpenAI's GPTBot and Anthropic's Claude generating substantial traffic across Vercel's network. In the past month, GPTBot made 569 million requests and Claude 370 million; together that is about 28% of Googlebot's total request volume.

Despite their growing presence, these crawlers face real limitations. Analysis shows that none of them execute JavaScript, which cuts them off from client-rendered content. They are also inefficient, with high rates of 404 errors and redirects that point to poor URL handling on both sides. The study also found distinct content-fetching patterns: ChatGPT favors HTML, while Claude fetches a larger share of images.

For web developers, the recommendations are to server-side render critical content so it is present in the initial HTML, and to maintain clean URLs (accurate sitemaps, correct redirects) so crawlers do not waste requests. Sites wishing to restrict crawler access can use robots.txt or Vercel's firewall options. Overall, while AI crawlers are scaling rapidly, they still lag behind traditional search engines in efficiency and content handling.
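The server-side rendering recommendation is worth making concrete. Below is a minimal sketch, assuming a Next.js pages-router app and a hypothetical CMS endpoint: because the content is fetched and rendered on the server, it arrives in the initial HTML response, so a crawler that never executes JavaScript still sees it.

```tsx
// pages/article/[slug].tsx — hypothetical page and data source, for illustration.
import type { GetServerSideProps } from "next";

type Article = { title: string; body: string };

export const getServerSideProps: GetServerSideProps<{ article: Article }> = async (ctx) => {
  // Hypothetical CMS endpoint; any server-side data fetch works the same way.
  const res = await fetch(`https://cms.example.com/articles/${ctx.params?.slug}`);
  // Emit a real 404 status instead of a client-side "not found" screen,
  // so crawlers get an accurate signal.
  if (!res.ok) return { notFound: true };
  const article: Article = await res.json();
  return { props: { article } };
};

export default function ArticlePage({ article }: { article: Article }) {
  // Rendered on the server into the initial HTML — no JS execution required to read it.
  return (
    <main>
      <h1>{article.title}</h1>
      <p>{article.body}</p>
    </main>
  );
}
```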

- AI crawlers are generating significant web traffic, with GPTBot and Claude leading the way.

- These crawlers do not execute JavaScript, limiting their access to dynamic content.

- High rates of 404 errors indicate inefficiencies in AI crawler behavior.

- Recommendations include server-side rendering for critical content and efficient URL management (a redirect sketch follows this list).

- Web developers can use robots.txt and firewall rules to control crawler access (a robots.txt sketch follows below).
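On the URL-management point, a common source of crawler 404s is content that moved without a redirect. A sketch of fixing that with Next.js's redirects config (the paths here are hypothetical):

```js
// next.config.js — permanent redirects for moved paths, so crawlers
// stop burning requests on dead URLs and update their stored links.
module.exports = {
  async redirects() {
    return [
      {
        source: "/old-blog/:slug",
        destination: "/blog/:slug",
        permanent: true, // 308 Permanent Redirect
      },
    ];
  },
};
```

And for restricting access, a minimal robots.txt, assuming the user-agent tokens GPTBot and ClaudeBot that OpenAI and Anthropic publish for their crawlers:

```txt
# Block the two AI crawlers discussed above from the whole site.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```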
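Note that robots.txt is advisory: well-behaved crawlers honor it, but it is not an enforcement mechanism, which is why the article pairs it with firewall rules for hard blocking.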

2 comments
By @keyle - about 1 month ago
That was very interesting. It's amusing to me how Google's crawler is more efficient than the competition. Maybe Google isn't a search business as much as a crawler business now!

I've thought about updating my robots.txt but I really don't see the point. It's a cat and mouse game and there is no 'delete previous fetch' in robots.txt. I seriously doubt that they would update their index to remove contents if the robots.txt changed over time. Besides, it's going to be death by a thousand paper cuts, keeping up with all these robots. Most of the young ones certainly wouldn't abide by robots.txt anyway as they're just in alpha, right?...

The cat's out of the bag now... The only real purpose of having a restrictive robots.txt is to reduce traffic, imho. Or to help these robots find content that they wouldn't normally dig "that deep".