December 19th, 2024

The Rise of the AI Crawler

AI crawlers such as OpenAI's GPTBot and Anthropic's Claude generate significant web traffic but do not execute JavaScript and crawl inefficiently, prompting recommendations for server-side rendering and better URL management.


AI crawlers have emerged as a significant force on the web, with OpenAI's GPTBot and Anthropic's Claude generating substantial traffic across Vercel's network. In the past month, GPTBot made 569 million requests and Claude 370 million; together that is about 28% of Googlebot's total request volume.

Despite their growing presence, these crawlers face real limitations. Analysis shows that none of them execute JavaScript, which cuts them off from client-rendered content. They are also inefficient, with high rates of 404 errors and redirects that point to poor URL handling on both sides. The study also found distinct content-fetching patterns: ChatGPT favors HTML, while Claude fetches a larger share of images.

For web developers, the recommendations are to server-side render critical content so it is present in the initial HTML, and to maintain clean URLs (accurate sitemaps, correct redirects) so crawlers do not waste requests. Sites wishing to restrict crawler access can use robots.txt or Vercel's firewall options. Overall, while AI crawlers are scaling rapidly, they still lag behind traditional search engines in efficiency and content handling.
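The server-side rendering recommendation is worth making concrete. Below is a minimal sketch, assuming a Next.js pages-router app and a hypothetical CMS endpoint: because the content is fetched and rendered on the server, it arrives in the initial HTML response, so a crawler that never executes JavaScript still sees it.

```tsx
// pages/article/[slug].tsx — hypothetical page and data source, for illustration.
import type { GetServerSideProps } from "next";

type Article = { title: string; body: string };

export const getServerSideProps: GetServerSideProps<{ article: Article }> = async (ctx) => {
  // Hypothetical CMS endpoint; any server-side data fetch works the same way.
  const res = await fetch(`https://cms.example.com/articles/${ctx.params?.slug}`);
  // Emit a real 404 status instead of a client-side "not found" screen,
  // so crawlers get an accurate signal.
  if (!res.ok) return { notFound: true };
  const article: Article = await res.json();
  return { props: { article } };
};

export default function ArticlePage({ article }: { article: Article }) {
  // Rendered on the server into the initial HTML — no JS execution required to read it.
  return (
    <main>
      <h1>{article.title}</h1>
      <p>{article.body}</p>
    </main>
  );
}
```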

- AI crawlers are generating significant web traffic, with GPTBot and Claude leading the way.

- These crawlers do not execute JavaScript, limiting their access to dynamic content.

- High rates of 404 errors indicate inefficiencies in AI crawler behavior.

- Recommendations include server-side rendering for critical content and efficient URL management (a redirect sketch follows this list).

- Web developers can use robots.txt and firewall rules to control crawler access (a robots.txt sketch follows below).
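On the URL-management point, a common source of crawler 404s is content that moved without a redirect. A sketch of fixing that with Next.js's redirects config (the paths here are hypothetical):

```js
// next.config.js — permanent redirects for moved paths, so crawlers
// stop burning requests on dead URLs and update their stored links.
module.exports = {
  async redirects() {
    return [
      {
        source: "/old-blog/:slug",
        destination: "/blog/:slug",
        permanent: true, // 308 Permanent Redirect
      },
    ];
  },
};
```

And for restricting access, a minimal robots.txt, assuming the user-agent tokens GPTBot and ClaudeBot that OpenAI and Anthropic publish for their crawlers:

```txt
# Block the two AI crawlers discussed above from the whole site.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```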
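Note that robots.txt is advisory: well-behaved crawlers honor it, but it is not an enforcement mechanism, which is why the article pairs it with firewall rules for hard blocking.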

2 comments
By @keyle - about 1 month ago
That was very interesting. It's amusing to me how Google's crawler is more efficient than the competition. Maybe Google isn't a search business as much as a crawler business now!

I've thought about updating my robots.txt but I really don't see the point. It's a cat and mouse game and there is no 'delete previous fetch' in robots.txt. I seriously doubt that they would update their index to remove contents if the robots.txt changed over time. Besides, it's going to be death by a thousand paper cuts, keeping up with all these robots. Most of the young ones certainly wouldn't abide by robots.txt anyway as they're just in alpha, right?...

The cat's out of the bag now... The only real purpose of having a restrictive robots.txt is to reduce traffic, imho. Or to help these robots find content that they wouldn't normally dig "that deep".