October 31st, 2024

Nearly 90% of our AI crawler traffic is from ByteDance

Nearly 90% of HAProxy's AI crawler traffic comes from ByteDance's Bytespider, highlighting the need for businesses to balance increased visibility against the risks of content scraping and misrepresentation.

HAProxy's analysis finds that nearly 90% of its AI crawler traffic originates from ByteDance's Bytespider, which gathers content for generative AI models. AI crawler traffic as a whole accounts for about 1% of total visits, a sign of the growing influence of AI crawlers on web content. These crawlers can enhance brand visibility through AI chatbots, but they also pose risks of content scraping and misrepresentation, so businesses must decide whether to allow them, balancing potential exposure against the risk of unauthorized content use. One mitigation is the robots.txt file, although many AI crawlers do not comply with its directives. HAProxy offers a Bot Management Module that helps identify and manage these bots. Data collected through HAProxy Edge, combined with machine learning, improves bot detection and management. As AI crawler activity continues to rise, businesses are encouraged to develop strategies that protect their content while considering the implications of AI for their digital presence.

- Nearly 90% of AI crawler traffic to HAProxy comes from ByteDance's Bytespider.

- AI crawlers can enhance brand visibility but also risk unauthorized content use.

- Businesses must weigh the benefits of AI exposure against content protection.

- HAProxy provides tools for effective bot management and content protection.

- The rise of AI crawler activity necessitates strategic responses from content-heavy websites.
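
The robots.txt approach mentioned in the summary can be sketched as follows. This is a minimal illustration, not HAProxy's configuration: the robots.txt content and the `is_allowed` helper are assumptions made for the example, and the check only models what a compliant crawler would do. As the summary notes, many AI crawlers ignore these directives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow two AI crawlers by user-agent token,
# allow every other crawler.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Bytespider", "GPTBot", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/docs"))
```

The check takes the short crawler token (for example `GPTBot`) rather than a full User-Agent header, because that is how robots.txt groups are matched.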

Bots Compose 42% of Overall Web Traffic; Nearly Two-Thirds Are Malicious

Akamai Technologies reports that bots make up 42% of web traffic and that 65% of those bots are malicious. Ecommerce sites face challenges such as data theft and fraud driven by web scraper bots. The report advises on mitigation strategies and compliance considerations.

Block AI bots, scrapers and crawlers with a single click

Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.

AI crawlers need to be more respectful

Read the Docs has reported increased abusive AI crawling, leading to high bandwidth costs. They are blocking offenders and urging AI companies to adopt respectful practices and improve crawler efficiency.

Anthropic is scraping websites so fast it's causing problems

Anthropic faces criticism for aggressive web scraping while training its Claude model. The scraping has disrupted websites such as iFixit.com and Freelancer.com and raised ethical concerns about data usage and content creator rights.

Websites Are Blocking the Wrong AI Scrapers

Outdated robots.txt instructions are causing confusion: they block old AI scrapers while allowing ClaudeBot to scrape freely. Many sites haven't updated their blocklists, which complicates management for website owners.

9 comments

By @mmastrac - 6 months

I found that I was getting random bot attacks on progscrape.com with no identifiable bot signature (i.e., the signature matched a valid Chrome Desktop client), but at a rate that was only possible via bot. I ended up having to add token buckets by IP/User Agent to help avoid this deluge of traffic.

Agents that trigger the first level of rate-limiting go through a "tarpit" that holds their connection for a bit before serving it, which seems to keep most of the bad actors in check. It's impossible to block them via robots.txt, and I'm trying to avoid using too big of a hammer on my Cloudflare settings.

EDIT: checking the logs, it seems that the only bot getting tarpitted right now is OpenAI, and they _do_ have a GPTBot signature:

    2024-10-31T02:30:23.312139Z  WARN progscrape::web: User hit soft rate limit: ratelimit=soft ip="20.171.206.77" browser=Some("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)") method=GET uri=/?search=science.org
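
A token bucket keyed by IP and User-Agent, with a tarpit delay once the bucket runs dry, might look roughly like the sketch below. It is not progscrape's actual code: the class names, capacity, refill rate, and tarpit duration are all assumptions chosen for illustration, and only the soft limit is modeled.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated = now

    def take(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class SoftRateLimiter:
    """One bucket per (ip, user_agent); clients over the limit get tarpitted."""

    def __init__(self, capacity=5.0, refill_per_sec=1.0, tarpit_seconds=2.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tarpit_seconds = tarpit_seconds
        self.buckets = {}

    def delay_for(self, ip: str, user_agent: str, now=None) -> float:
        """Return seconds to hold the connection before serving (0.0 if under the limit)."""
        now = time.monotonic() if now is None else now
        key = (ip, user_agent)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.capacity, self.refill_per_sec, now)
        return 0.0 if bucket.take(now) else self.tarpit_seconds
```

A request handler would sleep for the returned delay before responding. A real deployment would also need to evict idle buckets, since this sketch lets the dictionary grow without bound.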

By @jhpacker - 6 months

Cloudflare Radar, which is presumably a much bigger and better sample, reports Bytespider as the #5 AI crawler behind FB, Amazon, GPTBot, and Google: https://radar.cloudflare.com/explorer?dataSet=ai.bots And that's not including most of the highest-volume spiders overall, like Googlebot, Bingbot, Yandex, Ahrefs, etc.

Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.

By @neilv - 6 months

Given the high-profile national security scrutiny that ByteDance was already in over TikTok, and now with the AI training competitiveness on national authorities' minds, maybe this behavior by ByteDance is on the radar of someone who's thinking of whether CFAA or other regulation applies.

As someone who's built multiple (respectful) Web crawlers, for academic research and for respectable commerce, I'm wondering whether abusers are going to make it harder for legitimate crawlers to operate.

By @wtf242 - 6 months

I had the same issue with TikTok/ByteDance. They were using almost 100 GB of my traffic per month.

I now block all AI crawlers at the Cloudflare WAF level. On Monday I noticed a HUGE spike in traffic and my site was not handling it well. After a lot of troubleshooting and log parsing, I found I was getting millions of requests from China that were getting past Cloudflare's bot protection.

I ended up having to force a CF managed challenge for the entire country of China to get my site back in a normal working state.

In the past 24 hours CF has blocked 1.66M bot requests. Good luck running a site without using Cloudflare or something similar.

AI crawlers are just out of control.

By @PittleyDunkin - 6 months

How do you differentiate between "ai" (whatever that means) and other crawlers?
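
One common, admittedly limited, answer is to match the User-Agent header against tokens the crawlers announce themselves with. Below is a minimal sketch, assuming the four bot names Cloudflare is reported to identify above (Bytespider, Amazonbot, ClaudeBot, GPTBot). The function name is hypothetical, and a crawler that presents a browser signature will not be caught this way.

```python
from typing import Optional

# Tokens taken from the bot names listed in the Cloudflare item above;
# a real list would need regular updating.
AI_CRAWLER_TOKENS = ("Bytespider", "Amazonbot", "ClaudeBot", "GPTBot")


def ai_crawler_name(user_agent: str) -> Optional[str]:
    """Return the matching AI crawler token, or None if none is present."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


if __name__ == "__main__":
    gptbot = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
              "compatible; GPTBot/1.2; +https://openai.com/gptbot)")
    print(ai_crawler_name(gptbot))
```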

By @odc - 6 months

Good to know there are other solutions than Cloudflare to block those leeches.

By @sghiassy - 6 months

It’s 90% of 1%… title is misleading

By @yazzku - 6 months

tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HA offering to stop them through other means.
