July 4th, 2024

Reddit has updated its robots.txt to block all web crawlers

Reddit updated its robots.txt file to block web crawlers, aiming to protect user privacy and prevent content misuse. This change impacts data access for entities like Google, potentially hindering legitimate research. CEO Steve Huffman emphasizes balancing data use costs. The effects on search engines and partnerships are uncertain.

Reddit has updated its robots.txt file to block all web crawlers as part of its strategy to protect user privacy and prevent content misuse. Its new Public Content Policy aims to limit unauthorized data collection on the platform. The change affects access to Reddit content for various entities, including Google, which has a partnership with Reddit. The updated robots.txt now disallows every user agent from crawling any Reddit page, whereas the previous version allowed limited access to certain crawlers. The move may keep researchers and developers from accessing Reddit data for legitimate purposes, but it also serves as a safeguard against the exploitation of user-generated content, particularly for AI training. Reddit's CEO, Steve Huffman, has highlighted the need to balance data use costs and sustainability. The impact on search engine results and partnerships remains unclear, and search engines have been observed behaving inconsistently. Further clarification from Reddit is awaited to confirm what the changes mean in practice.
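To make the effect of a block-all robots.txt concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the file consists of just the two directives quoted in the comments (`User-agent: *` and `Disallow: /`); the helper name `is_allowed` and the sample URLs are illustrative, not anything Reddit publishes.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: every user agent, every path disallowed.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical crawler names and URL, used only for illustration.
    for agent in ("Googlebot", "SomeResearchBot"):
        print(agent, is_allowed(BLOCK_ALL, agent, "https://www.reddit.com/r/python/"))
```

A well-behaved crawler runs a check like this before each request. Nothing in the file itself enforces the answer, so compliance is voluntary.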

OpenAI and Anthropic are ignoring robots.txt

Two AI startups, OpenAI and Anthropic, are reported to be disregarding robots.txt rules and scraping web content despite claiming to respect them. Analytics from TollBit revealed this behavior, raising concerns about data misuse.

We need an evolved robots.txt and regulations to enforce it

In the era of AI, the robots.txt file faces limitations in guiding web crawlers. Proposals advocate for enhanced standards to regulate content indexing, caching, and language model training. Stricter enforcement, including penalties for violators like Perplexity AI, is urged to protect content creators and uphold ethical AI practices.

Google Search Ranks AI Spam Above Original Reporting in News Results

Google Search faces challenges as AI-generated spam surpasses original reporting in news results. Despite efforts to combat this issue, plagiarized articles with AI-generated illustrations dominate search rankings, raising concerns among SEO experts and original content creators.

Block AI bots, scrapers and crawlers with a single click

Cloudflare launches a feature to block AI bots easily, safeguarding content creators from unethical scraping. Identified bots include Bytespider, Amazonbot, ClaudeBot, and GPTBot. Cloudflare enhances bot detection to protect websites.
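As a rough illustration of the simplest form of bot identification, the sketch below matches the `User-Agent` header against the four bot names listed above. This is an assumption-laden toy, not Cloudflare's detection method: the function name `is_listed_ai_bot` and the sample header strings are hypothetical, and a header check like this only catches bots that announce themselves honestly.

```python
# Bot names taken from the summary above; the matching logic is illustrative only.
AI_BOT_TOKENS = ("Bytespider", "Amazonbot", "ClaudeBot", "GPTBot")


def is_listed_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains one of the listed bot names."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)


if __name__ == "__main__":
    # Hypothetical header values for demonstration.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
    ]
    for ua in samples:
        print(is_listed_ai_bot(ua), ua)
```

A scraper that sends a browser-like header passes this check untouched, which is why header matching alone is not sufficient.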

Cloudflare debuts one-click nuke of web-scraping AI

Cloudflare launches a one-click solution to block AI bots that scrape websites without permission. Aimed at combating dishonest AI bot activity, the feature complements the robots.txt method by detecting and blocking bots disguised as browsers.

6 comments

By @dunno7456 - 10 months

"to tell crawler to not crawl" which can be ignored AFAIK

By @littlecranky67 - 10 months

> User-agent: *

> Disallow: /

Uh oh, that means all search engines are gonna delist Reddit content.

By @seeknotfind - 10 months

All of Reddit was freely and readily available just a few years ago. Just goes to show - archive and save what you love.

By @nojvek - 10 months

Reddit making deals with search engines and AI companies for millions of dollars.

Public data belong to Reddit to sell. Makes sense, why would they give it away for free when they can charge for it.

By @b3ing - 10 months

I figured it was to warn people not to use “their” data for AI. The data belongs to their users though

By @nunez - 10 months

God fucking damn it.

"User privacy" my ass. This is a pure lock-in play.

Sorry for the swear words. Reddit was _the_ way I got honest reviews about restaurants, products, and damn near everything, but their search engine was horrible and the platform is very clearly built to drive engagement.

I hate what the Internet has become. I guess it's time to go through the book list I've accumulated over the years.
