July 24th, 2024

Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal

Google secures exclusive search rights on Reddit through a lucrative deal, hindering other search engines' access. Reddit tightens restrictions to safeguard content and address challenges posed by dominant search engines.

Google has become the exclusive search engine for Reddit due to a multi-million dollar deal that allows Google to scrape Reddit for data to train its AI products. Other search engines like Bing, DuckDuckGo, and Mojeek are no longer able to provide full Reddit results, limiting users' access to recent content. Reddit has updated its robots.txt file to block certain crawlers, including those used by AI companies, to protect its content from misuse. The deal between Google and Reddit highlights the challenges smaller search engines face in competing with Google's dominance in search. The move also reflects the unintended consequences of widespread web scraping for AI training, impacting the availability of alternative search options. Reddit's stricter policies aim to prevent unauthorized use of its content for commercial purposes, emphasizing the importance of respecting terms and policies when accessing Reddit data. The situation underscores the evolving landscape of online search and the implications of exclusive data access agreements on internet accessibility and competition.

Related

Google rejected me and now I'm building a search engine

The article recounts a rejection from Google during an interview, prompting the individual to create a non-profit, community-driven search engine emphasizing ethical values over profit, welcoming contributions for development.

Google Search Ranks AI Spam Above Original Reporting in News Results

Google Search faces challenges as AI-generated spam surpasses original reporting in news results. Despite efforts to combat this issue, plagiarized articles with AI-generated illustrations dominate search rankings, raising concerns among SEO experts and original content creators.

Reddit has updated its robots.txt to block all web crawlers

Reddit updated its robots.txt file to block web crawlers, aiming to protect user privacy and prevent content misuse. This change impacts data access for entities like Google, potentially hindering legitimate research. CEO Steve Huffman emphasizes balancing data use costs. The effects on search engines and partnerships are uncertain.

Google Now Defaults to Not Indexing Your Content

Google has changed its indexing to prioritize unique, authoritative, and recognizable content. This selective approach may exclude smaller players, making visibility harder. Content creators face challenges adapting to Google's exclusive indexing, affecting search results.

'Google says I'm a dead physicist': is the biggest search engine broken?

Google faces scrutiny over search result accuracy and reliability, with concerns about incorrect information and cluttered interface. Despite dominance in the search market, criticisms persist regarding data privacy and search quality.

62 comments
By @popcalc - 6 months

  # Welcome to Reddit's robots.txt
  # Reddit believes in an open internet, but not the misuse of public content.
  # See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
  # See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
  # policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy

  User-agent: *
  Disallow: /
Source: https://www.reddit.com/robots.txt
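
As a minimal sketch of what this file means for a compliant crawler, Python's standard `urllib.robotparser` can evaluate the two directives quoted above. The user-agent names and the sample URL are illustrative assumptions, not taken from the file. The check only models crawlers that choose to honor robots.txt, and it says nothing about access arranged outside the file.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above from reddit.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative user-agent names and URL (assumptions, not from the file).
SAMPLE_URL = "https://www.reddit.com/r/programming/"
for agent in ("bingbot", "DuckDuckBot", "MojeekBot"):
    print(agent, is_allowed(agent, SAMPLE_URL))  # False for every agent
```

Because the only rule group is the `*` wildcard with `Disallow: /`, every path is off limits to any crawler that consults the file.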
By @arnaudsm - 6 months
I understand the AI context, but this is dangerously anticompetitive for other search engines.

This is a dangerous precedent for the internet. Business conglomerates have been controlling most of the web, but refusing basic interoperability is even worse.

By @onlyrealcuzzo - 6 months
This is an interesting development.

How many other sites might have leverage to charge to be indexed?

I don't want to live in a world where you have to use X search engine to get answers from Y site - but this seems like the beginning of that world.

From an efficiency perspective, it's obviously better for websites to just lease their data to search engines than for both sides to pay for tons of bandwidth and compute to get that data onto search engines.

Realistically, there are only 2 search engines now.

This seems very bad for Kagi, but could it possibly lead to the old, cool, hobbyist, un-monetized web being reinvented?

By @StrauXX - 6 months
IANAL, but as far as I understand the current legal status (in the US), a change in robots.txt or terms and conditions is not binding for web scrapers, since the data is publicly accessible. Nor does displaying a banner saying "By using this site you accept our terms and conditions" change anything about that. The only thing that can make these kinds of terms binding is if the data is only accessible after proactively accepting the terms, for instance by restricting the website until one has created an account. LinkedIn lost a case against a startup scraping and indexing their data because of that a few years ago.
By @wtf242 - 6 months
This problem is only going to get worse. For my thegreatestbooks.org site, I used to just get indexed/scraped by Google and Bing. Now it's like 50+ AI bots scraping my entire site just so they can train an LLM to answer the questions my site answers without a user ever visiting my site. I just checked Cloudflare, and in the past 24 hours I've had 1.2 million bot/automated requests.
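
For scale, the Cloudflare figure quoted above works out to roughly 14 requests per second when averaged over the day. A quick sketch of the arithmetic, assuming the 1.2 million requests were spread evenly over 24 hours (actual bot traffic may well be bursty):

```python
BOT_REQUESTS_PER_DAY = 1_200_000  # figure quoted in the comment above
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

# Average rate under the assumed even spread across the day.
requests_per_second = BOT_REQUESTS_PER_DAY / SECONDS_PER_DAY
print(f"{requests_per_second:.1f} requests/second")  # 13.9 requests/second
```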
By @jedberg - 6 months
They changed robots.txt a month or so ago. For the first 19 years of life, reddit had a very permissive robots.txt. We allowed all by default and then only restricted certain poorly behaved agents (and Bender's Shiny Metal Ass(tm))

But I can understand why they made the change they did. The data was being abused.

My guess is that this was an oversight -- that they will do an audit and reopen it for search engines after those engines agree not to use the data for training, because let's face it, reddit is a for profit business and they have to protect their income streams.

By @ykonstant - 6 months
It's ironic, because Reddit is the only search engine that works on Google now thanks to shittening.
By @daft_pink - 6 months
I don’t understand how this isn’t anti-competitive behavior. It seems like reddit has to offer this deal with similar terms to google’s competitors.
By @lmeyerov - 6 months
FWIW, we inquired to the reddit sales team about paying for data sometime last year, as we do similar elsewhere for use cases like helping emergency responders, and even though they were launching the program and asking for customers... no email back. Nor on our second and I think third attempt.

I'm not sure what to make of that.

By @dathinab - 6 months
Worse, it doesn't even really "work" anymore, given how most searches are flooded with garbage SEO results and paid advertisements that look "basically" like search results. Most of the time the ads are more garbage rather than what you are looking for; in the cases where they aren't, it quite often amounts to "Google's algorithm blackmailing companies into buying ads for users who want to find them through Google but wouldn't without ads".

I wonder if this might affect Reddit, as in slowly kill its user base, especially the users who provide (and often also look for) high-quality content, because which of those users would want to use Google search?

By @numbers - 6 months
"Information is power. But like all power, there are those who want to keep it for themselves. The world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations." - Aaron Swartz (2008)
By @1vuio0pswjnm7 - 6 months
"If you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn't rely on Google's indexing and search Reddit by using "site:reddit.com," you will not see any results from the last week."

The veracity of this statement is questionable.

I found at least four web search engines not using Google's index that produced results from the last week.

Example: Recent eruption at Yellowstone Black Diamond Pool

https://www.ecosia.org/search?method=index&q=site:reddit.com...

https://search.brave.com/search?q=reddit.com+black+diamond+p...

https://api.yep.com/fs/2/search?client=web&gl=all&no_correct...

   POST /sp/search HTTP/1.0
   host: www.startpage.com
   content-length: 74
   content-type: application/x-www-form-urlencoded

   query=site:reddit.com black diamond pool&abp=-1&t=&lui=english&sc=&cat=web
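
The form body in the raw request above can also be built with Python's standard `urllib.parse.urlencode`. This is only a sketch: the field names and values are copied from the request as posted, the `https` scheme is an assumption, the endpoint's behavior is not verified here, and nothing is sent over the network. Note that `urlencode` percent-encodes the colon and turns spaces into `+`, so the encoded body comes out at 76 bytes rather than the 74 in the header above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Field names and values copied from the raw request above.
FORM_FIELDS = {
    "query": "site:reddit.com black diamond pool",
    "abp": "-1",
    "t": "",
    "lui": "english",
    "sc": "",
    "cat": "web",
}


def build_search_request(fields: dict[str, str]) -> Request:
    """Build (but do not send) a form-encoded POST for the endpoint above."""
    body = urlencode(fields).encode("ascii")
    return Request(
        "https://www.startpage.com/sp/search",  # scheme is an assumption
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_search_request(FORM_FIELDS)
print(request.get_method(), request.full_url)
print(len(request.data), request.data.decode("ascii"))
```
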
At least for this example, I got the same desired result using Reddit site search.

https://old.reddit.com/search/?q=black+diamond+pool

If anyone has some good examples of search queries that I can test showing why a search engine must be used, please share.

By @r_singh - 6 months
I wonder how Aaron Swartz would react to this
By @voisin - 6 months
Makes sense that Google did this deal, since their search quality tanked and they became a de facto front-end UI for Reddit.
By @mutatio - 6 months
It's funny in the context of Google's past motto of "don't be evil". I feel the right thing for Google here would have been to decline any deal regarding exclusivity, then Reddit wouldn't have pulled the trigger with its robots.txt update. The entire manoeuvre required both parties.
By @roughly - 6 months
Boy, the LLMs have really been an apocalypse moment for the web, haven’t they? Between hoovering up and monetizing every bit of content they can without any attribution or compensation and the absolute flood of mediocre generated content, they’ve really done in the last straggling remains of the open internet.

It’s not like everyone wasn’t already pulling the same grift, but quantity really does have a quality all its own.

By @lifestyleguru - 6 months
I deeply regret every minute spent on and kilobyte of text contributed to reddit.
By @neilv - 6 months
I'm concerned about this in multiple ways, but I could also see some positive fallout from it, if it sets precedents that help protect 'content' owners from AI goldrush companies just taking everything.
By @PaulRobinson - 6 months
This is great. It means I won't see Reddit content popping up all over search results in other engines. Can Medium do the same? And perhaps Quora?
By @nullc - 6 months
It's weird to say that reddit "works" with google. Every page they serve to google is stuffed full of hidden unrelated content, so any reddit result in google is unlikely to actually contain what you were searching for.

Google really should blacklist reddit entirely for this practice, but sadly as bad as reddit is it's still a much higher quality result than average for google.

By @jumploops - 6 months
IIRC, GPT-2 was primarily trained on Reddit[0]

[0]https://www.reddit.com/r/ChatGPT/comments/133xgb5/gpt2_was_p...

By @ChrisArchitect - 6 months
Fine with this. This is the world OpenAI created, along with all the people who weirdly started searching with +Reddit tacked on about 5 years ago. Reddit is covering itself against internal user concern and its general exposure to AI training, and Google was smart enough to get on that quickly. We'll see what Bing's take is and what changes, if anything, now that 404 Media's outrage farming is at play. This isn't a recent change after all; a month ago?
By @nomilk - 6 months
If a crawler or rival search engine doesn't respect robots.txt, Reddit can't stop them. It can make things a bit trickier, yes, but not stop them.
By @r_singh - 6 months
Thinking from Reddit's perspective, they have nothing to lose really. It's not like other search engines are going to pay any attention to the robots.txt, and Google's AI would still have scraped data from Reddit regardless of the deal. Now they will possibly just feel less bad about not citing sources, depending on the user experience they want to deliver.
By @debacle - 6 months
Reddit has been ripe for disruption for years. It's just waiting on an inflection point and someone to take it behind the barn.
By @myrandomcomment - 6 months
So I went Slashdot, Digg, Reddit. I stopped spending any time on Reddit 5 years ago. Not worth it.
By @cyanydeez - 6 months
Work is such a flimsy word for what Google currently does with search.

As soon as someone shows me a search engine that restores quality of search, I'm getting a subscription for work.

It really can't be hard to whitelist sources and index appropriately.

Get going, nerds; Google has fallen.

By @thih9 - 6 months
Story / rant warning.

I remember seeing an unhelpful hyperlink for the first time. It was a random word in the body of a random tech site that redirected to a list of articles from that site tagged with that term.

I remember being stunned, my expectation was that the link would lead me to another website, one that would be an authoritative source on that term and freely accessible.

20 years later we get a paywalled article about a fragmented web – and we're not slowing down.

By @blackeyeblitzar - 6 months
We need laws that make it so that giant platforms like Reddit have no exclusive rights to content submitted by users. It would be ridiculous for only Google to be able to train AI on YouTube or Reddit content for example.
By @causal - 6 months
It feels like Reddit is approaching an inflection point anyway where bot-made content is concentrated enough to spoil the whole experience. Closed servers like Discord and Slack may be the last haven of online human interaction.
By @ozgrakkurt - 6 months
Stopped using Reddit after they hindered login-less viewing and blocked VPNs. Everyone who respects themselves should start moving away from it, imho. Same thing with Google.
By @manishsharan - 6 months
For my use cases, Google is pretty much useless without Reddit.

For example, when I search for product reviews, I always specify reddit. Otherwise the search results are inundated with SEO spam.

By @tempfile - 6 months
Hopefully this paves the way for antitrust action, but I won't hold my breath.

Reddit's justification for this is profoundly wrong. Their "public content policy" is absurd doublespeak, and counter to everything the open internet is and hopes to be. You cannot simultaneously call yourself "open" and "public" while refusing access to automated clients. Every client is automated. They even go so far as to say that "crawling" (also known as "downloading") is an "abuse" and violates user privacy.

This is absurd, and not justified. I would love to see legislation that restricted server operators' ability to prohibit automated access in this way, but I suppose it will never happen. Some people in this thread have attempted to justify the policy by saying "they have to protect their income streams". No they don't. You don't have a right to an income stream, and you certainly don't have a right to lie in order to get all the benefits of an open internet with none of the downsides. Noting of course that the "downsides" are in this case actually just "competitors".

By @lowbloodsugar - 6 months
Funny that source of TFA blocked me from reading the whole thing.
By @earthboundkid - 6 months
They literally think the scissor statement is a real thing that will really work, fml.
By @dvngnt_ - 6 months
site:reddit.com works for kagi for new posts this week?
By @Elfener - 6 months
I mean, the reddit company did go public, so things like this were inevitable.

Also things like the API fiasco, and also small annoyances like the fact that when you click on an image on reddit, it now goes to a wrapper html page instead of just the actual image (this was one of the reasons reddit was better than most social media...).

By @melodyogonna - 6 months
Wait that's actually terrible.
By @VoidWhisperer - 6 months
Wow, reddit found a way to make themselves even less useful somehow. After the API fiasco, that seemed like it'd be pretty hard to do.
By @ein0p - 6 months
Good for other search engines, I suppose. Reddit is a giant toxic pile of bovine manure.
By @nerfbatplz - 6 months
I propose we change the term enshittification to engoogleification with regard to the internet.
By @venkat223 - 6 months
Google is selfish
By @mediumsmart - 6 months
That is awesome, but I can't open old.reddit.com in my browser, so it's a non-issue.
By @dakial1 - 6 months
And now let's watch white, grey, and black hat SEO destroy Reddit even more.
By @dbg31415 - 6 months
Every time I think, “How scummy…” Reddit always finds another way to go lower.
By @Khelavaster - 6 months
robots.txt isn't legally binding. Can Reddit really force Bing not to crawl it..?
By @bitpush - 6 months
When Microsoft strikes an exclusive deal with OpenAI to use their models, it is a smart, brilliant, clever move.

When Apple strikes an exclusive deal with suppliers for parts, it is sound business practice.

When Google strikes an exclusive deal with Reddit, it is ..

Some of you have no idea how businesses work, and it shows.