Show HN: Free tool to find RSS feeds, even if not linked on the page
A new tool at lighthouseapp.io helps users find RSS feeds, even if unlinked, by checking meta tags, common suffixes, sitemaps, and third-party feeds, with future enhancements planned.
A new tool has been developed to help users find RSS feeds for websites, available at lighthouseapp.io. The tool aims to identify feeds even if they are not directly linked on the site, addressing the limitations of traditional methods that rely on checking meta tags. While it currently succeeds in over 90% of cases using standard techniques, the tool's goal is to ensure that if it does not find a feed, then none exists. It performs several functions to achieve this, including checking the meta tags of parent pages, looking for common feed suffixes like /rss and /index.xml, examining the sitemap, analyzing all links on the page, and checking third-party feeds such as OpenRSS. Future enhancements may include searching through search engines and crawling entire domains, although the latter may be inefficient. Users are encouraged to test the tool and report any sites where it fails to find feeds.
- A new tool for finding RSS feeds is available at lighthouseapp.io.
- The tool aims to identify feeds even if they are not linked on the website.
- It checks meta tags, common feed suffixes, sitemaps, and third-party feeds.
- Future improvements may include search engine checks and domain crawling.
- Users are invited to provide feedback on sites where the tool does not work.
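The first two checks in that list (reading feed `<link>` tags from a page's head, then guessing common suffixes) can be sketched in a few lines of Python. This is a hedged illustration of the general technique, not the tool's actual code: `feeds_from_html` and `candidate_urls` are made-up names, and a real crawler would also fetch each candidate URL and verify that it parses as RSS, Atom, or JSON Feed before reporting it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types that advertise a feed via <link rel="alternate">
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/feed+json"}
# Well-known feed locations to probe when no meta tag is present
COMMON_SUFFIXES = ("/rss", "/rss.xml", "/feed", "/atom.xml", "/index.xml")

class _FeedLinkParser(HTMLParser):
    """Collects hrefs of <link rel="alternate"> tags with a feed MIME type."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {name: (value or "") for name, value in attrs}
        href = a.get("href")
        if href and "alternate" in a.get("rel", "").lower() and a.get("type") in FEED_TYPES:
            self.hrefs.append(href)

def feeds_from_html(html, base_url):
    """Return absolute feed URLs advertised in the page's meta tags."""
    parser = _FeedLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]

def candidate_urls(base_url):
    """Root-relative guesses a crawler would fetch and sniff for feed content."""
    return [urljoin(base_url, suffix) for suffix in COMMON_SUFFIXES]
```

For example, `candidate_urls("https://example.com/blog/post")` yields domain-root guesses like `https://example.com/rss` and `https://example.com/feed`, matching the suffix list described above.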
Related
How I scraped 6 years of Reddit posts in JSON
The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. Pushshift is suggested for Reddit archives. Extracting URLs and checking website status are explained. Findings reveal 40% of sites inactive. Trends in online startups are discussed.
Two months of feed reader behavior analysis
An analysis of feed reader behavior revealed significant request handling patterns, with some applications like Netvibes and NextCloud-News facing caching issues, while others like Miniflux performed better.
13ft – A site similar to 12ft.io but is self hosted
The 13 Feet Ladder project is a self-hosted server that bypasses paywalls and ads, allowing access to restricted content from sites like Medium and The New York Times.
Full Text, Full Archive RSS Feeds for Any Blog
The blog post highlights limitations of RSS and ATOM feeds in cyber threat intelligence, introducing history4feed software to create historical archives and retrieve full articles for comprehensive data access.
Show HN: I'm making an AI scraper called FetchFox
FetchFox is an AI-powered Chrome extension that allows users to scrape data from websites by describing their needs in plain English, bypassing anti-scraping measures, and exporting results in CSV format.
If you maintain any website with a news feed, go right now and check that you have this in your <head>:
<link rel="alternate" type="application/rss+xml" href="/rss.xml" title="News feed" />
^^^^^^^^ change! ^^^^^^^^^
(Also note whether and where you need to use application/rss+xml, application/atom+xml, or application/json.) I suspect some sites are just running some framework that enables it and don't even realize they have one.
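If you maintain a site and want to verify that the tag is there, a minimal check might look like the following. This is a sketch under assumptions: `advertises_feed` is an illustrative helper, not part of any library, and it does a rough regex scan rather than full HTML parsing (the regex accepts `application/feed+json`, the registered JSON Feed type, alongside the RSS and Atom types).

```python
import re

# MIME types that mark a <link> as a feed (rss, atom, or JSON Feed)
FEED_TYPE = re.compile(r"application/(?:rss\+xml|atom\+xml|feed\+json)", re.I)

def advertises_feed(html: str) -> bool:
    """True if any <link> tag declares rel=alternate with a feed MIME type."""
    for tag in re.findall(r"<link\b[^>]*>", html, re.I):
        if "alternate" in tag.lower() and FEED_TYPE.search(tag):
            return True
    return False
```

Fetch your own homepage (e.g. with `curl`) and run its HTML through this to see whether feed readers can autodiscover your feed.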
I have used this site in the past to find feeds: https://www.rsssearchhub.com/
A while back I was looking for a feed for https://ra.co but could not find one, though I had seen old posts referencing an RSS feed.
I ended up emailing them and, to my delight, they let me know they still have an unsupported RSS feed here:
https://ra.co/xml/rss_news.xml
Just for feedback, this tool doesn't find the feed, though it doesn't look like a standard URL to me.
>Mozilla is working on alternatives such as Pocket or Reader Mode, and on improving WebExtensions which could provide features related to RSS/Atom feeds without the toll on maintenance. (ref: https://www.ghacks.net/2018/07/25/mozilla-plans-to-remove-rs...)
Ok then.
Also, this would make more sense as a browser extension. Especially if it brought back the RSS icon in the address bar to indicate when a feed is available (although maybe you don't want it to do all of the checks until prompted).
TypeError: URL constructor: is not a valid URL. [NextJS] (5603-cb6f1c5a9761f9d0.js:14:5466)
Browser is Firefox 130.0 on Windows.
It would be really nice to see this working well, since I search for RSS feeds a lot for a bunch of different things. Whether the feed itself is any good is always another question.
Your method described above should have found at least two feeds I think.
I always use RSSHub Radar. Your tool supports more websites than RSSHub Radar does.
Detection of /feed could be added; most WordPress-powered sites have this suffix.
Wondering if it's necessary to continue with the other checks if you find a feed in the meta tags?