Full Text, Full Archive RSS Feeds for Any Blog
The blog post highlights limitations of RSS and Atom feeds in cyber threat intelligence and introduces history4feed, software that creates historical archives and retrieves full articles for comprehensive data access.
The blog post discusses the limitations of RSS and Atom feeds, particularly in the context of cyber threat intelligence research. It highlights two main issues: the lack of historical data in feeds, which typically display only a limited number of recent posts, and the partial content provided, which often requires users to visit the original blog for full articles. To address these challenges, the author has developed open-source software called history4feed, which allows users to build a complete historical archive of blog posts by scraping content and utilizing the Wayback Machine. The software can retrieve the full text of articles and reconstruct a comprehensive feed, enabling researchers to access both current and historical data effectively. The post also provides a practical example of how to use the tool to gather and manage feeds, emphasizing its utility for those interested in cyber threat intelligence.
- RSS and Atom feeds often lack historical data and provide only partial content.
- The author developed history4feed to create complete historical archives of blog posts.
- The software can scrape content and utilize the Wayback Machine for comprehensive data retrieval (see the sketch after this list).
- history4feed allows users to access both current and historical cyber threat intelligence research.
- Practical examples are provided for using the tool to manage and retrieve blog feeds.
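To give a rough idea of the Wayback Machine step mentioned in the points above, here is a minimal Python sketch. It is not history4feed's actual code: the feed URL is a placeholder, and the requests and feedparser libraries are assumed. It lists every archived snapshot of a feed via the CDX API, parses each snapshot, and de-duplicates entries by link to rebuild the feed's history.

    import requests, feedparser

    FEED_URL = "https://example.com/feed.xml"  # placeholder feed URL

    # List every archived snapshot of the feed via the Wayback Machine CDX API.
    rows = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": FEED_URL, "output": "json", "filter": "statuscode:200"},
        timeout=30,
    ).json()

    # The first row of the CDX JSON output is the header; the rest are snapshots.
    entries = {}
    for row in rows[1:]:
        timestamp, original = row[1], row[2]
        snapshot_url = f"https://web.archive.org/web/{timestamp}/{original}"
        for entry in feedparser.parse(snapshot_url).entries:
            entries.setdefault(entry.link, entry)  # de-duplicate posts by link

    print(f"Recovered {len(entries)} unique posts across all snapshots")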
Related
Surfing the (Human-Made) Internet
The internet's evolution prompts a return to its human side, advocating for personal sites, niche content, and self-hosted platforms. Strategies include exploring blogrolls, creating link directories, and using alternative search engines. Embrace decentralized social media and RSS feeds for enriched online experiences.
How I scraped 6 years of Reddit posts in JSON
The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. Pushshift is suggested for Reddit archives. Extracting URLs and checking website status are explained. Findings reveal 40% of sites inactive. Trends in online startups are discussed.
Archiving and Syndicating Mastodon Posts
The article details archiving Mastodon posts to a personal website using the PESOS model, emphasizing online presence, automation, and content organization through a custom tool developed in Go.
Two months of feed reader behavior analysis
An analysis of feed reader behavior revealed significant request handling patterns, with some applications like Netvibes and NextCloud-News facing caching issues, while others like Miniflux performed better.
NetNewsWire and Conditional Get Issues
Brent Simmons addresses bugs in NetNewsWire's conditional GET support, revealing issues with feed data processing. He suggests improved logic for updates and stresses the need for further testing to ensure reliability.
This is not to say that this is a good idea or a bad one, but I think you will, long-term, have better luck if people don’t feel their content is being siphoned.
A great case-in-point is what my friends at 404 Media did: https://www.404media.co/why-404-media-needs-your-email-addre...
They saw that a lot of their content was just getting scraped by random AI sites, so they put up a regwall to try to limit that as much as possible. But readers wanted access to full-text RSS feeds, so they went out of their way to create a full-text RSS offering for subscribers with a degree of security so it couldn’t be siphoned.
I do not think this tool was created in bad faith, and I hope that my comment is not seen as being in bad faith, but: you will build better relationships with the writers whose work you share if you ask rather than just take. They may have reasons you are not aware of for not having RSS feeds. For example, I don't want my content distributed in audio format, because I want to leave that option open for myself.
People should have a say in how their content is distributed. I worry what happens when you take those choices away from publishers.
> 1. [limited history of posts]
> 2. [partial content]
To work around limitation 1 in some cases, perhaps the author could rely on sitemaps [1], a feature present on many sites (like RSS feeds) that lists all the pages published.
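As a rough illustration of that suggestion, here is a minimal Python sketch under a few assumptions: the sitemap URL is a placeholder (many sites advertise the real location in robots.txt), and the requests library is used for fetching. It walks a sitemap index recursively and yields every listed page URL.

    import requests
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def sitemap_urls(sitemap_url):
        """Yield every page URL in a sitemap, following nested sitemap indexes."""
        root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
        if root.tag.endswith("sitemapindex"):
            for loc in root.findall("sm:sitemap/sm:loc", NS):
                yield from sitemap_urls(loc.text)
        else:
            for loc in root.findall("sm:url/sm:loc", NS):
                yield loc.text

    # Placeholder location; check robots.txt for the site's real sitemap URL.
    for page in sitemap_urls("https://example.com/sitemap.xml"):
        print(page)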
Written in Django. I can always go back and parse the saved data; if a web page is not available, I fall back to the Internet Archive (see the sketch after the links below).
- https://github.com/rumca-js/Django-link-archive - RSS reader / web scraper
- https://github.com/rumca-js/RSS-Link-Database - bookmarks I found interesting
- https://github.com/rumca-js/RSS-Link-Database-2024 - every day storage
- https://github.com/rumca-js/Internet-Places-Database - internet domains found on the internet
After creating a Python package for web communication that replaces requests for me (and sometimes uses Selenium), I also wrote a CLI interface for reading RSS sources from the command line: https://github.com/rumca-js/yafr
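For the Internet Archive fallback mentioned above, a minimal sketch of the general pattern (not the author's actual code; the helper name is hypothetical) using the Wayback Machine availability API could look like this:

    import requests

    def fetch_with_archive_fallback(url):
        """Return page HTML, falling back to the closest Wayback Machine snapshot."""
        try:
            resp = requests.get(url, timeout=30)
            if resp.ok:
                return resp.text
        except requests.RequestException:
            pass  # treat network errors the same as an unavailable page
        # Ask the Wayback Machine availability API for the closest snapshot.
        avail = requests.get("https://archive.org/wayback/available",
                             params={"url": url}, timeout=30).json()
        closest = avail.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return requests.get(closest["url"], timeout=30).text
        return None  # not archived either

Keeping the fallback behind a single helper means the rest of the scraper does not need to care whether content came from the live site or the archive.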
None of those are problems with RSS or Atom¹ feeds. There’s no technical limitation to having the full history and full post content in the feeds. Many feeds behave that way due to a choice by the author or as the default behaviour of the blogging platform. Both have reasons to be: saving bandwidth² and driving traffic to the site³.
Which is not to say what you just made doesn't have value. It does, and kudos for making it. But twice at the top of your post you make it sound as if those are problems inherent to the format when they're not. They're not even problems for most people in most situations; you just bumped into a very specific use case.
¹ It’s not an acronym, it shouldn’t be all uppercase.
² Many feed readers misbehave and download the whole thing instead of checking ETags (see the conditional GET sketch after these notes).
³ To show ads or something else.
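For reference, the well-behaved pattern footnote 2 alludes to is a conditional GET: send back the ETag / Last-Modified values from the previous fetch so the server can answer 304 Not Modified instead of resending the whole feed. A minimal generic sketch (the poll_feed helper is hypothetical; persisting the validators between polls is left out):

    import requests

    def poll_feed(url, etag=None, last_modified=None):
        """Fetch a feed only if it changed since the last poll."""
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:
            return None, etag, last_modified  # unchanged, nothing downloaded
        return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

A reader that stores the returned validators and replays them on the next poll downloads the full feed only when it has actually changed.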
RSS was invented in 1999, 6 years before git!
Now we have git and should just be "git cloning" blogs you like, rather than subscribing to RSS feeds.
I still have RSS feeds on all my blogs for back-compat, but git clone is way better.