Full Text, Full Archive RSS Feeds for Any Blog
The blog post highlights limitations of RSS and Atom feeds in cyber threat intelligence and introduces history4feed, software that creates historical archives and retrieves full articles for comprehensive data access.
The blog post discusses the limitations of RSS and Atom feeds, particularly in the context of cyber threat intelligence research. It highlights two main issues: the lack of historical data in feeds, which typically display only a limited number of recent posts, and the partial content provided, which often requires users to visit the original blog for full articles. To address these challenges, the author has developed open-source software called history4feed, which allows users to build a complete historical archive of blog posts by scraping content and utilizing the Wayback Machine. The software can retrieve the full text of articles and reconstruct a comprehensive feed, enabling researchers to access both current and historical data effectively. The post also provides a practical example of how to use the tool to gather and manage feeds, emphasizing its utility for those interested in cyber threat intelligence.
- RSS and Atom feeds often lack historical data and provide only partial content.
- The author developed history4feed to create complete historical archives of blog posts.
- The software can scrape content and utilize the Wayback Machine for comprehensive data retrieval (see the sketch after this list).
- history4feed allows users to access both current and historical cyber threat intelligence research.
- Practical examples are provided for using the tool to manage and retrieve blog feeds.
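To give a rough idea of the Wayback Machine step mentioned in the points above, here is a minimal Python sketch. It is not history4feed's actual code: the feed URL is a placeholder, and the requests and feedparser libraries are assumed. It lists every archived snapshot of a feed via the CDX API, parses each snapshot, and de-duplicates entries by link to rebuild the feed's history.

    import requests, feedparser

    FEED_URL = "https://example.com/feed.xml"  # placeholder feed URL

    # List every archived snapshot of the feed via the Wayback Machine CDX API.
    rows = requests.get(
        "https://web.archive.org/cdx/search/cdx",
        params={"url": FEED_URL, "output": "json", "filter": "statuscode:200"},
        timeout=30,
    ).json()

    # The first row of the CDX JSON output is the header; the rest are snapshots.
    entries = {}
    for row in rows[1:]:
        timestamp, original = row[1], row[2]
        snapshot_url = f"https://web.archive.org/web/{timestamp}/{original}"
        for entry in feedparser.parse(snapshot_url).entries:
            entries.setdefault(entry.link, entry)  # de-duplicate posts by link

    print(f"Recovered {len(entries)} unique posts across all snapshots")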
Related
Surfing the (Human-Made) Internet
The internet's evolution prompts a return to its human side, advocating for personal sites, niche content, and self-hosted platforms. Strategies include exploring blogrolls, creating link directories, and using alternative search engines. Embrace decentralized social media and RSS feeds for enriched online experiences.
How I scraped 6 years of Reddit posts in JSON
The article covers scraping 6 years of Reddit posts for self-promotion data, highlighting challenges like post limits and cutoffs. Pushshift is suggested for Reddit archives. Extracting URLs and checking website status are explained. Findings reveal 40% of sites inactive. Trends in online startups are discussed.
Archiving and Syndicating Mastodon Posts
The article details archiving Mastodon posts to a personal website using the PESOS model, emphasizing online presence, automation, and content organization through a custom tool developed in Go.
Two months of feed reader behavior analysis
An analysis of feed reader behavior revealed significant request handling patterns, with some applications like Netvibes and NextCloud-News facing caching issues, while others like Miniflux performed better.
NetNewsWire and Conditional Get Issues
Brent Simmons addresses bugs in NetNewsWire's conditional GET support, revealing issues with feed data processing. He suggests improved logic for updates and stresses the need for further testing to ensure reliability.
This is not to say that this is a good idea or a bad one, but I think you will, long-term, have better luck if people don’t feel their content is being siphoned.
A great case-in-point is what my friends at 404 Media did: https://www.404media.co/why-404-media-needs-your-email-addre...
They saw that a lot of their content was just getting scraped by random AI sites, so they put up a regwall to try to limit that as much as possible. But readers wanted access to full-text RSS feeds, so they went out of their way to create a full-text RSS offering for subscribers with a degree of security so it couldn’t be siphoned.
I do not think this tool was created in bad faith, and I hope that my comment is not seen as being in bad faith, but: you will build better relationships with the writers whose work you share if you ask rather than just take. They may have reasons you are not aware of for not having RSS feeds. For example, I don't want my content distributed in audio format, because I want to leave that option open for myself.
People should have a say in how their content is distributed. I worry what happens when you take those choices away from publishers.
> 1. [limited history of posts]
> 2. [partial content]
To work around limitation 1 in some cases, perhaps the author could rely on sitemaps [1], a feature present on many sites (like RSS feeds) that lists all the pages published.
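As a rough illustration of that suggestion, here is a minimal Python sketch under a few assumptions: the sitemap URL is a placeholder (many sites advertise the real location in robots.txt), and the requests library is used for fetching. It walks a sitemap index recursively and yields every listed page URL.

    import requests
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def sitemap_urls(sitemap_url):
        """Yield every page URL in a sitemap, following nested sitemap indexes."""
        root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
        if root.tag.endswith("sitemapindex"):
            for loc in root.findall("sm:sitemap/sm:loc", NS):
                yield from sitemap_urls(loc.text)
        else:
            for loc in root.findall("sm:url/sm:loc", NS):
                yield loc.text

    # Placeholder location; check robots.txt for the site's real sitemap URL.
    for page in sitemap_urls("https://example.com/sitemap.xml"):
        print(page)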
Written in Django. I can always go back and parse the saved data; if a web page is not available, I fall back to the Internet Archive (see the sketch after the links below).
- https://github.com/rumca-js/Django-link-archive - RSS reader / web scraper
- https://github.com/rumca-js/RSS-Link-Database - bookmarks I found interesting
- https://github.com/rumca-js/RSS-Link-Database-2024 - every day storage
- https://github.com/rumca-js/Internet-Places-Database - internet domains found on the internet
After creating a Python package for web communication that replaces requests for me (and sometimes uses Selenium), I also wrote a CLI interface for reading RSS sources from the command line: https://github.com/rumca-js/yafr
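For the Internet Archive fallback mentioned above, a minimal sketch of the general pattern (not the author's actual code; the helper name is hypothetical) using the Wayback Machine availability API could look like this:

    import requests

    def fetch_with_archive_fallback(url):
        """Return page HTML, falling back to the closest Wayback Machine snapshot."""
        try:
            resp = requests.get(url, timeout=30)
            if resp.ok:
                return resp.text
        except requests.RequestException:
            pass  # treat network errors the same as an unavailable page
        # Ask the Wayback Machine availability API for the closest snapshot.
        avail = requests.get("https://archive.org/wayback/available",
                             params={"url": url}, timeout=30).json()
        closest = avail.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return requests.get(closest["url"], timeout=30).text
        return None  # not archived either

Keeping the fallback behind a single helper means the rest of the scraper does not need to care whether content came from the live site or the archive.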
None of those are problems with RSS or Atom¹ feeds. There’s no technical limitation to having the full history and full post content in the feeds. Many feeds behave that way due to a choice by the author or as the default behaviour of the blogging platform. Both have reasons to be: saving bandwidth² and driving traffic to the site³.
Which is not to say what you just made doesn't have value. It does, and kudos for making it. But twice at the top of your post you make it sound as if those are problems inherent to the format when they're not. They're not even problems for most people in most situations; you just bumped into a very specific use case.
¹ It’s not an acronym, it shouldn’t be all uppercase.
² Many feed readers misbehave and download the whole thing instead of checking ETags (see the conditional GET sketch after these notes).
³ To show ads or something else.
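For reference, the well-behaved pattern footnote 2 alludes to is a conditional GET: send back the ETag / Last-Modified values from the previous fetch so the server can answer 304 Not Modified instead of resending the whole feed. A minimal generic sketch (the poll_feed helper is hypothetical; persisting the validators between polls is left out):

    import requests

    def poll_feed(url, etag=None, last_modified=None):
        """Fetch a feed only if it changed since the last poll."""
        headers = {}
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:
            return None, etag, last_modified  # unchanged, nothing downloaded
        return resp.text, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

A reader that stores the returned validators and replays them on the next poll downloads the full feed only when it has actually changed.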
RSS was invented in 1999, 6 years before git!
Now we have git and should just be "git cloning" blogs you like, rather than subscribing to RSS feeds.
I still have RSS feeds on all my blogs for back-compat, but git clone is way better.