August 23rd, 2024

Surfer: Centralize all your personal data from online platforms

Surfer centralizes personal data from various online platforms by scraping and exporting it to local storage. It is available for download, with community support through Discord and a roadmap for future enhancements.

Surfer is a project designed to centralize personal data from various online platforms into a single folder, addressing the challenge of scattered data. The application functions by navigating to websites, checking user sign-in status, and scraping data for export to local storage. Users initiate the process by clicking an "Export" button, after which the app waits for the target page to load, verifies if the user is signed in, and then proceeds to scrape and export the data. Surfer can be downloaded from its official website or GitHub releases page, with guidelines available for local setup and contributions. The project has a roadmap that includes short-term goals like obtaining a code signing certificate and expanding platform support, as well as long-term objectives such as implementing concurrent scraping and integrating with advanced AI frameworks. Surfer is distributed under the MIT License, and community support is available through a Discord server. A demo video showcasing the application is also accessible on YouTube.
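
The described flow is: open the target page, wait for it to load, verify the user is signed in, then scrape and export locally. As a rough illustration only (not Surfer's actual code), here is a minimal sketch of that flow using Playwright; the URL, selectors, and output folder are placeholders:

```python
# Minimal sketch of the export flow described above, using Playwright.
# The URL, selectors, and output path are illustrative assumptions,
# not Surfer's actual implementation.
import json
from pathlib import Path
from playwright.sync_api import sync_playwright

EXPORT_DIR = Path.home() / "surfer-exports"  # hypothetical local folder

def export_bookmarks() -> None:
    EXPORT_DIR.mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://example-platform.com/bookmarks")  # placeholder target page
        page.wait_for_load_state("networkidle")              # wait for the page to load

        # Verify the user is signed in before scraping (selector is an assumption).
        if page.locator("#account-menu").count() == 0:
            raise RuntimeError("Not signed in; please log in first.")

        # Scrape the items and export them to local storage as JSON.
        items = page.locator(".bookmark-item").all_inner_texts()
        (EXPORT_DIR / "bookmarks.json").write_text(json.dumps(items, indent=2))
        browser.close()

if __name__ == "__main__":
    export_bookmarks()
```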

- Surfer centralizes personal data from multiple online platforms.

- The application scrapes user data and exports it to local storage.

- It is available for download from its website and GitHub.

- The project aims to expand platform support and enhance functionality.

- Community interaction is facilitated through a Discord server.

AI: What people are saying
The comments on the article about Surfer reveal a mix of skepticism and interest regarding the tool's capabilities and implications for data privacy.
  • Several commenters point out that Surfer is not the first tool of its kind, with existing alternatives like DogSheep already available.
  • There are concerns about the centralization of personal data, with some viewing it as a potential privacy risk.
  • Users express a desire for more extensive platform support and customization options, such as CLI tools or integrations with existing systems.
  • Feedback from a contributor indicates ongoing development and openness to community input.
  • Some users highlight the technical challenges of maintaining scrapers due to frequent changes in platform APIs.
16 comments
By @markjgx - 5 months
"Surfer: The World's First Digital Footprint Exporter" is dubious—it's clearly not the first. Kicking off with such a bold claim while only supporting seven major platforms? A scraper like this is only valuable if it has hundreds of integrations; the more niche, the better. The idea is great, but this needs a lot more time in the oven.

I would prefer a CLI tool with partial gather support: something that I could easily set up to run on a cheap instance somewhere, have it scrape all my data continuously at set intervals, and then give me the data in the most readable format possible through an easy-access path. I've been thinking of making something like that, but with https://github.com/microsoft/graphrag at the center of it: a continuously rebuilt GraphRAG of all your data.
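
A rough sketch of that idea might look like the following; the platform names, gather functions, and interval are hypothetical, and the GraphRAG rebuild is left as a comment:

```python
# Hypothetical sketch of the commenter's idea: a small CLI loop that
# periodically gathers data per platform and writes timestamped JSON dumps.
# The platform names and gather functions are placeholders.
import json
import time
from datetime import datetime, timezone
from pathlib import Path

DATA_DIR = Path.home() / "footprint"
INTERVAL_SECONDS = 6 * 60 * 60  # run every six hours

def gather_example_platform() -> list[dict]:
    # Placeholder: a real gatherer would call an API or scrape a signed-in session.
    return [{"kind": "bookmark", "title": "example"}]

GATHERERS = {"example-platform": gather_example_platform}

def run_once() -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for name, gather in GATHERERS.items():
        out = DATA_DIR / name / f"{stamp}.json"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(gather(), indent=2))
        # A GraphRAG index rebuild over DATA_DIR could be triggered here.

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(INTERVAL_SECONDS)
```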

By @MattJ100 - 5 months
Definitely not the first such scraper. DogSheep has been around for a while: https://dogsheep.github.io/

It is based around SQLite rather than Supabase (Postgres); I think SQLite is the better choice for preservation/archival purposes.
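
To illustrate that point, here is a minimal sketch using the sqlite-utils library (which the Dogsheep tools build on) showing how scraped records land in a single portable SQLite file; the table name and records are made up:

```python
# Minimal sketch of the SQLite-first approach used by the Dogsheep tools,
# via the sqlite-utils library. Table name and records are illustrative.
import sqlite_utils

db = sqlite_utils.Database("archive.db")  # a single portable file on disk

records = [
    {"id": 1, "platform": "example", "text": "first saved item"},
    {"id": 2, "platform": "example", "text": "second saved item"},
]

# Upsert keeps re-runs idempotent; the whole archive stays one file,
# which is easy to back up, copy, and query years later.
db["items"].upsert_all(records, pk="id")

print(list(db["items"].rows))
```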

By @Carrok - 5 months
No list of supported platforms. No example of what the extracted data looks like. No examples of what can be done with the extracted data.
By @hi-v-rocknroll - 5 months
The answer to online platforms trafficking in personal data and metadata is two parallel, concurrent efforts:

1. Much tougher data privacy regulations (needed per country)

2. A trusted, international nonprofit clearinghouse and privacy grants/permissions repository that holds basic personal details and provides a single place to update name, address(es), email, etc., which companies then use on demand only (no storage)

Together, these would simplify things greatly for people and let anyone audit what every company knows about them and what it is allowed to know, and revoke permissions for companies they don't agree with. One of the worst cases is the US, where personal information is not owned by the individual, can be traded for profit, and is subject to almost no control unless it is health related.

By @doctorpangloss - 5 months
The most exciting thing to happen to programming is the chatbot enabling millions of enthusiastic people to write code.
By @Xen9 - 5 months
A browser addon that takes one's password manager export and deletes every account, possibly after scraping the data, would be amazing. No one has done it, and it could be built so that the system eventually safely deletes every account on every site (e.g. using developer tools, accessibility options, being intended only for use in a fresh browser install, sourcing information from volunteers). You'd have a sheet tracking the process with states like verification pending, manual intervention pending, deleted, and waiting. Many humans have hundreds of accounts they no longer use, and this sort of tool could thus be a good Y Combinator or hobby project.
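
A sketch of that tracking sheet could be as simple as the following data model; the states come from the comment, while the field names and types are assumptions:

```python
# Hypothetical data model for the deletion-tracking sheet described above.
# The states mirror the ones listed in the comment; fields are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class DeletionState(Enum):
    WAITING = "waiting"
    VERIFICATION_PENDING = "verification pending"
    MANUAL_INTERVENTION_PENDING = "manual intervention pending"
    DELETED = "deleted"

@dataclass
class AccountRecord:
    site: str
    username: str
    state: DeletionState = DeletionState.WAITING
    scraped: bool = False  # was the data exported before deletion?
    notes: list[str] = field(default_factory=list)
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: one entry moving through the workflow.
acct = AccountRecord(site="example.com", username="old-account")
acct.state = DeletionState.VERIFICATION_PENDING
acct.notes.append("Confirmation email sent; waiting on link.")
```
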
By @zamubafoo - 5 months
I made something like this since I was tired of the asymmetric nature of data collection that happens on the Internet. It's still not where I would like it to be, but it's been really nice being able to treat my browsing history as any old log that I can query over. Tools like dogsheep are nice, but they tend to rely on the platform allowing data to be taken out. This bypasses those limits by just doing it on the client.

This lets me create dashboards to see usage for certain topics. For example, I have a "Dev Browser" which tracks the latest sites I've visited that are related to development topics [1]. I similarly have a few for all the online reading I do. One for blogs, one for fanfiction, and one for webfiction in general.

I've talked about my first iteration before on here [2].

My second iteration ended up with a userscript that sends data about the sites I visit to a Vector instance (no affiliation; [3]). Vector is in there because for certain sites (i.e. those behind a draconian Cloudflare configuration), I want to save a local copy of the site. Vector can pop that field off, save it to a local MinIO instance, and at the same time push the rest of the record to something like Grafana Loki and Postgres, all while being very fast.

I've started looking into a third iteration utilizing mitmproxy. It helps a lot with saving local copies since that happens outside of the browser, so I don't feel the hitch when a page is inordinately heavy for whatever reason. It is also very nice that it would work with all browsers just by setting a proxy, which means I could set it up for my phone either as a normal proxy or as a WireGuard "transparent" proxy. I only need to set up certificates for it to work.

---

[1] https://raw.githubusercontent.com/zamu-flowerpot/zamu-flower...

[2] https://news.ycombinator.com/item?id=31429221

[3] http://vector.dev
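
For the mitmproxy iteration, a minimal addon along these lines would save a local copy of every HTML response outside the browser; the archive path and content-type filtering here are assumptions, not the commenter's setup:

```python
# Minimal mitmproxy addon sketch for the third iteration described above:
# archive a local copy of every HTML response, outside the browser.
# Run with: mitmdump -s archive_addon.py   (archive path is an assumption)
import hashlib
from pathlib import Path
from mitmproxy import http

ARCHIVE_DIR = Path.home() / "web-archive"

class Archiver:
    def response(self, flow: http.HTTPFlow) -> None:
        content_type = flow.response.headers.get("content-type", "")
        if "text/html" not in content_type:
            return
        ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
        # Name the file by a hash of the URL so revisits overwrite the old copy.
        name = hashlib.sha256(flow.request.pretty_url.encode()).hexdigest()[:16]
        (ARCHIVE_DIR / f"{name}.html").write_bytes(flow.response.content)

addons = [Archiver()]
```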

By @AeZ1E - 5 months
the idea of personal data centralization sounds intriguing, but let's be real - companies will always find a way to keep a grip on our info. maybe it's time for a digital revolution, or just another excuse for me to procrastinate on coding.
By @BodyCulture - 5 months
"Centralize" is a privacy anti-pattern. Maximum centralisation should be your KeePass file.
By @captn3m0 - 5 months
I’ve been working on a lot of similar ideas over the years, and my current ideal stack is to:

1. Use Mobile App APIs.

2. Generate OpenAPI Arazzo workflows.

1 ensures breakage is minimal, since mobile apps are slow to upgrade and older versions are expected to keep working. 2 lets you write repeatable recipes in YAML, which makes them quite portable to other systems.

The Arazzo spec is still quite early, but I am hopeful about this approach.

By @mcslurryhole - 5 months
As someone who used to write scrapers for a living: this is going to break constantly. Cool concept, though.
By @bdcravens - 5 months
Personally, I'd probably prefer to use something like Huginn to build a customized approach for the online platforms I'm interested in, rather than rely on a curated list.

https://github.com/huginn/huginn

By @slalani304 - 5 months
hey, sahil here. i'm one of the contributors on surfer and have been working on this project for around three weeks now. we appreciate the feedback and are excited to keep pushing this project forward with your input!
By @colordrops - 5 months
Seems like an easy way to get locked out of your accounts.
By @methyl - 5 months
Not to be confused with Surfer, the SEO content optimization platform.