August 7th, 2024

Imagining a personal data pipeline

The article advocates for a personal data pipeline to manage personal data effectively, addressing challenges like data duplication and trust issues, while proposing an open-source, user-friendly solution for privacy and control.

The article discusses the concept of a personal data pipeline, emphasizing the importance of managing and utilizing personal data generated through daily activities. The author reflects on their own experiences with data collection, expressing a desire to consolidate various data sources while maintaining control over their information. They highlight the challenges of relying on third-party applications, including data duplication, trust issues, and potential data loss if services shut down or are acquired by untrustworthy companies. The author proposes a system inspired by professional data pipelines, which would allow users to extract, store, and transform their data in a secure and manageable way. Key components of this system include a data getter for importing data, a key store for managing API credentials, a JSON store for data storage, and a processor for transforming data. The author envisions a user-friendly interface that simplifies the process of managing personal data while ensuring privacy and control. The overall goal is to create a sustainable and open-source solution that empowers individuals to take charge of their data without becoming overwhelmed by the technical complexities involved.

- The author advocates for a personal data pipeline to manage and utilize personal data effectively.

- Key challenges include data duplication, trust issues with third-party apps, and potential data loss.

- The proposed system draws inspiration from professional data pipelines, focusing on user control and privacy.

- Components of the system include data getters, key stores, JSON stores, and processors for data transformation.

- The goal is to create an open-source, user-friendly solution for personal data management.
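
The four components listed above can be sketched in a few lines of Python. This is only an illustration, not the article author's design: the class names (`KeyStore`, `JsonStore`), the `fake_getter` function, the `"reading"` service, and the sample records are all hypothetical, and the getter returns canned data instead of calling a real API.

```python
import json
import tempfile
from pathlib import Path


class KeyStore:
    """Hypothetical key store: holds API credentials in memory only."""

    def __init__(self):
        self._keys = {}

    def set(self, service, token):
        self._keys[service] = token

    def get(self, service):
        return self._keys[service]


class JsonStore:
    """Hypothetical JSON store: one JSON file per named dataset."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name, records):
        (self.root / f"{name}.json").write_text(json.dumps(records))

    def load(self, name):
        return json.loads((self.root / f"{name}.json").read_text())


def fake_getter(token):
    """Hypothetical data getter; stands in for a third-party API call."""
    if not token:
        raise ValueError("missing credential")
    return [
        {"title": "Book A", "minutes": 30},
        {"title": "Book B", "minutes": 45},
    ]


def total_minutes(records):
    """Processor: turn raw records into a derived summary."""
    return {"total_minutes": sum(r["minutes"] for r in records)}


def run_pipeline(root):
    keys = KeyStore()
    keys.set("reading", "example-token")
    store = JsonStore(root)
    # Extract and store the raw data before transforming it.
    store.save("reading_raw", fake_getter(keys.get("reading")))
    summary = total_minutes(store.load("reading_raw"))
    store.save("reading_summary", summary)
    return summary


with tempfile.TemporaryDirectory() as tmp:
    summary = run_pipeline(tmp)
print(summary)  # {'total_minutes': 75}
```

Storing the raw records before processing them means a processor can be rewritten and re-run later without fetching the data again.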

AI: What people are saying
The comments reflect a mix of personal experiences and insights related to managing personal data pipelines.
  • Many users express skepticism about the practical value of aggregating personal data, questioning how often they would actually use it.
  • Several commenters share their own projects or tools that aim to address similar challenges, highlighting a variety of approaches to personal data management.
  • Concerns about data privacy and the potential for creating additional digital chores are prevalent, with some preferring manual methods for sensitive tasks.
  • There is a recognition of the complexity involved in unifying data from different sources, with some suggesting alternative frameworks or languages like Datalog for easier querying.
  • Some users reference existing tools and projects that have attempted to solve these issues, indicating a community of interest in personal data management solutions.
17 comments
By @plaidfuji - 5 months
I also think about this general problem a fair amount, but this:

> ... There is a whole bunch of toil baked into this hobby and I'm wary of creating an endless source of digital chores for myself

Always stops me from pursuing it seriously. Also that I already do a fair amount of ELT at work and couldn’t tolerate it as a hobby.

But this framework makes sense. Seems like the idea is connector, schema mapping, and datatype standardization configured in one place. It’s a well thought-out framework and I actually have an internal platform at work that accomplishes something very similar, albeit for a totally different purpose.

But I also personally wouldn’t see a ton of value from this, except if it were used for bills, taxes, and financial management. But then the privacy aspect becomes paramount. There’s a reason people just have to do that stuff manually.

I would be surprised if Apple and Google didn’t eventually start to build something like this. Google is already pretty good at unifying email and calendar. It’s something that’s really only possible to deploy at the mobile OS level, because any other alternative would involve sending all of your data to a third party platform, which for most people these days is a non-starter. Plus with LLMs being a thing, the perfect interface to centralized/standardized personal data now exists.

By @LeonB - 5 months
HN user SimonW, who created Datasette, gave a talk in 2020 on “Dogsheep”, his tool for harvesting and processing personal data from a series of third parties.

https://simonwillison.net/2020/Nov/14/personal-data-warehous...

Here is more about Dogsheep: “Dogsheep is a collection of tools for personal analytics using SQLite and Datasette.”

https://dogsheep.github.io/

By @sevazhidkov - 5 months
My friend and I had a similar idea a few years ago, so we built a prototype of a tool that converts personal data exports to a single SQLite database: https://github.com/bionic/bionic (the repo includes a “popular Spotify songs when I’m in transit according to Google Maps” query). Unfortunately, we haven’t found ourselves actually using the aggregated data: we’ve looked at it a few times, but it didn’t end up solving any real pain. It was fun to build though!
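
A cross-source query of the kind this comment mentions might look like the sketch below. It does not use the actual schema of the linked repo: the `plays` and `trips` tables, their columns, and the sample rows are assumptions made up for illustration, using only Python's built-in `sqlite3` module.

```python
import sqlite3

# In-memory database with two hypothetical tables, one per data source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plays (track TEXT, played_at TEXT);
    CREATE TABLE trips (mode TEXT, started_at TEXT, ended_at TEXT);
""")
conn.executemany("INSERT INTO plays VALUES (?, ?)", [
    ("Song A", "2024-01-01T08:05:00"),
    ("Song A", "2024-01-02T08:10:00"),
    ("Song B", "2024-01-01T08:20:00"),
    ("Song C", "2024-01-01T12:00:00"),  # played outside any trip
])
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    ("transit", "2024-01-01T08:00:00", "2024-01-01T08:30:00"),
    ("transit", "2024-01-02T08:00:00", "2024-01-02T08:30:00"),
])

# Count plays whose timestamp falls inside a transit trip.
# ISO 8601 strings in the same format sort chronologically as text.
rows = conn.execute("""
    SELECT p.track, COUNT(*) AS n
    FROM plays AS p
    JOIN trips AS t
      ON p.played_at BETWEEN t.started_at AND t.ended_at
    WHERE t.mode = 'transit'
    GROUP BY p.track
    ORDER BY n DESC, p.track
""").fetchall()
print(rows)  # [('Song A', 2), ('Song B', 1)]
```

Once both exports sit in one database, the join itself is short; most of the effort goes into getting each export into a table in the first place.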
By @nicbou - 5 months
I have built a timeline thing to gather all of my data as an augmented diary.

https://nicolasbouliane.com/projects/timeline

The newer version is basically a static site generator running on top of my data. The older version was actively fetching the data from various sources, but it was getting a little unwieldy.

The biggest challenge is to automatically get your data out of Google Photos, social networks, and even your own phone. All of my handwritten notes and my sketches are stuck on my iPad and must be manually exported. It's tedious and unsustainable.

Same with social networks. Data must be manually exported. There is also no simple, human-readable file format for social media posts. You have to parse the export format they choose to use at the moment.

By @jskherman - 5 months
This is the whole main challenge of the quantified self movement all over again.

There have been a lot of attempts to solve this problem, but not much has been found, possibly because setting up ELT processes involves a lot of chores (just think about the inconsistent data formats across services). It's like having a second job in data engineering, and I'm not even remotely in the software/data industry! I just like coding and do it as a hobby.

By @yevpats - 5 months
This is why we built CloudQuery (https://github.com/cloudquery/cloudquery), an open source, high-performance ELT framework powered by Apache Arrow (the framework is open source, our connectors are closed source). You can run a local pipeline, write plugins (extractors) in Go, Python, JavaScript, or any other language, and save data to any destination (files, SQLite, DuckDB, PostgreSQL, ...).

(Founder here)

By @kaz-inc - 5 months
I have a project I've built that's somewhat like this, ironically called Pipeline [0]. It's a manual-entry, timestamped note-taking system, and the UI is like messaging yourself. I've set it up over a WireGuard VPN server and it connects all of my devices, it works offline as a PWA, and I've tested it on Chrome/Firefox/Safari on iOS/Linux/Android/macOS/Windows. It mostly works on all of those platforms, and some of my friends/family use it to take notes for themselves.

The fundamental query I usually use is substring search. The only content is text, because I believe in the primacy of plaintext. The notes for the last 4 years of my life take up 60 megs, and it takes half a second on a 5-year-old Android phone to parse all of it, and less than 50ms to search through all of it, so I can do it incrementally on every keystroke.
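
As a rough illustration of substring search over timestamped plaintext notes, here is a minimal Python sketch. It is not the code from the linked project: the `notes` list, the `search` function, and the sample entries are made up for this example.

```python
from datetime import datetime

# Hypothetical notes: (timestamp, text) pairs.
notes = [
    (datetime(2024, 1, 5, 9, 0), "Buy oat milk and coffee"),
    (datetime(2024, 2, 11, 18, 30), "Coffee with Sam at 10"),
    (datetime(2024, 3, 2, 7, 45), "Renew passport"),
]


def search(notes, query):
    """Return every note whose text contains the query, ignoring case."""
    needle = query.casefold()
    return [(ts, text) for ts, text in notes if needle in text.casefold()]


# Incremental search: re-run the full scan after each simulated keystroke.
typed = ""
for key in "coffee":
    typed += key
    hits = search(notes, typed)

print([text for _, text in hits])
```

A linear scan like this touches every note on every keystroke, which is workable when the whole corpus is tens of megabytes of text.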

[0] Pipeline Notes: https://github.com/kasrasadeghi/pipeline-js

I'm not a web developer by trade, so if anyone has any feedback on security/UI/service workers, please let me know!

By @compsciphd - 5 months
About a decade ago, I took on a personal toy project to teach myself larger-scale programming in Java, as well as the APIs provided by multiple internet services (Google, Twitter, Facebook, ...).

The project was to collect and make searchable my "internet self", something I called "personal search". The idea was to index every web page I look at and every e-mail I get or see, plus social media content shared with me by my social graph (using said APIs), and further to index the pages shared.

The indexing itself wasn't the hard part per se. At the time, the APIs from Facebook, Twitter, et al. were very expansive; they are much more limited these days. One can attempt to deal with that through intelligent DOM scraping, but that's a never-ending race in which your sources keep changing things and you keep chasing their changes. The real question is: how does this information create significant value for me? That is, how often am I actually going to search this personal archive? I search Google many times a day to find new things (or re-find things I already found through it), but how often do I search for things within my personal index during a normal day? A handful of times at most, I'd think (and often not even once).

With that said, the next thing I decided to teach myself was writing a browser extension that could do a similarity search (of sorts) between the documents in my personal index and the web page I'm currently looking at, plus content related to that (e.g., when looking at a news article about current events, it should surface other articles your friends have shared on the topic and their comments on it). That ended up being an area I didn't have the time (or expertise) to really go far with, so it sort of ended there.

By @burakemir - 5 months
For the subproblem of being able to unify and query various data sources in different formats, I would suggest taking a look at Datalog and specifically Mangle, my implementation of it. I don't want to plug the project here but rather describe the approach.

Usually your data will comfortably fit in a file. Your data getter emits these files as facts (essentially relations). If you want structured data, it can also be a single column of some struct type (similar to protobuf).

With all data available, the problem becomes one of querying. With a good enough query language and system, you can write these data transformations as Datalog rules, which roughly correspond to database views.

It is always possible to write queries in code in a general-purpose language, but that is a bit clumsy, and it is hard to get an overview or to reuse anything. It may also be possible to use SQL, but SQL is not very compositional, and you end up asking yourself whether the base data representation should be adapted or refactored. Essentially, you do not want to think about the optimal schema or set of structs; you just want to do the transformations you need in the lowest-friction way.

With Datalog you may benefit from a unified representation (everything is "facts") and the transformations to useful different formats (different kinds of facts) can be factored and reused. It may mean duplication and denormalization but usually that does not matter.

Mangle supports aggregation and even calling some custom functions during query evaluation. The repo is at https://github.com/google/mangle and obviously there remains a lot to do: the API is unstable, there are bugs, and the type checker is not finished... but a number of people and projects seem to use it. Even if you do not use it, it may give you ideas for how to use facts (relations) as a unified data structure for your project.
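
To make the "everything is facts, rules are views" idea concrete, here is a small Python sketch of naive bottom-up rule evaluation. It is not Mangle and does not use Mangle's syntax or API: the relation names (`played`, `in_transit`, `transit_track`), the sample facts, and the `evaluate` helper are all hypothetical, and the rule is written in generic Datalog notation only in a comment.

```python
# Base facts: each relation is a set of tuples.
facts = {
    "played": {("Song A", "mon-08"), ("Song B", "mon-08"), ("Song C", "mon-12")},
    "in_transit": {("mon-08",)},
}


def transit_track(db):
    # transit_track(Track) :- played(Track, Slot), in_transit(Slot).
    return {
        (track,)
        for (track, slot) in db["played"]
        for (transit_slot,) in db["in_transit"]
        if slot == transit_slot
    }


def evaluate(facts, rules):
    """Apply every rule repeatedly until no new facts appear (naive fixpoint)."""
    derived = {name: set(rows) for name, rows in facts.items()}
    changed = True
    while changed:
        changed = False
        for name, rule in rules.items():
            new_rows = rule(derived) - derived.get(name, set())
            if new_rows:
                derived.setdefault(name, set()).update(new_rows)
                changed = True
    return derived


result = evaluate(facts, {"transit_track": transit_track})
print(sorted(result["transit_track"]))  # [('Song A',), ('Song B',)]
```

The derived relation is just another set of tuples, so further rules can build on it the same way they build on the base facts.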

By @curiousthought - 5 months
This is actually a great use case for something like Windows Recall. Ingestion of data after the fact requires the data to be discoverable.

If there was a way to add a meta-prompt to Windows Recall like "Create a log entry every time I watch something with its title and URL" it could serve as a history whether things were watched on YouTube, Vimeo, or any other site, without requiring plugging into each service individually. Repeat ad nauseam for each thing to be logged, or perhaps someone can come up with a more clever query than I that catches everything sufficiently.

The granularity of the data on many services might be surprisingly coarse, preventing introspection of the data at a useful level.

By @noelwelsh - 5 months
I always wonder what people would do with this data. I have a more "tears in the rain" approach. I just don't think there is that much value in, say, my old workout logs. I cannot see how capturing and analyzing it all would make my life much better. I feel if there was one crazy hack that would, say, make me stronger, the people whose job is to get as strong as possible would have already found it. (It's probably PEDs.)
By @nilirl - 5 months
I made something for just this use. It's very simple, but it gives you all your data in compressed JSON. I use it myself for mostly diet and exercise, but I also log other things like movies and books I'm reading. I made it because I wanted a nice interface to review my logs.

https://www.idiotlamborghini.com/strategies/weave

By @smolder - 5 months
The promise of computing that never materialized was having software to track everything you've ever done, and leveraging that data for your own benefit and no one else's.
By @kkfx - 5 months
Well... I keep mine in notes managed with Emacs/org-mode/org-roam, all in a single integrated platform with no need for extra code, together with my mail (notmuch), contacts (org-contacts), financial transactions (beancount), files (org-attach-ed), etc., down to the infra/OS config (NixOS tangled from org-mode org-babel blocks, as are the Emacs, zsh, mplayer, ... configs).

The point is that with classic tools it IS EASY, limited only by the current sorry state of IT; with modern tools it is a nightmare that demands much effort.

By @beaugunderson - 5 months
oh man... we had so much of this done years ago with Locker, but old services die and new services are born, and existing services change their APIs constantly... it takes a lot of work to keep them up to date! https://github.com/LockerProject/Locker
By @jonahbenton - 5 months
See also Perkeep, from Brad Fitzpatrick

https://en.m.wikipedia.org/wiki/Perkeep

By @jauntywundrkind - 5 months
I'm struggling to remember/find the exact story but like 6 months ago, some dev had built a guestlist or q&a & used some off the shelf Notion-y thing - maybe some form builder tool.

There was seemingly, IMHO, a lot of protest that the dude didn't write some kind of one-off, "simple" PHP script or something for the job.

But they'd used existing tools to make a real data pipeline. And potentially they could keep making new tools around similar pipelines. They had invested in some pipe-building technology, and it felt like no one was interested in giving credit for that.

Separately, Karlicoss has HPI as their personal data toolkit/pipeline, and a massive map of services & data systems they've roped into HPI (and HPI-near) systems. https://beepb00p.xyz/hpi.html https://beepb00p.xyz/myinfra.html