Imagining a personal data pipeline
The article advocates for a personal data pipeline to manage personal data effectively, addressing challenges like data duplication and trust issues, while proposing an open-source, user-friendly solution for privacy and control.
The article discusses the concept of a personal data pipeline, emphasizing the importance of managing and utilizing personal data generated through daily activities. The author reflects on their own experiences with data collection, expressing a desire to consolidate various data sources while maintaining control over their information. They highlight the challenges of relying on third-party applications, including data duplication, trust issues, and potential data loss if services shut down or are acquired by untrustworthy companies. The author proposes a system inspired by professional data pipelines, which would allow users to extract, store, and transform their data in a secure and manageable way. Key components of this system include a data getter for importing data, a key store for managing API credentials, a JSON store for data storage, and a processor for transforming data. The author envisions a user-friendly interface that simplifies the process of managing personal data while ensuring privacy and control. The overall goal is to create a sustainable and open-source solution that empowers individuals to take charge of their data without becoming overwhelmed by the technical complexities involved.
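To make the proposed shape concrete, here is a minimal Python sketch of how those four components (data getter, key store, JSON store, processor) could fit together. The class and function names are illustrative assumptions, not the author's actual design.

```python
# A minimal sketch of the article's proposed components; names and
# interfaces are assumptions for illustration, not the author's design.
import json
from pathlib import Path


class KeyStore:
    """Holds API credentials separately from the data itself."""
    def __init__(self, path="keys.json"):
        self._keys = json.loads(Path(path).read_text()) if Path(path).exists() else {}

    def get(self, service):
        return self._keys.get(service)


class JsonStore:
    """Append-only store of raw records, one JSON Lines file per source."""
    def __init__(self, root="data"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def append(self, source, records):
        with (self.root / f"{source}.jsonl").open("a", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")

    def read(self, source):
        path = self.root / f"{source}.jsonl"
        if not path.exists():
            return []
        return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]


def run_pipeline(getter, processor, source, keys: KeyStore, store: JsonStore):
    """Wire the pieces together: fetch raw data, store it, then transform it."""
    raw = getter(keys.get(source))       # data getter: pulls from an API or export
    store.append(source, raw)            # JSON store: keep the raw records
    return [processor(r) for r in store.read(source)]  # processor: transform for use
```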
- The author advocates for a personal data pipeline to manage and utilize personal data effectively.
- Key challenges include data duplication, trust issues with third-party apps, and potential data loss.
- The proposed system draws inspiration from professional data pipelines, focusing on user control and privacy.
- Components of the system include data getters, key stores, JSON stores, and processors for data transformation.
- The goal is to create an open-source, user-friendly solution for personal data management.
- Many users express skepticism about the practical value of aggregating personal data, questioning how often they would actually use it.
- Several commenters share their own projects or tools that aim to address similar challenges, highlighting a variety of approaches to personal data management.
- Concerns about data privacy and the potential for creating additional digital chores are prevalent, with some preferring manual methods for sensitive tasks.
- There is a recognition of the complexity involved in unifying data from different sources, with some suggesting alternative frameworks or languages like Datalog for easier querying.
- Some users reference existing tools and projects that have attempted to solve these issues, indicating a community of interest in personal data management solutions.
> ... There is a whole bunch of toil baked into this hobby and I'm wary of creating an endless source of digital chores for myself
That always stops me from pursuing it seriously. That, and I already do a fair amount of ELT at work and couldn't tolerate it as a hobby.
But this framework makes sense. Seems like the idea is connector, schema mapping, and datatype standardization configured in one place. It’s a well thought-out framework and I actually have an internal platform at work that accomplishes something very similar, albeit for a totally different purpose.
But I also personally wouldn’t see a ton of value from this, except if it were used for bills, taxes, and financial management. But then the privacy aspect becomes paramount. There’s a reason people just have to do that stuff manually.
I would be surprised if Apple and Google didn’t eventually start to build something like this. Google is already pretty good at unifying email and calendar. It’s something that’s really only possible to deploy at the mobile OS level, because any other alternative would involve sending all of your data to a third party platform, which for most people these days is a non-starter. Plus with LLMs being a thing, the perfect interface to centralized/standardized personal data now exists.
https://simonwillison.net/2020/Nov/14/personal-data-warehous...
Here is more about Dogsheep: "Dogsheep is a collection of tools for personal analytics using SQLite and Datasette."
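For a sense of what that workflow looks like in practice, here is a hedged sketch of flattening a JSON export into an SQLite table so it can be queried and later browsed with Datasette. The file name and columns are made up for illustration, not any service's real export schema.

```python
# A Dogsheep-style sketch: load a JSON export into SQLite for querying.
# File name and column names are illustrative assumptions.
import json
import sqlite3

def load_export(export_path="posts_export.json", db_path="personal.db"):
    with open(export_path, encoding="utf-8") as f:
        records = json.load(f)
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, created_at TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts VALUES (?, ?, ?)",
        [(r["id"], r["created_at"], r["text"]) for r in records],
    )
    conn.commit()
    conn.close()

# Afterwards, `datasette personal.db` gives a browsable, queryable UI over the data.
```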
https://nicolasbouliane.com/projects/timeline
The newer version is basically a static site generator running on top of my data. The older version was actively fetching the data from various sources, but it was getting a little unwieldy.
The biggest challenge is to automatically get your data out of Google Photos, social networks, and even your own phone. All of my handwritten notes and my sketches are stuck on my iPad and must be manually exported. It's tedious and unsustainable.
Same with social networks. Data must be manually exported. There is also no simple, human-readable file format for social media posts. You have to parse the export format they choose to use at the moment.
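Until a common format exists, each export has to be mapped by hand onto something simple. A small sketch of that normalization step, with per-service field names as illustrative guesses rather than the services' actual export formats:

```python
# Map each service's export schema onto one small, human-readable record.
# The per-service field names are illustrative guesses, not real schemas.
import json

def normalize(service, raw):
    """Convert one raw export record into a common {when, source, text} shape."""
    if service == "twitter":
        return {"when": raw["created_at"], "source": "twitter", "text": raw["full_text"]}
    if service == "mastodon":
        return {"when": raw["published"], "source": "mastodon", "text": raw["content"]}
    raise ValueError(f"no parser for {service} exports yet")

def normalize_export(service, path):
    with open(path, encoding="utf-8") as f:
        return [normalize(service, record) for record in json.load(f)]
```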
There have been a lot of attempts to solve these problems, but not much has stuck, possibly because setting up ELT processes involves a lot of chores (just think about the inconsistent data formats across services). It's like having a second job in data engineering, and I'm not even remotely in the software/data industry! I just do coding as a hobby.
(Founder here)
The fundamental query I usually use is substring search. The only content is text, because I believe in the primacy of plaintext. The notes for the last 4 years of my life take up 60 megs; it takes half a second on a 5-year-old Android phone to parse all of it, and less than 50 ms to search through all of it, so I can do it on every keystroke, incrementally.
[0] Pipeline Notes: https://github.com/kasrasadeghi/pipeline-js
I'm not a web developer by trade, so if anyone has any feedback on security/UI/service workers, please let me know!
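As a rough illustration of the approach described above (not Pipeline Notes' actual code), loading every note once and substring-searching per keystroke can be sketched like this:

```python
# Load every note once (lowercased), then a plain substring scan is cheap
# enough to run on each keystroke. Assumes notes live as .txt files.
from pathlib import Path

def load_notes(notes_dir="notes"):
    """Read and lowercase every note up front; tens of MB fit easily in RAM."""
    return {p.name: p.read_text(encoding="utf-8").lower()
            for p in Path(notes_dir).glob("*.txt")}

def search(notes, query):
    """Case-insensitive substring search across all loaded notes."""
    q = query.lower()
    return [name for name, text in notes.items() if q in text]

# notes = load_notes()
# search(notes, "pipeline")  # fast enough to call on every keystroke
```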
The project was to try to collect and make searchable my "internet self", something I called "personal search". The idea was to index every web page I look at, every e-mail I get or see, social media content shared with me by my social graph (using said APIs), and, further, the pages being shared.
The indexing itself wasn't the hard part (per se; at the time the Facebook, Twitter, et al. APIs were very expansive, though they are much more limited these days, and one can attempt to deal with that through intelligent DOM scraping, but that's a never-ending race where your sources keep changing things and you are chasing their changes). The question is how this information really creates significant value for me, i.e. how often am I actually going to search this personal archive? I search Google many times a day to find new things (or re-find things I already found through it), but how often do I search for things within my personal index during a normal day? A handful of times at most, I'd think (and often not even once).
With that said, what I then decided to try to teach myself was writing a browser extension that could do a similarity search (of sorts) between the documents in my personal index and the web page I'm currently looking at, plus content related to it (e.g. when looking at a news article about current events, the idea is that it should surface other articles your friends have shared on the topic and their comments on it). That ended up being an area I didn't have the time (or expertise) to go far with, so it sort of ended there.
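For anyone curious what that similarity-search idea might look like, here is a hypothetical sketch using TF-IDF cosine similarity over a local index; the directory layout and scoring are assumptions, not the commenter's extension:

```python
# Rank documents in a local personal index by similarity to the page
# currently being viewed, using TF-IDF cosine similarity. The file layout
# (one .txt file per indexed document) is an assumption for illustration.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def related_documents(current_page_text, index_dir="personal_index", top_n=5):
    paths = sorted(Path(index_dir).glob("*.txt"))
    corpus = [p.read_text(encoding="utf-8") for p in paths]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(corpus + [current_page_text])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(zip(scores, paths), reverse=True)[:top_n]
    return [(path.name, round(float(score), 3)) for score, path in ranked if score > 0]
```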
Usually your data will comfortably fit in a file. Your data getter emits these files as facts (essentially relations). If you want structured data, it can also be a single column of some struct type (similar to protobuf).
With all data available, the problem becomes one of querying. With a good enough query language and system, you can write these data transformations as Datalog rules, which roughly correspond to database views.
It is always possible to write queries in code in a general-purpose language, but that is a bit clumsy and hard to get an overview of or reuse. It may also be possible to use SQL, but SQL is not very compositional and you end up asking yourself whether the base data representation should be adapted or refactored. Essentially you do not want to think about the optimal schema or set of structs; you just want to do the transformations you need in the lowest-friction way.
With Datalog you benefit from a unified representation (everything is "facts"), and the transformations into different useful formats (different kinds of facts) can be factored and reused. It may mean duplication and denormalization, but usually that does not matter.
Mangle supports aggregation and even calling some custom functions during query evaluation. The repo is at https://github.com/google/mangle and obviously there remains a lot to do: the API is unstable, there are bugs, and the type checker is not finished... but a number of people and projects seem to use it. Even if you do not use it, it may give you an idea of how to use facts (relations) as a unified data structure for your project.
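To illustrate the "everything is facts" idea without Mangle's actual syntax, here is a plain-Python sketch in which base facts are tuples and a derived relation plays the role of a Datalog rule (roughly a database view); the relation names and data are made up:

```python
# Plain-Python illustration of facts and a derived relation; NOT Mangle's
# API or Datalog syntax, just the shape of the idea.
#
# Datalog-style rule this mimics:
#   listened_artist(Artist, Day) :- played(Track, Day), track_artist(Track, Artist).

played = [            # played(Track, Day)
    ("track:1", "2024-05-01"),
    ("track:2", "2024-05-01"),
]
track_artist = [      # track_artist(Track, Artist)
    ("track:1", "Artist A"),
    ("track:2", "Artist B"),
]

def listened_artist(played, track_artist):
    """Join the two base relations to derive listened_artist(Artist, Day)."""
    artists = dict(track_artist)
    return sorted({(artists[track], day) for track, day in played if track in artists})

print(listened_artist(played, track_artist))
# [('Artist A', '2024-05-01'), ('Artist B', '2024-05-01')]
```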
If there were a way to add a meta-prompt to Windows Recall like "Create a log entry every time I watch something, with its title and URL", it could serve as a watch history whether things were watched on YouTube, Vimeo, or any other site, without requiring plugging into each service individually. Repeat ad nauseam for each thing to be logged, or perhaps someone can come up with a cleverer query than I can that catches everything sufficiently.
The granularity of the data many services expose might be surprisingly coarse, preventing introspection of the data at a useful level.
The point is that with classic tools this is EASY, limited only by the current sorry state of IT; with modern tools it is a nightmare that demands a lot of effort.
There was, IMHO, a lot of protest that the author didn't just make some kind of one-off, "simple" PHP script for the job.
But they'd used existing tools to make a real data pipeline, and could potentially keep making new tools around similar pipelines. They had invested in some pipeline-building technology, and it felt like no one was interested in giving them credit for that.
Separately, Karlicoss has HPI as their personal data toolkit/pipeline, and a massive map of services & data systems they've roped into HPI (and HPI-adjacent) systems. https://beepb00p.xyz/hpi.html https://beepb00p.xyz/myinfra.html