June 24th, 2024

Resilient Sync for Local First

The "Local-First" concept emphasizes empowering users with data on their devices, using Resilient Sync for offline and online data exchange. It ensures consistency, security, and efficient synchronization, distinguishing content changes and optimizing processes. The method offers flexibility, conflict-free updates, and compliance documentation, with potential enhancements for data size, compression, and security.


The article discusses the concept of "Local-First" in data synchronization, which aims to empower users by bringing data back to their devices while still leveraging the advantages of the internet. The proposed Resilient Sync method suggests a simple data exchange format that works both offline and online, ensuring data consistency and security. Each client maintains a log of changes linked to a unique identifier, enabling easy replication and backup. The approach distinguishes between content changes and larger assets like images, keeping synchronization efficient. The log can be implemented on top of a database or a file system, offering flexibility based on service requirements. The method's benefits include self-initiated data retrieval, conflict-free updates, and a comprehensive change history useful for compliance purposes. Potential refinements involve data size control, compression, and cryptographic enhancements. The approach prioritizes simplicity, flexibility, and adaptability to evolving technologies, promoting a future-proof solution for data synchronization.
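
As a rough illustration of the change-log idea (the entry fields and helper names below are assumptions made for illustration, not the article's actual format), a per-client append-only log might look like this:

```typescript
// Hypothetical sketch of a per-client append-only change log.
// Field names and structure are illustrative assumptions, not the
// article's actual wire format.

interface ChangeEntry {
  clientId: string;    // unique identifier of the originating client
  seq: number;         // monotonic position in that client's log
  change: unknown;     // the content change itself (e.g. a CRDT op)
  assetRefs: string[]; // larger assets (images, etc.) referenced out of band
}

class ChangeLog {
  private entries: ChangeEntry[] = [];

  constructor(readonly clientId: string) {}

  // Append a local change; the log only ever grows.
  append(change: unknown, assetRefs: string[] = []): ChangeEntry {
    const entry: ChangeEntry = {
      clientId: this.clientId,
      seq: this.entries.length,
      change,
      assetRefs,
    };
    this.entries.push(entry);
    return entry;
  }

  // Self-initiated retrieval: a peer asks for everything after the
  // last sequence number it has already seen from this client.
  entriesAfter(seq: number): ChangeEntry[] {
    return this.entries.filter((e) => e.seq > seq);
  }
}
```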

7 comments
By @ergl - 5 months
The ideas described here are very similar to what Secure Scuttlebutt (SSB, https://ssbc.github.io/docs/ssb/faq.html) implemented.

The main problem with having a log is that it grows with every single change (how granular are your changes? With CRDTs, any mutation, no matter how small, is a change). Questions of data retention (is your protocol tolerant to missing log entries?) and data rewriting (if your log is a Merkle tree, rewriting something in the past means rewriting all subsequent entries) are also open.

The post also mentions that the log entries could be CRDTs. But if that's so, then you don't need a log at all, since all the information you need to compute the minimal delta to sync between peers is inside the CRDT itself. For a good overview of how this can be done, see the "Calculating Diffs" section of https://ditto.live/blog/dittos-delta-state-crdts (disclaimer: I work at ditto)
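
(Not Ditto's actual code, just a minimal sketch of the delta-state idea with made-up types: the CRDT's version-vector metadata alone is enough to compute what a peer is missing, no separate log required.)

```typescript
// Sketch of a delta-state map: each write gets a "dot" (replica id +
// counter), and a version vector records which dots each side has seen.
// A simplified LWW rule resolves concurrent writes to the same key.

type ReplicaId = string;
type VersionVector = Map<ReplicaId, number>;

interface Dot { replica: ReplicaId; counter: number; }
interface Entry<V> { value: V; dot: Dot; }

class DeltaMap<V> {
  private state = new Map<string, Entry<V>>();
  private vv: VersionVector = new Map();
  private counter = 0;

  constructor(readonly replica: ReplicaId) {}

  set(key: string, value: V): void {
    const dot = { replica: this.replica, counter: ++this.counter };
    this.state.set(key, { value, dot });
    this.vv.set(this.replica, dot.counter);
  }

  versionVector(): VersionVector { return new Map(this.vv); }

  // The delta: only entries the remote replica has not yet observed.
  deltaSince(remote: VersionVector): Map<string, Entry<V>> {
    const delta = new Map<string, Entry<V>>();
    for (const [key, entry] of this.state) {
      const seen = remote.get(entry.dot.replica) ?? 0;
      if (entry.dot.counter > seen) delta.set(key, entry);
    }
    return delta;
  }

  merge(delta: Map<string, Entry<V>>): void {
    for (const [key, entry] of delta) {
      const cur = this.state.get(key);
      // Deterministic (simplified) LWW: higher counter wins,
      // replica id breaks ties.
      if (!cur || entry.dot.counter > cur.dot.counter ||
          (entry.dot.counter === cur.dot.counter &&
           entry.dot.replica > cur.dot.replica)) {
        this.state.set(key, entry);
      }
      const seen = this.vv.get(entry.dot.replica) ?? 0;
      this.vv.set(entry.dot.replica, Math.max(seen, entry.dot.counter));
    }
  }
}
```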

By @gwbas1c - 5 months
I was desktop client lead for Syncplicity (major Dropbox competitor) for almost a decade.

Some thoughts:

Most important: Git (the protocol) kinda-sorta does this already. I personally don't know if git is a CRDT, but the .git folder is a database that contains the entire history of the repo. The "git" command can be used to sync a repository among different computers; you don't need to use GitHub or set up a git server. (But everyone does it that way because it's much, much easier.)

Secondly: The proposal assumes that everyone will rewrite their software to be CRDT-based and fit into this schema. Writing software this way is hard, and then we need to convince the general public to adopt versions of software that are built around this system. Could you port LibreOffice to write out documents this way?

Thirdly: Resolving conflicts when computers are disconnected from the network is always a mess for end users. (This was a huge pain point for Syncplicity's users, and is a huge pain point in other "sync" products.) CRDTs don't "solve" the problem for users: The best you can hope for is something like git where it identifies merge conflicts, and then the user has to go in and clean them up.

For example: Two computers are disconnected from the network. Edits are made to the same sentence in a document while disconnected. What happens? The CRDT approach might result in "consistent data," but it cannot "read minds" and know what is the correct result.
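
(A toy illustration, not any real product's merge logic: a last-writer-wins register converges on both machines, but one user's edit silently loses.)

```typescript
// Toy last-writer-wins register: both replicas converge, but the
// merge cannot know which concurrent edit the users actually wanted.

interface Lww { text: string; timestamp: number; replica: string; }

function mergeLww(a: Lww, b: Lww): Lww {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.replica > b.replica ? a : b; // deterministic tie-break
}

// Two disconnected machines edit the same sentence:
const laptop:  Lww = { text: "We ship on Friday.", timestamp: 1001, replica: "laptop" };
const desktop: Lww = { text: "We ship on Monday.", timestamp: 1002, replica: "desktop" };

// After reconnecting, both replicas converge to the same value...
console.log(mergeLww(laptop, desktop).text); // "We ship on Monday."
// ...but the laptop user's edit is silently discarded. The data is
// "consistent"; the intended result still needs a human.
```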

----

IMO, some things I would consider:

Try to somehow replicate the desired workflows with git or a similar tool. (Remember Mercurial?) See what kind of problems pop up.

Consider writing a "git aware" word processor that is based around Markdown and is somewhat aware of how git behaves around conflicts.

Try "bolting on" a sync protocol (or git) to LibreOffice as a way to understand just how easy / hard a project like this is.

Consider encapsulating the entire database in a SQLite file, instead of using the filesystem.
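
(For the SQLite idea, a rough sketch using the better-sqlite3 npm package; the schema is a made-up example, not a prescription.)

```typescript
// Sketch: the whole change log lives in one SQLite file, so the
// "database" is a single portable artifact you can copy or back up.
import Database from "better-sqlite3";

const db = new Database("sync.db");
db.exec(`
  CREATE TABLE IF NOT EXISTS change_log (
    client_id TEXT    NOT NULL,
    seq       INTEGER NOT NULL,
    change    TEXT    NOT NULL,   -- JSON-encoded operation
    PRIMARY KEY (client_id, seq)
  )
`);

const insert = db.prepare(
  "INSERT INTO change_log (client_id, seq, change) VALUES (?, ?, ?)"
);

function appendChange(clientId: string, seq: number, change: object): void {
  insert.run(clientId, seq, JSON.stringify(change));
}

// Sync: everything a peer hasn't seen yet from a given client.
const after = db.prepare(
  "SELECT seq, change FROM change_log WHERE client_id = ? AND seq > ? ORDER BY seq"
);

function entriesAfter(clientId: string, seq: number) {
  return after.all(clientId, seq);
}
```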

By @gritzko - 5 months
Excellent. This approach to CRDTs existed even before the term itself was invented. In the 2010 article[2] on Causal Trees[1], your humble servant calls these per-peer op logs "yarns". In 2011, observing the proliferation of proposals, Shapiro & friends propose[3] the term "CRDT".

That is essentially a partially-ordered approach to oplogs. Oplogs (binlogs, WALs) underlie the entire universe of databases. The problems are exactly the same as the ones "normal" database developers have faced all along: how much history do you want to keep? How do you synchronize the log and the state reliably? What is the format of the log? How do you store and transfer the log reliably?
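
(A hypothetical sketch of one such log format, with made-up field names: each entry records its position in its peer's yarn plus a causal context, which is enough to recover the partial order.)

```typescript
// Hypothetical entry format for a partially ordered per-peer op log.
// Assumes an entry's causal context also covers the previous entry
// of its own yarn (causalContext[peer] === seq - 1).

interface OpLogEntry {
  peer: string;                          // which yarn this entry belongs to
  seq: number;                           // position within that yarn
  causalContext: Record<string, number>; // highest seq seen from each peer
  op: string;                            // the operation payload
}

// a happened before b iff b's causal context already covers a.
function happenedBefore(a: OpLogEntry, b: OpLogEntry): boolean {
  return (b.causalContext[a.peer] ?? -1) >= a.seq;
}

// Entries where neither happened before the other are concurrent:
// exactly the pairs a merge function has to reconcile.
function concurrent(a: OpLogEntry, b: OpLogEntry): boolean {
  return !happenedBefore(a, b) && !happenedBefore(b, a);
}
```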

The last question seems trivial, but it is not. If you read some databases' source code, you may find a lot of paranoid things[4], obviously inspired by humiliating real-life incidents. The other questions are not trivial by any means.

So, yes, this is the way[5].

[1]: Alexei "archagon" Baboulevitch's excellent popular summary of Causal Trees, 2018: http://archagon.net/blog/2018/03/24/data-laced-with-history/

[2]: The 2010 paper https://www.researchgate.net/publication/221367739_Deep_hype...

[3]: The CRDT paper https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf

[4]: e.g. https://github.com/facebook/rocksdb/blob/main/db/log_reader....

[5]: the simultaneous post by Nikita Prokopov https://tonsky.me/blog/crdt-filesync/

By @SushiHippie - 5 months
Cross posting my comment from the other thread about local first https://news.ycombinator.com/item?id=40786425

My comment may fit a bit better here as this post talks about a protocol instead.

---

https://remotestorage.io/ was a protocol intended for this.

IIRC the vision was that all applications could implement this, and you could provide each application with your remotestorage URL, which you could self-host.

I looked into it some time ago, having been fed up with WebDAV being the only viable open protocol for file shares/synchronization (especially after hosting my own NextCloud instance, which OOMed because the XML response it built for a large folder used too much memory), and found it through this gist [0], which was a statement about Flock [1] shutting down.

It looks like a cool and not that complex protocol, but all the implementations seem to be unmaintained.

And the official JavaScript client [2], ironically, seems to be used mostly to access Google Drive or Dropbox.

Remotestorage also has an internet draft https://datatracker.ietf.org/doc/draft-dejong-remotestorage/ which is relatively easy to understand and not very long.
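
(A rough sketch of the document model from the draft: plain HTTP with ETag-based conditional writes. The URL, token, and paths below are placeholders, not a real server.)

```typescript
// Sketch of remotestorage-style document access over HTTP.
const base = "https://storage.example.com/storage/alice"; // placeholder
const token = "OAUTH_BEARER_TOKEN";                        // placeholder

async function readDoc(path: string) {
  const res = await fetch(`${base}${path}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  return { etag: res.headers.get("ETag"), body: await res.text() };
}

// Conditional write: only succeeds if nobody changed the document
// since we read it (the server answers 412 Precondition Failed otherwise).
async function writeDoc(path: string, body: string, lastEtag: string) {
  const res = await fetch(`${base}${path}`, {
    method: "PUT",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "text/plain",
      "If-Match": lastEtag,
    },
    body,
  });
  if (res.status === 412) throw new Error("conflict: document changed");
  return res.headers.get("ETag"); // the new version
}
```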

[0] https://gist.github.com/rhodey/873ae9d527d8d2a38213

[1] https://github.com/signalapp/Flock

[2] https://github.com/remotestorage/remotestorage.js

By @vlovich123 - 5 months
The challenge of course is that if your write depends on a previous read, offline/online sync can easily end up in the wrong state once you come back online (in the general case). Even CRDTs are not immune from this. For example, your offline copy of a document says step 1 is "delete line 3", so you delete it; but while you were offline, step 1 was corrected to "delete line 4". Your deletion of line 3 is now incorrect, even though the merged result is "valid". We see this every day in code development: they're called merge conflicts.
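
(A toy version of that example, with a made-up document model: every replica converges, yet the merged result deletes the wrong line.)

```typescript
// Toy illustration of a write that depended on a stale read.

type Doc = string[];

// Last state synced to the offline machine:
const staleInstruction = "step 1: delete line 3";
let offlineDoc: Doc = ["line 1", "line 2", "line 3", "line 4"];

// Offline, the user dutifully follows the stale instruction:
offlineDoc = offlineDoc.filter((_, i) => i !== 2); // removes "line 3"

// Meanwhile, online, the instruction was corrected:
const correctedInstruction = "step 1: delete line 4";

// After sync, every replica converges on the same merged state:
// the instructions now say "delete line 4", yet line 3 is the one
// gone. The merge is consistent; the user's intent is not recoverable.
console.log(offlineDoc, correctedInstruction);
```
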
By @ngrilly - 5 months
The approach seems similar to Delta Lake's consistency model, which uses object storage like S3 and yet allows concurrent writers and readers: https://jack-vanlightly.com/analyses/2024/4/29/understanding...
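
(Roughly, Delta Lake serializes commits through numbered log objects; a writer wins a version by atomically creating its log file. A sketch, assuming an object store with an atomic put-if-absent primitive; plain S3 historically lacked one, which is why Delta's S3 log store needs extra coordination.)

```typescript
// Rough sketch of Delta-style optimistic commits: transaction N is the
// object _delta_log/<N>.json, and creating it must be atomic.
// The ObjectStore interface is a made-up stand-in.

interface ObjectStore {
  // Resolves true if the key was created, false if it already existed.
  putIfAbsent(key: string, body: string): Promise<boolean>;
}

function logKey(version: number): string {
  return `_delta_log/${String(version).padStart(20, "0")}.json`;
}

async function commit(
  store: ObjectStore,
  version: number,   // the table version this writer read
  actions: object[], // files added/removed by this transaction
): Promise<number> {
  // Try successive versions until our put-if-absent wins the race.
  for (let v = version + 1; ; v++) {
    const won = await store.putIfAbsent(logKey(v), JSON.stringify(actions));
    if (won) return v;
    // Lost the race: real Delta re-checks the newly written log entries
    // for conflicts before retrying at the next version.
  }
}
```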