August 19th, 2024

Distributed Locks with Redis

The Redlock algorithm strengthens distributed locking with Redis: it preserves mutual exclusion and fault tolerance across multiple independent instances by requiring a client to acquire the lock on a majority of them within a timeout, and to release promptly on failure so the resource does not stay unavailable.

Distributed locks are essential for managing shared resources among multiple processes in a mutually exclusive manner. This document introduces the Redlock algorithm, a more robust approach to implementing distributed locks with Redis than simpler methods that may lack safety guarantees.

The Redlock algorithm operates across multiple independent Redis instances, ensuring that mutual exclusion is maintained even when nodes fail. A client attempts to acquire the lock on a majority of the instances within a specified timeout; if it succeeds, the lock's validity window is reduced by the time it took to acquire the lock. The document emphasizes the importance of safety properties, such as mutual exclusion and fault tolerance, and outlines the risks of simpler implementations that rely on a single Redis instance.

It also stresses that clients must release locks promptly if they fail to acquire a majority, to prevent prolonged unavailability of the resource. The Redlock algorithm is designed to be resilient against failures and provides a structured method for implementing distributed locks in various programming environments.

- Redlock is a distributed locking algorithm designed for use with Redis.

- It ensures mutual exclusion and fault tolerance across multiple Redis instances.

- The algorithm requires acquiring locks on a majority of instances within a specified timeout (see the sketch after this list).

- Clients must release locks promptly if they fail to acquire a majority to avoid resource unavailability.

- The document highlights the limitations of simpler locking mechanisms that rely on a single Redis instance.
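
As a rough illustration of those steps, here is a minimal acquisition/release sketch in Python with redis-py. The hostnames and helper names are placeholders, and the 1% + 2 ms drift allowance follows the clock-drift factor suggested in the algorithm's description; treat this as a sketch of the idea, not a vetted implementation:

```python
import time
import uuid

import redis

# Placeholder hosts; Redlock assumes N fully independent Redis masters.
INSTANCES = [redis.Redis(host=h) for h in
             ("redis-a", "redis-b", "redis-c", "redis-d", "redis-e")]
QUORUM = len(INSTANCES) // 2 + 1

# Delete the key only if it still holds our token, so we never remove
# a lock that has expired and been acquired by someone else.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def release(resource: str, token: str) -> None:
    for inst in INSTANCES:
        try:
            inst.eval(RELEASE_SCRIPT, 1, resource, token)
        except redis.RedisError:
            pass  # best effort: the key expires on its own anyway

def acquire(resource: str, ttl_ms: int):
    token = str(uuid.uuid4())  # unique per client and per lock attempt
    start = time.monotonic()
    acquired = 0
    for inst in INSTANCES:
        try:
            # SET resource token NX PX ttl -- succeeds only if the key is absent.
            if inst.set(resource, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            pass  # an unreachable instance simply doesn't count toward quorum
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * 0.01 + 2  # clock-drift allowance from the algorithm
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if acquired >= QUORUM and validity_ms > 0:
        return token, validity_ms  # the lock is safe for at most validity_ms
    release(resource, token)  # failed: release promptly on all instances
    return None
```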

Related

Synchronization Is Bad for Scale

Challenges of synchronization in scaling distributed systems include lock contention, which discourages lock use in high-concurrency settings. Alternatives like sharding, consistent hashing, and the Saga Pattern are suggested for efficient coordination. Examples from Mailgun's MongoDB use highlight strategies for avoiding lock contention and scaling effectively, cautioning against excessive reliance on the database.

Golang Sync Mutex: Normal and Starvation Mode

The article explains the use of sync.Mutex in Go to prevent race conditions, detailing operations like Lock and Unlock, and discussing Normal and Starvation modes for effective concurrency control.

Distributed == Relational

The article explores how distributed systems can utilize relational database principles, advocating for parallel data gathering, triggers for function invocations, and user-friendly alternatives to SQL for efficient software development.

Constraining Writers in Distributed Systems

The document outlines strategies for improving reliability in distributed storage systems, focusing on copyset replication, quorum systems, and erasure coding to enhance data integrity and recovery.

9 comments
By @zinodaur - 8 months
Martin Kleppmann has some interesting thoughts on Redlock:

> I think the Redlock algorithm is a poor choice because it is “neither fish nor fowl”: it is unnecessarily heavyweight and expensive for efficiency-optimization locks, but it is not sufficiently safe for situations in which correctness depends on the lock.

https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...

By @eurleif - 8 months
I helped! :) (A little.)

http://antirez.com/news/77

>The Hacker News user eurleif noticed how it is possible to reacquire the lock as a strategy if the client notices it is taking too much time in order to complete the operation. This can be done by just extending an existing lock, sending a script that extends the expire of the key if the value stored at the key is the expected one. If there are no new partitions, and we try to extend the lock enough in advance so that the keys will not expire, there is the guarantee that the lock will be extended.
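
A hedged sketch of that extension trick against a single instance (under Redlock you would run it against every instance and require a majority of successes); the key and function names are illustrative:

```python
import redis

r = redis.Redis()  # one instance for brevity

# Push the expiry forward only if the key still holds our token.
EXTEND_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
else
    return 0
end
"""

def extend_lock(resource: str, token: str, ttl_ms: int) -> bool:
    # False means the lock expired or changed hands; the client must
    # stop assuming it holds the lock.
    return bool(r.eval(EXTEND_SCRIPT, 1, resource, token, ttl_ms))
```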

By @alexey-salmin - 8 months
Something I don't enjoy about remote/distributed locks is that unlike distributed transactions they're usually unable to provide any strict guarantees about things they protect.

E.g. if your algorithm is:

1) Hold the distributed lock

2) Do the thing

3) Release the lock

And if the node goes dark for a while between steps 1 and 2 (e.g. 100% CPU load), by the time it reaches step 2 the lock may have already expired and another node may be holding it, resulting in a race. Adding steps like "1.1) double/triple-check the lock is still held" obviously doesn't help, because the node can go dark right after the check and resume at step 2. The probability of this is not too high, but still: no guarantees. Furthermore, at a certain scale you actually do start seeing rogue nodes, deemed dead hours ago, suddenly coming back to life and doing unpleasant things.
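
A minimal single-instance sketch of that window (key and function names are illustrative), with the doomed "1.1" check marked; nothing the client runs locally can rule out a pause between the check and step 2:

```python
import redis

r = redis.Redis()
LOCK_KEY = "resource-lock"  # illustrative name
TTL_MS = 10_000

def do_the_thing() -> None:
    ...  # the protected operation

def do_thing_with_lock(token: str) -> None:
    # 1) Hold the distributed lock.
    if not r.set(LOCK_KEY, token, nx=True, px=TTL_MS):
        return  # someone else holds it

    # 1.1) Double-check: the lock still looks held at this instant...
    if r.get(LOCK_KEY) != token.encode():
        return

    # ...but a GC pause or CPU starvation can strike exactly here. If it
    # outlasts TTL_MS, the key expires, another node acquires the lock,
    # and the next line races with that node's step 2.

    # 2) Do the thing (no longer guaranteed to be exclusive).
    do_the_thing()

    # 3) Release the lock (compare-and-delete script omitted for brevity).
```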

The rule of thumb usually is "keep locks within the same transaction space as the thing they protect", and often you don't even need locks in that case; transactions alone can be enough. If you're trying to protect something that is inherently un-transactional then, well, good luck, because these efforts are always probabilistic in nature.

A good use case for a remote lock is when it's not actually used to guarantee consistency or avoid races, but merely to prevent duplicate computation for cost/performance reasons. For all other cases I outright recommend avoiding them.

By @jetru - 8 months
Famously badly broken for anything mission-critical.
By @Bogdanp - 8 months
To counter some of the hate in this thread: I have used this to great success as an "opportunistic" locking mechanism to, for example, reduce load on a Postgres database. The winner of the race to acquire the lock would run the (expensive) query then cache the result. On lock release, the nodes waiting on the lock would then try to read the cache before trying to acquire the lock again.
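
A rough sketch of that pattern against a single instance; the key names, TTLs, and polling interval are all made up for illustration:

```python
import time
import uuid

import redis

r = redis.Redis()
CACHE_KEY = "expensive-query:result"  # hypothetical names
LOCK_KEY = "expensive-query:lock"

def get_result(compute, lock_ttl_ms: int = 30_000, cache_ttl_s: int = 60):
    token = str(uuid.uuid4())
    while True:
        cached = r.get(CACHE_KEY)
        if cached is not None:
            return cached  # another node already did the work
        # The winner of the race runs the expensive query and caches it.
        if r.set(LOCK_KEY, token, nx=True, px=lock_ttl_ms):
            try:
                result = compute()  # e.g. the expensive Postgres query
                r.set(CACHE_KEY, result, ex=cache_ttl_s)
                return result
            finally:
                r.delete(LOCK_KEY)  # compare-and-delete omitted for brevity
        time.sleep(0.1)  # lost the race: wait, then re-check the cache
```

Because the lock here is purely opportunistic, a botched acquisition just means some duplicated work, never incorrect results.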
By @awinter-py - 8 months
redis is the easiest-to-host lock server and that's worth the risk in some applications (depending on consequence of errors obv)

inspiring + slightly terrifying that rather than a single server-side implementation, every client is responsible for its own implementation

if postgres provided fast kv cache and a lock primitive it would own

By @AtlasBarfed - 8 months
Where are the Jepsen suite tests?

Without them, this is alphaware at best.

By @anonzzzies - 8 months
I have seen (ad hoc) implementations go quite bad many times. I encounter them quite a bit as a distributed replacement for some type of db transaction where the db is something like RDS; someone thought it would be smart to write "things at scale" (they read about it on reddit etc.), when a db transaction would have been the correct solution and they didn't need scale anyway (or underestimated what current mysql/postgres can handle).