September 14th, 2024

Valkey achieved one million RPS 6 months after forking from Redis

Valkey 8.0 RC2 achieves over 1.19 million requests per second through advanced memory access techniques, including speculative execution and interleaving, with a guide for performance reproduction on AWS EC2.


Valkey has introduced significant performance enhancements in its latest version, achieving over 1.19 million requests per second (RPS) through advanced memory access techniques. The blog details how the team offloaded I/O operations to dedicated threads, allowing the main thread to focus on command execution.

Profiling revealed that the main thread spent considerable time waiting for external memory access, prompting the implementation of speculative execution and memory access amortization techniques. By interleaving memory access operations, Valkey improved the efficiency of linked list traversals and dictionary lookups, reducing memory access latency. The new approach allows the processor to issue multiple memory accesses in parallel, significantly speeding up operations. For instance, a new interleaved function for summing linked list values reduced execution time from 20.8 seconds to under 2 seconds. Additionally, prefetching memory addresses further optimized performance.

The blog also provides a guide for reproducing these performance results on an AWS EC2 instance, detailing hardware setup, server configuration, and benchmark parameters. Valkey 8.0 RC2 is now available for evaluation, showcasing the impact of these optimizations on overall system performance.
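The interleaved-traversal idea can be sketched in C. This is a simplified illustration of the technique, not Valkey's actual code: the node layout is assumed, and it interleaves across several independent lists, since loads within a single list form a dependency chain that cannot be parallelized.

```c
#include <stddef.h>

typedef struct node {
    long value;
    struct node *next;
} node;

/* Naive traversal: each load of n->next depends on the previous one
 * completing, so cache misses are paid one after another. */
long sum_list(node *n) {
    long sum = 0;
    while (n) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}

/* Interleaved traversal: advance several independent lists in lockstep.
 * Their misses have no dependency on each other, so the out-of-order
 * core can keep multiple misses in flight and overlap their latency. */
long sum_lists_interleaved(node **heads, size_t nlists) {
    long sum = 0;
    size_t live = nlists;
    while (live > 0) {
        live = 0;
        for (size_t i = 0; i < nlists; i++) {
            if (heads[i]) {
                __builtin_prefetch(heads[i]->next); /* hint the next hop */
                sum += heads[i]->value;
                heads[i] = heads[i]->next;
                live++;
            }
        }
    }
    return sum;
}
```

Both functions compute the same sums; the interleaved version only changes the order in which the memory accesses are issued.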

- Valkey 8.0 achieves over 1.19 million requests per second.

- Speculative execution and memory access amortization techniques enhance performance.

- Interleaving memory access operations reduces latency significantly.

- A guide is provided for reproducing performance results on AWS EC2.

- Valkey 8.0 RC2 is available for evaluation with new optimizations.

10 comments
By @nonane - 2 months
One thing the article doesn't mention is how they figured out that waiting for external memory access is the bottleneck. Are there any profiling tools available that would tell the developer that the CPU is waiting for external memory x% of the time?
By @MobiusHorizons - 2 months
This is really cool work! I am surprised to see this level of tuning without using cache profiling or other performance counters to identify the bottleneck and quantify the improvement.
By @secondcoming - 2 months
Redis's biggest flaw is its single-threaded design. We end up having to run separate Redis processes on each core and do client-side sharding. We're lucky that our data allows this.

We experimented with KeyDB too, but I'm not sure what state that project is in.

By @PeterZaitsev - 2 months
Great to see the Valkey team making progress well beyond keeping the old Redis version security-patched.
By @jacobgorm - 2 months
Who in their right mind uses linked lists for a database style workload? Try doing this with arrays to get a reasonable baseline.
By @throwaway81523 - 2 months
Nice, how does that compare with Pedis, which was written in C++ several years ago? It's an incomplete Redis lookalike that isn't getting current development, but it uses Seastar, the same parallelism framework as ScyllaDB.

https://github.com/fastio/1store

By @tayo42 - 2 months
The interleave thing isn't intuitive to me.

The problem with linked lists is that node addresses aren't necessarily contiguous (because of malloc) and the next pointer could be NULL, right? So why does an interleaved loop make it faster for the CPU? It's still a linked list, still arbitrary memory, still could be NULL. What am I missing here?