September 14th, 2024

Valkey achieved one million RPS 6 months after forking from Redis

Valkey 8.0 RC2 achieves over 1.19 million requests per second through advanced memory access techniques, including speculative execution and interleaving, with a guide for performance reproduction on AWS EC2.


Valkey has introduced significant performance enhancements in its latest version, achieving over 1.19 million requests per second (RPS) through advanced memory access techniques. The blog details how the team offloaded I/O operations to dedicated threads, allowing the main thread to focus on command execution.

Profiling revealed that the main thread spent considerable time waiting for external memory access, prompting the implementation of speculative execution and memory access amortization techniques. By interleaving memory access operations, Valkey improved the efficiency of linked list traversals and dictionary lookups, reducing memory access latency. The new approach allows the processor to issue multiple memory accesses in parallel, significantly speeding up operations. For instance, a new interleaved function for summing linked list values reduced execution time from 20.8 seconds to under 2 seconds. Additionally, prefetching memory addresses further optimized performance.

The blog also provides a guide for reproducing these performance results on an AWS EC2 instance, detailing hardware setup, server configuration, and benchmark parameters. Valkey 8.0 RC2 is now available for evaluation, showcasing the impact of these optimizations on overall system performance.
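The interleaved-traversal idea can be sketched in C. This is a simplified illustration of the technique, not Valkey's actual code: the node layout is assumed, and it interleaves across several independent lists, since loads within a single list form a dependency chain that cannot be parallelized.

```c
#include <stddef.h>

typedef struct node {
    long value;
    struct node *next;
} node;

/* Naive traversal: each load of n->next depends on the previous one
 * completing, so cache misses are paid one after another. */
long sum_list(node *n) {
    long sum = 0;
    while (n) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}

/* Interleaved traversal: advance several independent lists in lockstep.
 * Their misses have no dependency on each other, so the out-of-order
 * core can keep multiple misses in flight and overlap their latency. */
long sum_lists_interleaved(node **heads, size_t nlists) {
    long sum = 0;
    size_t live = nlists;
    while (live > 0) {
        live = 0;
        for (size_t i = 0; i < nlists; i++) {
            if (heads[i]) {
                __builtin_prefetch(heads[i]->next); /* hint the next hop */
                sum += heads[i]->value;
                heads[i] = heads[i]->next;
                live++;
            }
        }
    }
    return sum;
}
```

Both functions compute the same sums; the interleaved version only changes the order in which the memory accesses are issued.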

- Valkey 8.0 achieves over 1.19 million requests per second.

- Speculative execution and memory access amortization techniques enhance performance.

- Interleaving memory access operations reduces latency significantly.

- A guide is provided for reproducing performance results on AWS EC2.

- Valkey 8.0 RC2 is available for evaluation with new optimizations.

10 comments
By @nonane - 2 months
One thing the article doesn't mention is how they figured out that waiting for external memory access is the bottleneck. Are there any profiling tools available that would tell the developer that the CPU is waiting for external memory x% of the time?
By @MobiusHorizons - 2 months
This is really cool work! I am surprised to see this level of tuning without using cache profiling or other performance counters to identify the bottleneck and quantify the improvement.
By @secondcoming - 2 months
Redis's biggest flaw is its single-threaded design. We end up having to run separate Redis processes on each core and do client-side sharding. We're lucky that our data allows this.

We experimented with KeyDB too, but I'm not sure what state that project is in.

By @PeterZaitsev - 2 months
Great to see the Valkey team making progress well beyond keeping the old Redis version security-patched.
By @jacobgorm - 2 months
Who in their right mind uses linked lists for a database style workload? Try doing this with arrays to get a reasonable baseline.
By @throwaway81523 - 2 months
Nice, how does that compare with Pedis, which was written in C++ several years ago? It's an incomplete Redis lookalike that isn't getting current development, but it uses Seastar, the same parallelism framework as ScyllaDB.

https://github.com/fastio/1store

By @tayo42 - 2 months
The interleave thing isn't intuitive to me.

The problem with linked lists is that node addresses aren't necessarily contiguous (because of malloc) and the next pointer could be NULL, right? So why does an interleaved loop make it faster for the CPU? It's still a linked list, still arbitrary memory, still could be NULL. What am I missing here?