Valkey achieved one million RPS 6 months after forking from Redis
Valkey 8.0 RC2 achieves over 1.19 million requests per second through advanced memory access techniques, including speculative execution and interleaving, along with a guide for reproducing the results on AWS EC2.
Valkey has introduced significant performance enhancements in its latest version, achieving over 1.19 million requests per second (RPS) through advanced memory access techniques. The blog details how the team offloaded I/O operations to dedicated threads, allowing the main thread to focus on command execution. Profiling then revealed that the main thread spent considerable time stalled waiting on external memory accesses, prompting the implementation of speculative execution and memory access amortization techniques.

By interleaving memory access operations, Valkey improved the efficiency of linked list traversals and dictionary lookups, reducing effective memory access latency. The interleaved approach lets the processor keep multiple memory accesses in flight in parallel rather than serializing them, significantly speeding up operations: a new interleaved function for summing linked list values reduced execution time from 20.8 seconds to under 2 seconds. Prefetching memory addresses ahead of use further optimized performance.

The blog also provides a guide for reproducing these performance results on an AWS EC2 instance, detailing hardware setup, server configuration, and benchmark parameters. Valkey 8.0 RC2 is now available for evaluation, showcasing the impact of these optimizations on overall system performance.
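The interleaving idea can be sketched as follows. This is a minimal illustration, not Valkey's actual code: the function names and the fixed batch size of 16 are assumptions. The naive loop walks one list at a time, so each `->next` load must wait for the previous cache miss; the interleaved loop advances several lists in lockstep and issues a prefetch for every upcoming node first, so the CPU can overlap several misses.

```c
#include <stddef.h>

typedef struct node {
    long value;
    struct node *next;
} node;

/* Naive version: one list at a time; each ->next dereference
 * stalls until the previous cache miss resolves. */
long sum_lists_naive(node **heads, size_t nlists) {
    long total = 0;
    for (size_t i = 0; i < nlists; i++)
        for (node *n = heads[i]; n != NULL; n = n->next)
            total += n->value;
    return total;
}

/* Interleaved version: advance all lists in lockstep. Before
 * touching any node's successor, prefetch every successor, so
 * several cache misses are in flight at once. */
long sum_lists_interleaved(node **heads, size_t nlists) {
    long total = 0;
    node *cur[16];                       /* assumes nlists <= 16 */
    size_t live = 0;
    for (size_t i = 0; i < nlists; i++)
        if (heads[i] != NULL)
            cur[live++] = heads[i];
    while (live > 0) {
        for (size_t i = 0; i < live; i++)      /* issue prefetches */
            if (cur[i]->next != NULL)
                __builtin_prefetch(cur[i]->next);
        size_t kept = 0;
        for (size_t i = 0; i < live; i++) {    /* then consume nodes */
            total += cur[i]->value;
            if (cur[i]->next != NULL)
                cur[kept++] = cur[i]->next;
        }
        live = kept;
    }
    return total;
}
```

Both functions compute the same sum; the win comes purely from the memory-level parallelism the second form exposes, which matches the blog's reported drop from 20.8 seconds to under 2 seconds on their workload.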
- Valkey 8.0 achieves over 1.19 million requests per second.
- Speculative execution and memory access amortization techniques enhance performance.
- Interleaving memory access operations reduces latency significantly.
- A guide is provided for reproducing performance results on AWS EC2.
- Valkey 8.0 RC2 is available for evaluation with new optimizations.
Related
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
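The value-speculation trick described in that article can be sketched roughly like this (a simplified illustration under the assumption that nodes were allocated back-to-back, e.g. from an arena; the function name is made up). The loop guesses that the next node sits immediately after the current one, so the add can proceed on the guessed pointer while the branch predictor hides the cheap correctness check:

```c
#include <stddef.h>

typedef struct node {
    long value;
    struct node *next;
} node;

/* Value speculation: guess that the next node is adjacent in memory
 * (true for bump/arena allocation). The guess breaks the load-to-load
 * dependency chain; when it is right, the branch is perfectly
 * predicted and the real ->next load is off the critical path. */
long sum_speculative(node *head) {
    long total = 0;
    node *n = head;
    while (n != NULL) {
        total += n->value;
        node *guess = n + 1;        /* speculated successor */
        if (n->next != guess)       /* rarely-taken repair branch */
            guess = n->next;
        n = guess;
    }
    return total;
}
```

The result is always correct regardless of layout; only the speed depends on how often the guess holds. Note that an optimizing compiler may canonicalize this pattern away, which is why the original article inspects the generated assembly.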
Counting Bytes Faster Than You'd Think Possible
Matt Stuchlik's high-performance computing method counts bytes with a value of 127 in a 250MB stream, achieving 550 times faster performance using SIMD instructions and an innovative memory read pattern.
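A much simpler cousin of that technique, shown here only to illustrate the idea, compares 16 bytes per instruction with SSE2 and popcounts the match mask. This is an assumed sketch, not Stuchlik's AVX-512 kernel, and it requires an x86 target:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Count bytes equal to `target`, 16 at a time: compare a whole
 * vector against a broadcast of the target, extract the per-byte
 * match bits as an integer mask, and popcount the mask. */
size_t count_byte_sse2(const uint8_t *buf, size_t len, uint8_t target) {
    size_t count = 0, i = 0;
    __m128i t = _mm_set1_epi8((char)target);
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, t));
        count += (size_t)__builtin_popcount(mask);
    }
    for (; i < len; i++)             /* scalar tail */
        count += (buf[i] == target);
    return count;
}
```

The article's version goes far beyond this, using wider vectors and an unusual memory read pattern, but the compare-mask-popcount core is the same.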
We experimented with KeyDB too, but I'm not sure what state that project is in.
The problem with linked lists is that the memory addresses of the nodes aren't necessarily contiguous (because of malloc), and a node's next pointer could be NULL. Why does the interleaved loop make it faster for the CPU? It's still a linked list with arbitrary addresses that could be NULL. Not sure what I'm missing here?