Atomicless Per-Core Concurrency
The article explores atomicless concurrency for efficient allocator design, transitioning from per-thread to per-CPU structures on Linux. It details implementing CPU-local data structures using restartable sequences and rseq syscall, addressing challenges in Rust.
Read original articleThe article discusses the concept of atomicless concurrency in building allocators to serve multiple threads efficiently. It explains the shift from per-thread caching to per-CPU data structures, reducing contention and avoiding atomic operations in the fast path. The post delves into implementing CPU-local data structures on modern Linux using restartable sequences and the rseq syscall. It details the process of enabling rseqs for threads, creating critical sections, and handling thread-local variables to ensure proper execution and cleanup. The article also covers the challenges of initializing critical sections in Rust due to limitations in referencing labels in inline assembly. Overall, it provides insights into optimizing concurrency mechanisms for performance-critical applications on Linux systems.
Related
Atomic Operations Composition in Go
The article discusses atomic operations composition in Go, crucial for predictable results in concurrent programming without locks. Examples show both reliable and unpredictable outcomes, cautioning about atomics' limitations compared to mutexes.
Learning C++ Memory Model from a Distributed System's Perspective (2021)
The article explores C++ memory model in distributed systems, emphasizing std::memory_order for synchronization. It covers happens-before relationships, release-acquire ordering, and memory_order_seq_cst for total ordering and synchronization across threads.
Properly Testing Concurrent Data Structures
The article explores testing concurrent data structures using the Rust library loom. It demonstrates creating property tests with managed threads to simulate concurrent behavior, emphasizing synchronization challenges and design considerations.
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
Beating the Compiler
The blog post discusses optimizing interpreters in assembly to outperform compilers. By enhancing the Uxn CPU interpreter, a 10-20% speedup was achieved through efficient assembly implementations and techniques inspired by LuaJIT.
HN title is more accurate!
Related
Atomic Operations Composition in Go
The article discusses atomic operations composition in Go, crucial for predictable results in concurrent programming without locks. Examples show both reliable and unpredictable outcomes, cautioning about atomics' limitations compared to mutexes.
Learning C++ Memory Model from a Distributed System's Perspective (2021)
The article explores C++ memory model in distributed systems, emphasizing std::memory_order for synchronization. It covers happens-before relationships, release-acquire ordering, and memory_order_seq_cst for total ordering and synchronization across threads.
Properly Testing Concurrent Data Structures
The article explores testing concurrent data structures using the Rust library loom. It demonstrates creating property tests with managed threads to simulate concurrent behavior, emphasizing synchronization challenges and design considerations.
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
Beating the Compiler
The blog post discusses optimizing interpreters in assembly to outperform compilers. By enhancing the Uxn CPU interpreter, a 10-20% speedup was achieved through efficient assembly implementations and techniques inspired by LuaJIT.