July 12th, 2024

Atomicless Per-Core Concurrency

The article explores atomicless concurrency for efficient allocator design, moving from per-thread to per-CPU data structures on Linux. It details how to implement CPU-local data structures with restartable sequences and the rseq syscall, and addresses the challenges of doing so in Rust.

The article discusses atomicless concurrency as a way to build allocators that serve multiple threads efficiently. It explains the shift from per-thread caching to per-CPU data structures, which reduces contention and avoids atomic operations in the fast path. The post then implements CPU-local data structures on modern Linux using restartable sequences (rseqs) and the rseq syscall. It covers enabling rseqs for a thread, creating critical sections, and handling thread-local variables so that execution and cleanup happen correctly. The article also covers the difficulty of initializing critical sections in Rust, which stems from limitations in referencing labels in inline assembly. Overall, it offers insight into optimizing concurrency mechanisms for performance-critical applications on Linux.

Atomic Operations Composition in Go

The article discusses composing atomic operations in Go, which matters for getting predictable results in concurrent code that avoids locks. Its examples show both reliable and unpredictable outcomes, and it cautions about the limitations of atomics compared to mutexes.
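
The composition problem in the blurb above is not specific to Go. As a minimal sketch, written in Rust rather than Go and not taken from the linked article, the following compares a single atomic read-modify-write with an increment composed from two separate atomic operations. The function names, thread count, and iteration count are all hypothetical choices made for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Illustrative sizes, chosen arbitrarily for this sketch.
const THREADS: u64 = 4;
const ITERS: u64 = 10_000;

/// Increment built from an atomic load followed by an atomic store.
/// Each step is atomic, but the pair is not: another thread can update
/// the counter between the two steps, and that update is overwritten.
fn composed_increment(counter: &AtomicU64) {
    let seen = counter.load(Ordering::SeqCst);
    counter.store(seen + 1, Ordering::SeqCst);
}

/// Increment as one atomic read-modify-write; no update can be lost.
fn single_increment(counter: &AtomicU64) {
    counter.fetch_add(1, Ordering::SeqCst);
}

/// Runs `op` ITERS times on each of THREADS threads and returns the final count.
fn run(op: fn(&AtomicU64)) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..ITERS {
                    op(&counter);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::SeqCst)
}

fn main() {
    // Always THREADS * ITERS.
    println!("fetch_add:    {}", run(single_increment));
    // May be lower, and can differ from run to run.
    println!("load + store: {}", run(composed_increment));
}
```

The composed version never exceeds the expected total, but it can fall short by any number of lost updates, which is the kind of unpredictable outcome the blurb refers to.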

Learning C++ Memory Model from a Distributed System's Perspective (2021)

The article explores the C++ memory model from a distributed-systems perspective, emphasizing std::memory_order for synchronization. It covers happens-before relationships, release-acquire ordering, and memory_order_seq_cst, which provides a single total order across threads.
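
Release-acquire ordering can be illustrated with a small message-passing sketch. It is written in Rust, whose `Ordering::Release` and `Ordering::Acquire` correspond to C++'s `memory_order_release` and `memory_order_acquire`; the function name `message_pass` and the payload value are hypothetical, and the code is not taken from the linked article.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// The producer writes a payload and then sets a flag with Release.
/// The consumer spins on the flag with Acquire. Once the consumer sees
/// `true`, the payload write happens-before the consumer's payload read.
fn message_pass() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Relaxed is enough for the payload itself...
            data.store(42, Ordering::Relaxed);
            // ...because the Release store publishes every earlier write.
            ready.store(true, Ordering::Release);
        })
    };

    // The Acquire load pairs with the Release store above.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);

    producer.join().unwrap();
    seen
}

fn main() {
    println!("consumer read {}", message_pass());
}
```

If both flag operations were weakened to `Relaxed`, the memory model would no longer guarantee that the consumer observes the payload, even after it has seen the flag set.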

Properly Testing Concurrent Data Structures

The article explores testing concurrent data structures using the Rust library loom. It demonstrates creating property tests with managed threads to simulate concurrent behavior, emphasizing synchronization challenges and design considerations.

Beating the L1 cache with value speculation (2021)

Value speculation leverages the branch predictor to guess values, enhancing instruction-level parallelism and L1 cache efficiency. Demonstrated on a Xeon E5-1650 v3, it boosts throughput from 14 GB/s to 30 GB/s by predicting the next node while traversing a linked list.

Beating the Compiler

The blog post discusses optimizing interpreters in assembly to outperform compilers. By enhancing the Uxn CPU interpreter, a 10-20% speedup was achieved through efficient assembly implementations and techniques inspired by LuaJIT.

2 comments

By @jiehong - 10 months

The article seems to use “CPU” as a “CPU core”. This isn’t about multisocket systems.

HN title is more accurate!

By @tithos - 10 months

Your mini map is amazing. I'm stealing it.
