Async hazard: MMAP is blocking IO
Memory-mapped I/O can cause blocking I/O in asynchronous programming, leading to performance issues. Conventional I/O methods outperform it unless data is cached in memory, highlighting risks in concurrent applications.
Memory-mapping files for reading can simplify file access but introduces significant performance issues in asynchronous programming. Huon Wilson's experiments reveal that using memory-mapped I/O with async/await in Rust leads to blocking I/O, causing concurrent code to execute sequentially. This results in slower performance, underutilization of resources, and increased latency. Benchmarks conducted on an M1 MacBook Pro demonstrated that while memory-mapped I/O with async/await showed no concurrency improvement, conventional I/O methods performed significantly better. The underlying issue stems from how operating systems handle memory-mapped files, which can block threads while loading data from disk. This blocking behavior disrupts the cooperative scheduling of async tasks, preventing the executor from switching to other tasks. However, when data is already cached in memory, memory-mapped I/O can outperform conventional I/O due to reduced overhead. The findings suggest that while memory-mapped I/O offers a convenient API, it poses risks in concurrent environments, particularly when data is not readily available in memory.
- Memory-mapped I/O can lead to blocking I/O in async programming, causing performance issues.
- Benchmarks show that conventional I/O methods outperform memory-mapped I/O in async contexts.
- Blocking behavior occurs when data is not cached in memory, disrupting async task scheduling.
- Memory-mapped I/O can be faster than conventional I/O when data is already cached.
- Caution is advised when using memory-mapped files in concurrent applications.
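The mechanism described above can be sketched in a few lines. This is a minimal Python illustration (the article's benchmarks are in Rust, but the hazard is the same): touching uncached pages of a memory-mapped file looks like ordinary memory access to the runtime, so any page faults block the event-loop thread; shipping the touch to a worker thread with `asyncio.to_thread` keeps the loop free. File name and sizes here are arbitrary.

```python
import asyncio, mmap, os, tempfile

def touch_pages(mm: mmap.mmap) -> int:
    # Reading one byte per page may fault each page in from disk,
    # blocking the calling thread -- and, if called directly from a
    # coroutine, the whole event loop.
    page = mmap.PAGESIZE
    return sum(mm[i] for i in range(0, len(mm), page))

async def main() -> int:
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x01" * (64 * mmap.PAGESIZE))
        path = f.name
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # total = touch_pages(mm)  # would fault pages in on the loop thread
        total = await asyncio.to_thread(touch_pages, mm)  # faults happen off-loop
    os.remove(path)
    return total

print(asyncio.run(main()))  # prints 64: one byte (value 1) per page
```

The executor never sees the blocking, which is exactly why the article's async mmap benchmark degraded to sequential execution.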
Related
Download Accelerator – Async Rust Edition
This post explores creating a download accelerator with async Rust, emphasizing its advantages over traditional methods. It demonstrates improved file uploads to Amazon S3 and provides code for parallel downloads.
Synchronous Core, Asynchronous Shell
A software architecture concept, "Synchronous Core, Asynchronous Shell," combines functional and imperative programming for clarity and testing. Rust faces challenges integrating synchronous and asynchronous parts, prompting suggestions for a similar approach.
Atomicless Per-Core Concurrency
The article explores atomicless concurrency for efficient allocator design, transitioning from per-thread to per-CPU structures on Linux. It details implementing CPU-local data structures using restartable sequences and rseq syscall, addressing challenges in Rust.
Golang Sync Mutex: Normal and Starvation Mode
The article explains the use of sync.Mutex in Go to prevent race conditions, detailing operations like Lock and Unlock, and discussing Normal and Starvation modes for effective concurrency control.
Mimalloc Cigarette: Losing one week of my life catching a memory leak (Rust)
The article details a memory leak issue in a pricing engine using mimalloc, revealing that its internal bookkeeping caused memory retention. Restructuring to a single-threaded approach improved memory management.
- Many commenters acknowledge mmap's power and utility but caution that it requires expert knowledge due to its complexity and potential performance pitfalls.
- There are concerns about blocking behavior in async frameworks, with some suggesting that traditional threading may be more reliable.
- Commenters highlight the risks of unexpected page faults and the need for robust error handling when using mmap.
- Some argue that the benchmarks presented in the article may not accurately reflect real-world scenarios, particularly regarding thread usage.
- References to external literature and previous studies indicate a broader context of ongoing discussions about mmap's implications in system design.
Mmap is an interesting tool for system builders. It's super powerful, and super useful. But it's also kind of dangerous, because the gap between happy-case and worst-case performance is so large. That makes benchmarking hard, adds to the risk of stability bugs, and complicates taming tail latency. Its behavior also varies a lot between OSes.
It's also nice to see all the data in this post. Too many systems design conversations are just dueling assertions.
That probably should have been the first thing to try. Too bad the memmap2 crate does not expose this.
Also, looking at the memmap2 crate, it chooses some rather opinionated defaults depending on which function you actually call, and it makes accessing things like HUGEPAGE maps somewhat difficult, and for whatever reason includes the MAP_STACK flag when you call through this path.
I feel like a lot of rust authors put faith in crates that, upon inspection, are generally poorly designed and do not expose the underlying interface properly. It's a bad crutch for the language.
It gets even more unwieldy: now you have to add extra tracking for where faults are to be expected. Blindly hand an mmapped area to any code path that lacks such handling, and you end up having to deal with these failures there too.
Maybe that's not the case on Linux/OSX/BSD, but it definitely is on Windows, where you have to handle this via SEH - e.g. `__try/__except` - standard C++ exception handling won't cut it (on other systems I guess these would surface as signals?).
In any case, it might seem like an easy path to glory, yet it is riddled with complications.
- (minor issue) The async example is artificially limited to 1 thread (the article states this). Comparing 8 OS threads without async against 1 async thread is fundamentally not very useful unless you pin all threads to the same physical core. In general you should compare async on num_cpus threads against num_cpus*X OS threads. Though that wouldn't have been very useful in this example without pinning the tokio worker threads to CPUs to forcefully highlight the paging issue, and doing that is bothersome, so I wouldn't have done it either.
- (bigger issue) The single-threaded async "conventional IO" example is NOT single-threaded. Async _file_ IO is anywhere between nonexistent and very bad in most OSes, so most async runtimes, including tokio, do file IO on worker threads. This means the "single-threaded" conventional-IO async example is actually running 8 threads for the reads and one to "touch the buffer" (i.e. do hardly anything).
To be clear, the single-threaded example not actually being single-threaded doesn't discredit the article - the benchmarks still show the problem. It's just that the 8-thread conventional run and the 1-thread async conventional run are accidentally both basically 8-threaded.
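The same pattern is easy to see in Python, where the situation mirrors tokio's: asyncio has no real async file IO either, so an "async" read is normally shoved onto an executor thread pool. A minimal sketch (file contents and pool size are arbitrary):

```python
import asyncio, concurrent.futures, os, tempfile, threading

def read_file(path: str) -> bytes:
    # A plain blocking read; record which thread actually ran it.
    read_file.thread = threading.current_thread().name
    with open(path, "rb") as f:
        return f.read()

async def main() -> bytes:
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello")
    os.close(fd)
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        # The "async" file read is really blocking work shipped to a pool
        # thread, much like tokio::fs delegating to spawn_blocking.
        data = await loop.run_in_executor(pool, read_file, path)
    os.remove(path)
    return data

data = asyncio.run(main())
print(data, read_file.thread)  # the read ran on a pool thread, not the loop thread
```

So a "1-thread" async conventional-IO benchmark quietly has a whole thread pool behind it, which is the commenter's point.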
I think the actual worst-case would be to read the pages in a (pseudo-)random order.
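That worst case can be sketched as a shuffled one-byte-per-page walk (function names are mine); the shuffle is what defeats the kernel's sequential readahead:

```python
import mmap, random

def page_order(length: int, seed: int = 0) -> list[int]:
    # One offset per page, visited in a pseudo-random order so the
    # kernel's sequential readahead cannot predict the next fault.
    offsets = list(range(0, length, mmap.PAGESIZE))
    random.Random(seed).shuffle(offsets)
    return offsets

def touch_random(mm: mmap.mmap) -> int:
    # Touch one byte per page in shuffled order; on a file-backed,
    # uncached mapping each touch may be a major page fault.
    return sum(mm[off] for off in page_order(len(mm)))

# Demo on an anonymous mapping (no disk IO, but the same access pattern):
with mmap.mmap(-1, 16 * mmap.PAGESIZE) as mm:
    mm.write(b"\x01" * len(mm))
    print(touch_random(mm))  # prints 16: one byte (value 1) per page
```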
This doesn't necessarily defeat the purpose. How about, for example, implementing a text editor: I want the best performance when initially loading the existing file (say it is <1MB), plus the convenience and robustness of any writes to this memory being efficiently written back to disk.
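As a sketch of that editor idea (assuming a small file that fits the mapping), Python's stdlib `mmap` supports exactly this pattern: map with write access, mutate bytes in place, and `flush()` to push dirty pages back to disk.

```python
import mmap, os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    mm[0:5] = b"HELLO"   # edit the file as if it were a bytearray
    mm.flush()           # ask the OS to write the dirty pages back

with open(path, "rb") as f:
    print(f.read())      # prints b'HELLO world'
os.remove(path)
```

The catch for a real editor is that a mapping has a fixed length: growing the file means remapping (`mm.resize()`), and inserting bytes still rewrites everything after the insertion point, which is why editors tend to layer a rope or piece table on top.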
- shadow register
- L1
- L2
- L3
- RAM
- GPU/SPU/RSP
- SSD
- Network
- HDD
The line is drawn depending on what you are doing and how.
Edit: moved Network above HDD. :-)