August 21st, 2024

Async hazard: MMAP is blocking IO

Memory-mapped I/O can cause blocking I/O in asynchronous programming, leading to performance issues. Conventional I/O methods outperform it unless data is cached in memory, highlighting risks in concurrent applications.

Memory mapping files for reading can simplify file access but introduces significant performance issues in asynchronous programming. Huon Wilson's experiments reveal that using memory-mapped I/O with async/await in Rust leads to blocking I/O, causing concurrent code to execute sequentially. This results in slower performance, underutilization of resources, and increased latency. Benchmarks conducted on an M1 MacBook Pro demonstrated that while memory-mapped I/O with async/await showed no concurrency improvement, conventional I/O methods performed significantly better. The underlying issue stems from how operating systems handle memory-mapped files, which can block threads during data loading from disk. This blocking behavior disrupts the cooperative scheduling of async tasks, preventing the executor from switching to other tasks. However, when data is already cached in memory, memory-mapped I/O can outperform conventional I/O due to reduced overhead. The findings suggest that while memory-mapped I/O offers a convenient API, it poses risks in concurrent environments, particularly when data is not readily available in memory.

- Memory-mapped I/O can lead to blocking I/O in async programming, causing performance issues.

- Benchmarks show that conventional I/O methods outperform memory-mapped I/O in async contexts.

- Blocking behavior occurs when data is not cached in memory, disrupting async task scheduling.

- Memory-mapped I/O can be faster than conventional I/O when data is already cached.

- Caution is advised when using memory-mapped files in concurrent applications.
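The failure mode (and the usual workaround) can be sketched with nothing but Python's stdlib, as an analogy to the article's Rust benchmarks; the file and helper names below are illustrative, not from the article. Summing one byte per page of a fresh mapping can fault pages in synchronously; handing that work to a worker thread keeps the event loop free:

```python
import asyncio
import mmap
import os
import tempfile

def make_file(size: int) -> str:
    # Throwaway scratch file for the demo.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x" * size)
    os.close(fd)
    return path

def sum_mapped(path: str) -> int:
    # Touching each page of the mapping may fault and read from disk
    # synchronously; run inside a coroutine, this would stall the event loop.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return sum(m[i] for i in range(0, len(m), mmap.PAGESIZE))

async def main() -> int:
    path = make_file(4 * mmap.PAGESIZE)
    loop = asyncio.get_running_loop()
    # Workaround: push the potentially-faulting access onto a worker thread,
    # so page faults block that thread rather than the executor.
    total = await loop.run_in_executor(None, sum_mapped, path)
    os.remove(path)
    return total

print(asyncio.run(main()))  # 4 sampled pages of b"x": 4 * 120 = 480
```

In Rust, the corresponding escape hatch would be something like `tokio::task::spawn_blocking`, which likewise trades extra threads for an unblocked executor.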

AI: What people are saying
The discussion around memory-mapped I/O (mmap) reveals several key insights and concerns regarding its use in asynchronous programming.
  • Many commenters acknowledge mmap's power and utility but caution that it requires expert knowledge due to its complexity and potential performance pitfalls.
  • There are concerns about blocking behavior in async frameworks, with some suggesting that traditional threading may be more reliable.
  • Commenters highlight the risks of unexpected page faults and the need for robust error handling when using mmap.
  • Some argue that the benchmarks presented in the article may not accurately reflect real-world scenarios, particularly regarding thread usage.
  • References to external literature and previous studies indicate a broader context of ongoing discussions about mmap's implications in system design.
15 comments
By @mjb - 6 months
I like this point - it's no secret that mmap can make memory access cost the same as an IO (swap can too) - but the interaction with async schedulers isn't immediately obvious. The cost can, sometimes, be even higher than this post says, because of write back behavior in Linux.

Mmap is an interesting tool for system builders. It's super powerful, and super useful. But it's also kind of dangerous because the gap between happy-case and worst-case performance is so large. That makes benchmarking hard, adds to the risk of stability bugs, and complicates taming tail latency. Its behavior also varies a lot between OSes.

It's also nice to see all the data in this post. Too many systems design conversations are just dueling assertions.

By @correnos - 6 months
IMO this is a strong argument for proper threads over async: you can try to guess what will and won't block as an async framework dev, but you'll never fully match reality, and you end up wasting resources when an executor blocks when you weren't expecting it.
By @akira2501 - 6 months
> How do other mmap/madvise options influence this (for instance, MADV_SEQUENTIAL, MADV_WILLNEED, MADV_POPULATE, MADV_POPULATE_READ, mlock)? (Hypothesis: these options will make it more likely that data is pre-cached and thus fall into fast path more often, but without a guarantee.)

That probably should have been the first thing to try. Too bad the mmap2 crate does not expose this.

Also, looking at the mmap2 crate, it chooses some rather opinionated defaults depending on which function you actually call, it makes accessing things like HUGEPAGE maps somewhat difficult, and for whatever reason it includes the MMAP_STACK flag when you call through this path.

I feel like a lot of Rust authors put faith in crates that, upon inspection, are generally poorly designed and do not expose the underlying interface properly. It's a bad crutch for the language.
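The hypothesis quoted above can at least be poked at from stdlib Python, which exposes madvise on mappings wherever the OS defines the constants (a sketch; the hasattr guards are there because availability is platform-dependent, and the helper name is mine):

```python
import mmap
import os
import tempfile

def touch_with_hints(path: str) -> tuple[int, int]:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        if hasattr(mmap, "MADV_SEQUENTIAL"):
            m.madvise(mmap.MADV_SEQUENTIAL)  # hint: expect sequential reads
        if hasattr(mmap, "MADV_WILLNEED"):
            m.madvise(mmap.MADV_WILLNEED)    # hint: start paging data in now
        return m[0], m[-1]

fd, path = tempfile.mkstemp()
os.write(fd, b"a" * (8 * mmap.PAGESIZE))
os.close(fd)
print(touch_with_hints(path))  # (97, 97)
os.remove(path)
```

As the quoted hypothesis says, these are hints, not guarantees: they make the fast (pre-cached) path more likely but cannot promise an access never faults.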

By @malkia - 6 months
With mmap you have to be prepared to handle unexpected page-fault errors due to a corrupted volume: unlike standard read/write, where one can handle the failure at the call site, it can now happen anywhere the memory is accessed - your code, a third-party library, etc.

It gets even more unwieldy: you have to add extra tracking for where such access is expected. Blindly hand the mmap'd area to any code path that lacks such handling, and you will have to deal with these failures there.

Maybe that's not the case on Linux/OSX/BSD, but it definitely is on Windows. Also, in C/C++ land you have to handle this using SEH - e.g. `__try/__except` - standard C++ exception handling won't cut it (I guess on other systems these would arrive through some signals (?)).

In any case, it might seem like an easy path to glory, yet it is riddled with complications.

By @dathinab - 6 months
While the general point the article is making is correct there are some issues.

- (minor issue) The async example is artificially limited to 1 thread (the article states that). Comparing 8 OS threads with no async against 1 async thread is fundamentally not very useful unless you pin all threads to the same physical core. In general you should compare async on num_cpus threads vs. num_cpus*X OS threads. Though that wouldn't have been very useful in this example without pinning the tokio threads to CPUs to forcefully highlight the paging issue, and doing so is bothersome, so I wouldn't have done it either.

- (bigger issue) The single-threaded async "traditional IO" example is NOT single-threaded. Async _file_ IO is anywhere between nonexistent and very bad in most OSes, so most async runtimes, including tokio, do file IO on worker threads. This means the "single threaded" conventional-IO async example is running 8 threads for reading and one to "touch the buffer" (i.e. do hardly anything).

To be clear, the single-threaded example not being single-threaded doesn't discredit the article - the benchmarks still show the problem - it's just that the 8-threaded conventional and 1-threaded async conventional runs are accidentally both, in effect, 8-threaded.
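The point about "async" file IO really being a thread pool is easy to see in stdlib Python too (a sketch; names are mine, not from the article): the read runs on an executor thread, not the event-loop thread, even though the program looks single-threaded:

```python
import asyncio
import os
import tempfile
import threading

def read_all(path: str) -> tuple[bytes, str]:
    # Conventional blocking read; also report which thread actually ran it.
    with open(path, "rb") as f:
        return f.read(), threading.current_thread().name

async def main() -> tuple[bytes, str]:
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello")
    os.close(fd)
    loop = asyncio.get_running_loop()
    # "Async" file IO: the blocking read is serviced by a pool thread.
    data, worker = await loop.run_in_executor(None, read_all, path)
    os.remove(path)
    return data, worker

data, worker = asyncio.run(main())
print(data, worker != threading.main_thread().name)  # the read ran off the main thread
```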

By @Retr0id - 6 months
> This is thus a worst case, the impact on real code is unlikely to be quite this severe!

I think the actual worst-case would be to read the pages in a (pseudo-)random order.
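That pseudo-random order is easy to emulate (a stdlib-only sketch, seeded so it's repeatable): shuffle the page offsets before touching them, which defeats the kernel's sequential read-ahead:

```python
import mmap
import os
import random
import tempfile

def touch_random_order(path: str, seed: int = 0) -> int:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        offsets = list(range(0, len(m), mmap.PAGESIZE))
        random.Random(seed).shuffle(offsets)  # deterministic "random" page order
        # Each access may fault in a page; out of order, read-ahead can't help.
        return sum(m[o] for o in offsets)

fd, path = tempfile.mkstemp()
os.write(fd, b"z" * (16 * mmap.PAGESIZE))
os.close(fd)
print(touch_random_order(path))  # 16 pages sampled: 16 * ord("z") = 1952
os.remove(path)
```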

By @davesque - 6 months
I always thought that one of the use cases of memory mapping was to improve multiprocessing workloads, where a group of processes don't have to duplicate the same region of a working set. In that sense, maybe it's not surprising that single-threaded concurrency can't leverage all of the benefits of memory mapping.
By @dsp_person - 6 months
> One possible implementation might be to literally have the operating system allocate a chunk of physical memory and load the file into it, byte by byte, right when mmap is called… but this is slow, and defeats half the magic of memory mapped IO: manipulating files without having to pull them into memory

This doesn't necessarily defeat the purpose. How about, for example, implementing a text editor: I want the best performance from loading the existing file up front (say it is <1MB), plus the convenience and robustness of any writes to this memory being efficiently written back to disk.
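For small files that in-place pattern is straightforward (a stdlib sketch; note that a plain mapping only supports in-place edits, so insertions or growth mean truncate-and-remap):

```python
import mmap
import os
import tempfile

def patch_first_byte(path: str, value: int) -> None:
    # Map the file writable, edit in place, and flush dirty pages to disk.
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
        m[0] = value
        m.flush()

fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)
patch_first_byte(path, ord("H"))
with open(path, "rb") as f:
    print(f.read())  # b'Hello world'
os.remove(path)
```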

By @dekhn - 6 months
I used to really like mmap for a wide range of uses (having noticed its performance in the BLAST DNA/protein search command) but over time I've come to consider it a true expert tool with deep subtlety, like a palantir.
By @rmholt - 6 months
While the author said that C's mmap suffers the same issue, I would argue C's mmap is fine, because C doesn't have async. The issue arises from the mmap crate not having an async read, and from confusion about how async works.
By @colonwqbang - 6 months
Function calls are also blocking IO then because executables and libraries are mmapped.
By @PaulHoule - 6 months
No secret. Reading from memory is synchronous and always has been, at least in a normal computer. (Sometimes I think of how you could fit a fancy memory controller in a transport triggered architecture but that’s something different)
By @charleshn - 6 months
See also the classic "Are You Sure You Want to Use MMAP in Your Database Management System?" which mentions this common pitfall of mmap, and others, in the context of DBMS.

[0] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf

By @lowbloodsugar - 6 months
- register

- shadow register

- L1

- L2

- L3

- RAM

- GPU/SPU/RSP

- SSD

- Network

- HDD

The line is drawn depending on what you are doing and how.

Edit: moved Network above HDD. :-)

By @pengaru - 6 months
water is wet