Asynchronous IO: the next billion-dollar mistake?
Asynchronous IO lets an application run many IO operations concurrently without blocking OS threads, addressing scalability problems such as C10K. The author questions its prioritization over improving OS thread efficiency, suggesting reliance on it will continue until threading models improve.
Asynchronous IO, or non-blocking IO, allows applications to perform multiple IO operations without blocking the calling OS thread, addressing the C10K problem that arose with the increasing internet traffic in the late 1990s and early 2000s. While this technique has gained traction, with languages like Go and Erlang integrating it directly, and others like Rust relying on libraries, it presents challenges. Not all IO operations can be performed asynchronously, particularly file IO on Linux, necessitating alternative strategies. The author questions whether the focus on asynchronous IO over improving OS thread efficiency has been a mistake, likening it to Tony Hoare's critique of NULL pointers as a "billion-dollar mistake." The argument posits that if OS threads were more efficient, developers could simply use many threads for blocking operations, simplifying the programming model and reducing the need for complex asynchronous mechanisms. The current high cost of spawning OS threads and context switching complicates this, leading to a reliance on asynchronous IO. The author concludes that until a new operating system emerges that significantly enhances thread performance, the industry will remain dependent on asynchronous IO.
- Asynchronous IO allows handling multiple connections without blocking threads.
- The technique has become essential due to the limitations of OS thread performance.
- Not all IO operations can be performed asynchronously, particularly file IO.
- The author questions if the focus on asynchronous IO was a mistake compared to improving OS thread efficiency.
- Current reliance on asynchronous IO may persist until a more efficient threading model is developed.
Related
Synchronous Core, Asynchronous Shell
A software architecture concept, "Synchronous Core, Asynchronous Shell," combines functional and imperative programming for clarity and testing. Rust faces challenges integrating synchronous and asynchronous parts, prompting suggestions for a similar approach.
Synchronous Core, Asynchronous Shell
Gary Bernhardt proposed a Synchronous Core, Asynchronous Shell model in software architecture, blending functional and imperative programming. Rust faces challenges integrating sync and async functions, leading to a trend of adopting this model for clarity and control.
I avoid async/await in JavaScript
Cory argues that async/await complicates JavaScript code, obscures optimization opportunities, increases cognitive load in error handling, and suggests promises are cleaner and more manageable for asynchronous operations.
Async hazard: MMAP is blocking IO
Memory-mapped I/O can cause blocking I/O in asynchronous programming, leading to performance issues. Conventional I/O methods outperform it unless data is cached in memory, highlighting risks in concurrent applications.
Async2 – The .NET Runtime Async experiment concludes
The .NET team's async2 experiment aims to enhance async/await efficiency by shifting management to the runtime, improving performance and exception handling, but may take years to become production-ready.
I actually can't imagine how that would ever be accomplished at the OS level. The fact that each thread needs its own stack is an inherent limiter for efficiency, as switching stacks leads to cache misses. Asynchronous I/O has an edge because it only stores exactly as much state as it needs for its continuation, and multiple tasks can have their state in the same CPU cache line. The OS doesn't know nearly enough about your program to optimize the stack contents to only contain the state you need for the remainder of the thread.
But at the programming language level the compiler does have insight into the dependencies of your continuation, so it can build a closure that holds only what it needs. You still have asynchronous I/O at the core, but the language creates an abstraction that behaves like a synchronous threaded model, as seen in C#, Kotlin, etc. This doesn't come without challenges. For example, in Kotlin the debugger is unable to show the contents of variables that are not needed further down in the code, because they have already been removed from the underlying closure. But I'm sure those problems are solvable.
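A small illustration of that point in Rust, which does the same closure-building at compile time (a sketch; the exact sizes are compiler-dependent and `read_byte` is a made-up stand-in): the compiler lays out an async fn's state machine from the variables that are live across each `.await`, so a large local that is dead before the suspend point need not be stored in the future at all.

```rust
use std::mem::size_of_val;

// Stand-in for an async IO call; it exists only to give the functions
// below a suspend point. No runtime is needed just to measure sizes.
async fn read_byte() -> u8 {
    0
}

// `buf` is used after the await, so it is live across the suspend point
// and must be stored inside the future.
async fn keeps_buffer() -> u8 {
    let buf = [0u8; 4096];
    let b = read_byte().await;
    buf[b as usize]
}

// `buf` is finished with before the await, so it is not live across the
// suspend point and need not be part of the future's state.
async fn drops_buffer() -> u8 {
    let buf = [0u8; 4096];
    let sum = buf.iter().copied().fold(0u8, u8::wrapping_add);
    read_byte().await.wrapping_add(sum)
}

fn main() {
    // Calling an async fn only constructs the future; nothing runs yet,
    // so we can inspect how much state each one carries.
    println!("keeps_buffer: {} bytes", size_of_val(&keeps_buffer()));
    println!("drops_buffer: {} bytes", size_of_val(&drops_buffer()));
}
```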
The approach the author takes with their language is just threads, but scheduled in userland. This model decouples the performance characteristics of runtime threads from OS threads - which can sometimes be beneficial - but the programming model is fundamentally still synchronous.
Asynchronous programming with async/await is about revealing the time dimension of execution as a first class concept. This allows more opportunities for composition.
Take cancellation for example: cancelling tasks under the synchronous programming model requires passing a context object through every part of your code that might call down into an IO operation. This context object is checked for cancellation at each point a task might block, and checked when a blocking operation is interrupted.
Timeouts are even trickier to do in this model, especially if your underlying IO only allows you to set per-operation timeouts and you're trying to expose a deadline-style interface instead.
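As a concrete sketch of that context-passing style (plain threads and a hypothetical CancelToken type; no particular library): the flag has to be threaded through every layer and checked around each potentially blocking step.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Hypothetical cancellation token that must be passed down every call
// path that might block.
#[derive(Clone)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::Relaxed);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

// Stand-in for a blocking IO call.
fn blocking_read_chunk() -> std::io::Result<Vec<u8>> {
    std::thread::sleep(Duration::from_millis(50));
    Ok(vec![0u8; 1024])
}

// Every function on the path to IO has to accept the token and check it
// before (and ideally after) each blocking operation.
fn read_all(token: &CancelToken) -> std::io::Result<Vec<u8>> {
    let mut out = Vec::new();
    for _ in 0..100 {
        if token.is_cancelled() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Interrupted,
                "cancelled",
            ));
        }
        out.extend(blocking_read_chunk()?);
    }
    Ok(out)
}

fn main() {
    let token = CancelToken::new();
    let worker = {
        let token = token.clone();
        std::thread::spawn(move || read_all(&token))
    };
    std::thread::sleep(Duration::from_millis(120));
    token.cancel(); // takes effect only at the next check, not mid-read
    println!("{:?}", worker.join().unwrap().map(|v| v.len()));
}
```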
Under the asynchronous model, both timeouts and cancellation simply compose. You take a future representing the work you're doing, and spawn a new future that completes after sleeping for some duration, or spawn a new future that waits on a cancel channel. Then you just race these futures. Take whichever completes first and cancel the other.
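A minimal sketch of that racing pattern, assuming a tokio runtime (do_work and the durations are made up):

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::sleep;

// Stand-in for the actual work; any future would do.
async fn do_work() -> u64 {
    sleep(Duration::from_secs(5)).await;
    42
}

#[tokio::main]
async fn main() {
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();

    // Something elsewhere can decide to cancel us.
    tokio::spawn(async move {
        sleep(Duration::from_millis(100)).await;
        let _ = cancel_tx.send(());
    });

    // Race the work against a timeout and the external cancel signal.
    tokio::select! {
        result = do_work() => println!("finished: {result}"),
        _ = sleep(Duration::from_secs(1)) => println!("timed out"),
        _ = cancel_rx => println!("cancelled"),
    }
}
```

In Rust, dropping the futures that lose the race is what cancels them; no context parameter has to be threaded through do_work.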
Having done a lot of programming under both paradigms, the synchronous model is so much more clunky and error-prone to work with and involves a lot of tedious manual work, like passing context objects around, that simply disappears under the asynchronous model.
First, significant work has been done in the kernel in that area simply because any gains there massively impact application performance and energy efficiency, two things the big kernel sponsors deeply care about.
Second, asynchronous IO in the kernel has actually been underinvested in for years. Async disk IO did not exist at all until AIO came to be. And even that was a half-baked, awful API no one wanted to use except for some database people who needed it badly enough to be willing to put up with it. It's a somewhat recent development that really fast, genuinely async IO has taken center stage through io_uring and the likes of AF_XDP.
Another issue with thread performance is that threads are visible to most system tools like `ps`, so having too many threads starts to affect operations _outside_ the kernel, e.g. many monitoring tools.
That's the main reason user-space scheduling became so popular: it hides the "threads" from the system, allowing processes to be scheduled more fairly (preventing things like the load average hitting 3000 when writing to 3000 parallel connections) and not affecting the performance of the system infrastructure around the kernel.
BTW, thread stacks, like everything else in Linux, are allocated lazily, so if you only use about 4 KB of stack in a thread it won't lead to an RSS of the full 8 MB. It will contribute to virtual memory (VMEM), but not RSS.
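A quick way to see that on Linux (a rough sketch; exact figures vary by allocator and kernel): spawn a batch of mostly idle threads with explicit 8 MB stacks and compare VmSize with VmRSS from /proc/self/status.

```rust
use std::fs;
use std::thread;
use std::time::Duration;

fn print_mem(label: &str) {
    // Linux-specific: VmSize is virtual address space, VmRSS is resident memory.
    let status = fs::read_to_string("/proc/self/status").unwrap();
    for line in status.lines() {
        if line.starts_with("VmSize") || line.starts_with("VmRSS") {
            println!("{label}: {line}");
        }
    }
}

fn main() {
    print_mem("before");
    let handles: Vec<_> = (0..1000)
        .map(|_| {
            thread::Builder::new()
                .stack_size(8 * 1024 * 1024) // mirror the common 8 MB pthread default
                .spawn(|| {
                    // Each thread touches only a little of its stack, then idles.
                    let small = std::hint::black_box([0u8; 1024]);
                    thread::sleep(Duration::from_secs(2));
                    small[0]
                })
                .unwrap()
        })
        .collect();
    print_mem("after spawning 1000 threads");
    for h in handles {
        let _ = h.join();
    }
}
```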
This is still living in an antiquated world where IO was infrequent and contained enough that one blocking call per thread still gave you reasonable forward progress. When you're making three separate calls and correlating the data between them, having the entire thread blocked for each call is still problematic.
Linux can handle far more threads than Windows and it still employs io_uring. Why do you suppose that is?
One little yellow box about it is not enough to defend the thesis of this article.
Funny choice of words. In the JVM world, Ron Pressler's first foray into fibers, Quasar, came out of "Parallel Universe". It worked with a Java agent manipulating bytecode. Then Ron went to Oracle, and now we have Loom, i.e. virtual threads that are unmounted at each async IO request.
Java's Loom is not even mentioned in the article. I wonder, a question for a co-founder: does "parallel universe" appear in any other foundational paper calling for a lightweight thread abstraction?
https://docs.paralleluniverse.co/quasar/
Anyway, yes we need sound abstractions for async IO
What you want is patterns for having safety, efficiency and maintainability for concurrent and parallelized processing.
One early pattern for doing that was codified as POSIX threads - continue the blocking processing patterns of POSIX so that you can have multiple parallelizable streams of execution with primitives to protect against simultaneous use of shared resources and data.
io_uring is not such a pattern. It is a kernel API. You can try to use it directly, but you can also use it as one component in userland threading systems, in actor systems, in structured concurrency systems, etc.
So the author is seemingly comparing the shipped pattern (threads) vs direct manipulation, and complaining that the direct manipulation isn't as safe or maintainable. It wasn't meant to be.
Dealing with raciness, deadlocks and starvation is simply hard, especially when you are focused on solving a different but also hard business problem.
That's also why RDBMSes had and continue to have such a success: they hide this complexity behind a few common patterns and a simple language.
Now, I do agree that languages that suffer from the "color of your functions" problem didn't get it right (Python, for instance). But ultimately this is an easier mental model, and it has been present since the dawn of purely functional languages (nothing stops a Lisp implementation from doing async IO; it may just be less obvious how to do "cancellation", whereas "gather" comes naturally).
You could absolutely write a language where the blocking on long running tasks was implicit and instead there was a keyword for when you don't want to block, but the programmer doesn't really need to care about the underlying threading system.
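Go's `go` statement is the usual example of that design; with ordinary threads the explicit "don't block here" marker is simply a spawn. A minimal sketch with plain std threads (fetch_report is a made-up stand-in for slow IO):

```rust
use std::thread;
use std::time::Duration;

// A blocking call: nothing in the signature or at the call site says so.
fn fetch_report() -> String {
    thread::sleep(Duration::from_millis(200)); // stand-in for slow IO
    "report".to_string()
}

fn main() {
    // Default: just call it and block, like any other function.
    let a = fetch_report();

    // Only when we explicitly don't want to block do we say so.
    let handle = thread::spawn(fetch_report);
    let b = handle.join().unwrap();

    println!("{a} {b}");
}
```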
https://utcc.utoronto.ca/~cks/space/blog/tech/OSThreadsAlway...
https://news.ycombinator.com/item?id=41472027
To me the article reads as if the programming language author wants to push a difficult problem out of his language without deeper analysis. As if it would be easier if it was somebody else's problem.
That said, I like the async programming model in general, not just for I/O. It makes it quite easy to model your software as separately flowing operations that only occasionally need to be synchronized. Some tasks need to run in parallel? Then you just wait for them later.
I also like the channel concept of Golang and D in the same manner, but I heard it brings some problems that the async/await model doesn't have. Can't remember what it was now. Maybe channels are more susceptible to race conditions? Not sure.
Translation: "I find async i/o confusing and all developers are like me".
This argument has been going on for over 20 years at this point. There are some people who think having pools of threads polling is a natural way of thinking about IO. They keep waiting for the day this becomes an efficient way to do IO.
There are programs where async IO is great, but in my experience it stops being useful as your code “does more stuff”.
The few large-scale async systems I've worked with end up with functions taking too long, so you use the ability to spin functions off into thread pools, then async-wait for their return, at which point you often end up with the worst of both threads and async.
AIO has existed for a long time. A lot longer than io_uring.
I think the thing that the author misses here is that the majority of IO that happens is actually interrupt driven in the first place, so async io is always going to be the more efficient approach.
The author also misses that scheduling threads efficiently from a kernel context is really hard. Async io also confers a benefit in terms of “data scheduling.” This is more relevant for workloads like memcached.
Isn't that why async I/O was created in the first place?
> Just use 100 000 threads and let the OS handle it.
How does the OS handle it? How does the OS know whether to give it CPU time or not?
I was expecting something from the OP (like a new networking or multi-threading primitive) but I have a feeling he lacks an understanding of how networking and async I/O works.
And then? How do you know when your call completed without “some sort of mechanism provided by the IO runtime/language”? Yes, you can periodically ask the OS whether that thread has completed, but that doesn't come for free and is far from elegant.
There are solutions. The cheapest, resource-wise, are I/O completion callbacks. That's what "System" had on the original Mac in 1984, and there were likely even smaller systems before it that had them.
Easier for programmers would be something like what we now have with async/await.
It might not be the best option, but AFAICT, this article doesn’t propose a better one. Yes, firing off threads is easy, but getting the parts together the moment they’re all available isn’t.
The author is looking for Windows NT.
This is a billion-dollar solution to a hundred-billion dollar problem.
That'd be sweet if this were a well-understood problem. Unfortunately, we're still finding the sweet spot between I/O- vs CPU-bound tasks, "everything is a file" clashing with async network APIs and mostly sync file APIs, and pushing that research into the kernel would mean waiting 5 years or more for improvements to be widely distributed, and would set the industry back decades, if not centuries. We learned this much already with the history of TCP and the decision to keep QUIC in userspace.
Even when perfectly optimized, it wouldn't be enough to handle serious workloads.
IMO the best programming paradigms are when the abstractions are close to the hardware.
Instead of pretending to have unlimited cores, what if, as part of the runtime, we were given exactly one thread per core? As programmers, we would be responsible for utilizing all the cores and passing data around.
It would then be up to the operating system to switch entire sets of cores over to different processes.
This removes the footgun of a process overloading a computer with too many threads. Programmers need to consider how to best distribute work over a finite number of cores.
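A rough sketch of that shape in Rust (the worker logic and message type are made up; a real runtime would add per-core work queues, pinning, and work stealing or explicit handoff): spawn exactly one worker per core and pass data around over channels.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // One worker per core; the program never asks for more parallelism
    // than the machine actually has.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    let mut senders = Vec::new();
    let mut handles = Vec::new();
    for id in 0..cores {
        let (tx, rx) = mpsc::channel::<u64>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // The worker owns its data; other cores talk to it via messages.
            let mut sum = 0u64;
            for item in rx {
                sum += item;
            }
            println!("worker {id}: {sum}");
            sum
        }));
    }

    // The "scheduler" is just us deciding which core gets which work item.
    for item in 0..1_000u64 {
        senders[item as usize % cores].send(item).unwrap();
    }
    drop(senders); // close the channels so the workers finish

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total: {total}");
}
```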
One exception, I think, is the cancellation model, but Rust is the only runtime I'm aware of that does it that way; all other runtimes will happily run your green thread until it finishes or cancels itself, much like you would with synchronous code.
We put a lot of effort into maintaining these synchronous facades - from superscalar CPUs translating assembly instructions into “actual” instructions and speculatively executing them in parallel to prevent stalls, to the kernel with preemptive scheduling, threads and their IO interfaces, right up to user-space and the APIs they provide on top of all this.
Surely there has to be a better way? It seems ridiculous.
The point of kernel async APIs is not about letting programmers write system calls directly. It's about exposing the actual async operations under the hood (whether that's disk, network, or anything else outside the computer case).
Those operations were never meant to be interleaved with CPU computation, because they usually involve millisecond-level delays (which can be millions of CPU ticks). The kernel fakes them into sync calls by pausing everything, but that isn't always the best idea.
Letting the userland program decide what it wants to do with the delay is a much better idea, even if it eventually just reinvents blocking IO calls. It can still decide which operations are more relevant to it, instead of letting the kernel guess.
Physics would have a word with this one. We are already pushing limits of what is possible with latency between cores vs overall system performance. There isn't an order of magnitude improvement hiding in there anywhere without some FTL communication breakthrough. In theory, yes we could sweep this problem under the rug of magical, almost-free threads. But these don't really exist.
I think the best case for performance is to accumulate mini batches of pending IO somewhere and then handle them all at once. IO models based upon ring buffers are probably getting close to the theoretical ideal when considering how our CPUs work internally (cache coherency, pipelining, etc).
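A small illustration of the batching idea using nothing but the standard library (vectored writes to stdout as a stand-in; a real io_uring setup batches at the submission-queue level instead): accumulate pending buffers and hand them to the kernel in one call rather than issuing one write per item.

```rust
use std::io::{self, IoSlice, Write};

// Buffer pending writes and flush them as a single vectored write,
// instead of paying one write call per small message.
fn flush_batch<W: Write>(out: &mut W, pending: &[Vec<u8>]) -> io::Result<usize> {
    let slices: Vec<IoSlice<'_>> = pending.iter().map(|b| IoSlice::new(b)).collect();
    // write_vectored may perform a short write; a robust caller would
    // loop until every buffer has been written.
    out.write_vectored(&slices)
}

fn main() -> io::Result<()> {
    let mut pending: Vec<Vec<u8>> = Vec::new();
    for i in 0..100u32 {
        pending.push(format!("message {i}\n").into_bytes());
        // Flush once a mini batch has accumulated.
        if pending.len() == 16 {
            flush_batch(&mut io::stdout().lock(), &pending)?;
            pending.clear();
        }
    }
    if !pending.is_empty() {
        flush_batch(&mut io::stdout().lock(), &pending)?;
    }
    Ok(())
}
```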
Slides: https://web.archive.org/web/20200802205544/https://pdxplumbe...
Can someone explain why this would be the case?
- Why can't every IO op be async?
- Why is file IO on Linux not async?
- What does io_uring have to do with it?
They are a subpar technique that works well with languages with semantics from the 1970s that do not have communication primitives, in the age of multicore and the Internet.
The saddest thing is the most hyped language of the decade went all in with this miserable idea, and turned me completely off the ecosystem.
Today's processors are fast enough to serve many useful workloads with a single core. The benefit of the async abstraction outweighs the performance benefit in the majority of cases.
And debugging multithreaded code is way harder than async code, especially if it's the kind of program that needs stepping through.
I feel like there is a big misunderstanding about what async IO is and what problem it solves.
Worked great in C# since its introduction for task interleaving, composition, cancellation (worse languages call it structured concurrency) and switching to privileged context (UI thread), and will work even better in .NET 10.
HN discussion thereof: https://news.ycombinator.com/item?id=23964633
The world actually is concurrent and asynchronous, regardless of how inconvenient that is for a programmer.
Not all of us are trying to write a webapp or whatever, some of us just need to load a lot of data from several descriptors without serializing all the blocking operations.
>Not every IO operation can be performed asynchronously though. File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads.
Uhhh, this is just wrong: file IO can definitely be done asynchronously, on Linux, and without language support.
It's inefficient and blocks concurrency safety.
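For reference, a minimal sketch of the thread-pool offload strategy the quoted passage describes, assuming a tokio runtime (the file path is just an example): the blocking read runs on the runtime's dedicated blocking pool while the async caller merely awaits the result.

```rust
use tokio::task;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // The std read blocks an OS thread, so it is shipped off to the
    // runtime's blocking thread pool instead of a core async worker.
    let contents = task::spawn_blocking(|| std::fs::read("/etc/hostname"))
        .await
        .expect("blocking task panicked")?;

    println!("read {} bytes", contents.len());
    Ok(())
}
```

tokio's own fs module wraps roughly this pattern.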
This is so wrong. Go and Erlang have message passing, not async. Message passing is its own thing; it should not be mixed with threading or async.