September 7th, 2024

Asynchronous IO: the next billion-dollar mistake?

Asynchronous IO enables multiple operations without blocking threads, addressing performance issues. The author questions its prioritization over improving OS thread efficiency, suggesting reliance will continue until threading models improve.

Asynchronous IO: the next billion-dollar mistake?

Asynchronous IO, or non-blocking IO, allows applications to perform multiple IO operations without blocking the calling OS thread, addressing the C10K problem that arose with the increasing internet traffic in the late 1990s and early 2000s. While this technique has gained traction, with languages like Go and Erlang integrating it directly, and others like Rust relying on libraries, it presents challenges. Not all IO operations can be performed asynchronously, particularly file IO on Linux, necessitating alternative strategies. The author questions whether the focus on asynchronous IO over improving OS thread efficiency has been a mistake, likening it to Tony Hoare's critique of NULL pointers as a "billion-dollar mistake." The argument posits that if OS threads were more efficient, developers could simply use many threads for blocking operations, simplifying the programming model and reducing the need for complex asynchronous mechanisms. The current high cost of spawning OS threads and context switching complicates this, leading to a reliance on asynchronous IO. The author concludes that until a new operating system emerges that significantly enhances thread performance, the industry will remain dependent on asynchronous IO.

- Asynchronous IO allows handling multiple connections without blocking threads.

- The technique has become essential due to the limitations of OS thread performance.

- Not all IO operations can be performed asynchronously, particularly file IO.

- The author questions if the focus on asynchronous IO was a mistake compared to improving OS thread efficiency.

- Current reliance on asynchronous IO may persist until a more efficient threading model is developed.
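
To make the thesis concrete, here is a minimal sketch of the blocking, thread-per-connection model the author argues should be sufficient, written with Rust's standard library; the address and the echo behaviour are arbitrary choices for illustration.

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    // One OS thread per connection; every read and write simply blocks.
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            loop {
                match stream.read(&mut buf) {
                    Ok(0) | Err(_) => break, // peer closed the connection or the read failed
                    Ok(n) => {
                        if stream.write_all(&buf[..n]).is_err() {
                            break;
                        }
                    }
                }
            }
        });
    }
    Ok(())
}
```

The article's argument is that if spawning and scheduling such threads were cheap enough, this style could scale to hundreds of thousands of connections without an async runtime.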

58 comments
By @winternewt - 8 months
> Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads such that one can easily use hundreds of thousands of OS threads without negatively impacting performance

I actually can't imagine how that would ever be accomplished at the OS level. The fact that each thread needs its own stack is an inherent limiter for efficiency, as switching stacks leads to cache misses. Asynchronous I/O has an edge because it only stores exactly as much state as it needs for its continuation, and multiple tasks can have their state in the same CPU cache line. The OS doesn't know nearly enough about your program to optimize the stack contents to only contain the state you need for the remainder of the thread.

But at the programming language level the compiler does have insight into the dependencies of your continuation, so it can build a closure that has only what it needs to have. You still have asynchronous I/O at the core but the language creates an abstraction that behaves like a synchronous threaded model, as seen in C#, Kotlin, etc. This doesn't come without challenges. For example, in Kotlin the debugger is unable to show contents of variables that are not needed further down in the code because they have already been removed from the underlying closure. But I'm sure they are solvable.
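
A small Rust sketch of that idea (the names are made up for illustration): the compiler lowers an async fn into a state machine that keeps only what is live across an .await, so the suspended "stack" is just a small value rather than an OS thread stack.

```rust
async fn fake_download() -> String {
    // Stand-in for real IO so the example stays self-contained.
    String::from("hello")
}

async fn example() -> usize {
    let scratch = vec![0u8; 4096];                    // only needed before the await
    let checksum: usize = scratch.iter().map(|&b| b as usize).sum();
    drop(scratch);                                    // gone before the suspension point

    let body = fake_download().await;                 // only `checksum` must survive the await

    checksum + body.len()
}

fn main() {
    // Creating the future does not run it, so no runtime is needed just to inspect it.
    let fut = example();
    println!("suspended state: {} bytes", std::mem::size_of_val(&fut));
}
```

The printed size is typically a few dozen bytes, which is the cache-friendliness advantage described above.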

By @haileys - 8 months
Asynchronous IO isn't about efficiency.

The approach the author takes with their language is just threads, but scheduled in userland. This model allows a decoupling of the performance characteristics of runtime threads from OS threads - which can sometimes be beneficial - but essentially, the programming model is fundamentally still synchronous.

Asynchronous programming with async/await is about revealing the time dimension of execution as a first class concept. This allows more opportunities for composition.

Take cancellation for example: cancelling tasks under the synchronous programming model requires passing a context object through every part of your code that might call down into an IO operation. This context object is checked for cancellation at each point a task might block, and checked when a blocking operation is interrupted.

Timeouts are even trickier to do in this model, especially if your underlying IO only allows you to set per-operation timeouts and you're trying to expose a deadline-style interface instead.

Under the asynchronous model, both timeouts and cancellation simply compose. You take a future representing the work you're doing, and spawn a new future that completes after sleeping for some duration, or spawn a new future that waits on a cancel channel. Then you just race these futures. Take whichever completes first and cancel the other.
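
For instance, here is roughly what that race looks like in Rust, assuming the Tokio runtime (the work itself is a made-up placeholder):

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Placeholder for whatever future represents the actual work.
async fn do_work() -> u64 {
    sleep(Duration::from_millis(50)).await;
    42
}

#[tokio::main]
async fn main() {
    // Race the work against a sleeping future; the loser is dropped, i.e. cancelled.
    tokio::select! {
        result = do_work() => println!("finished: {result}"),
        _ = sleep(Duration::from_secs(1)) => println!("timed out"),
    }

    // The same composition packaged as a combinator.
    match timeout(Duration::from_secs(1), do_work()).await {
        Ok(result) => println!("finished: {result}"),
        Err(_) => println!("timed out"),
    }
}
```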

Having done a lot of programming under both paradigms, the synchronous model is so much more clunky and error-prone to work with and involves a lot of tedious manual work, like passing context objects around, that simply disappears under the asynchronous model.

By @HippoBaro - 8 months
I am not sure I buy the underlying idea behind this piece, that somehow a lot of money/time has been invested into asynchronous IO at the expense of thread performance (creation time, context switch time, scheduler efficiency, etc.).

First, significant work has been done in the kernel in that area simply because any gains there massively impact application performance and energy efficiency, two things the big kernel sponsors deeply care about.

Second, asynchronous IO in the kernel has actually been underinvested in for years. Async disk IO did not exist at all for years until AIO came to be. And even that was a half-baked, awful API no one wanted to use except for some database people who needed it badly enough to be willing to put up with it. It's a somewhat recent development that really fast, genuinely async IO has taken center stage through io_uring and the likes of AF_XDP.

By @nasretdinov - 8 months
Somewhat controversial take: the current threads implementation is usually already performant enough for most use cases. The actual reason we don't use them to handle more than a few thousand concurrent operations is that, at least in Linux, threads are scheduled and treated very similarly to processes. E.g. if a single process with 3000 threads gets bottlenecked on some syscall, your system load average becomes 3000, and essentially no other processes can run well on the same machine.

Another issue with thread performance is that threads are visible to most system tools like `ps`, so having too many threads starts to affect operations _outside_ the kernel, e.g. many monitoring tools.

So that's the main reason why user-space scheduling became so popular: it hides the "threads" from the system, allowing for processes to be scheduled more fairly (preventing stuff like reaching LA 3000 when writing to 3000 parallel connections), and not affecting performance of the system infrastructure around the kernel.

BTW, thread stacks, like everything else in Linux, are allocated lazily, so if you only use ~4 KB of stack in a thread it won't lead to an RSS of the full 8 MB. It contributes to VMEM, but not RSS.
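
A tiny Rust sketch of the setup being described, spawning many threads with a deliberately small requested stack; it does not measure RSS itself, but it is the scenario where lazy stack allocation keeps resident memory proportional to what each thread actually touches.

```rust
use std::thread;

fn main() {
    // Request a 16 KiB stack per thread instead of the default ~8 MiB reservation.
    // Linux backs stack pages lazily, so resident memory tracks actual usage.
    let handles: Vec<_> = (0..1_000u64)
        .map(|i| {
            thread::Builder::new()
                .stack_size(16 * 1024)
                .spawn(move || i * 2)
                .expect("failed to spawn thread")
        })
        .collect();

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}");
}
```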

By @hinkley - 8 months
> More specifically, what if instead of spending 20 years developing various approaches to dealing with asynchronous IO (e.g. async/await), we had instead spent that time making OS threads more efficient, such that one wouldn't need asynchronous IO in the first place?

This is still living in an antiquated world where IO was infrequent and contained enough that one blocking call per thread still made reasonable forward progress. When you're making three separate calls and correlating the data between them, having the entire thread blocked for each call is still problematic.

Linux can handle far more threads than Windows and it still employs io_uring. Why do you suppose that is?

One little yellow box about it is not enough to defend the thesis of this article.

By @BenoitP - 8 months
> Now imagine a parallel universe where instead of focusing on making asynchronous IO work

Funny choice of words. In the JVM world, Ron Pressler's first foray into fibers, Quasar, came from a company named "Parallel Universe". It worked with a Java agent manipulating bytecode. Then Ron went to Oracle and now we have Loom, i.e. a virtual thread that gets unmounted at each async IO request.

Java's Loom is not even mentioned in the article. I wonder: does "parallel universe" appear in another foundational paper calling for a lightweight thread abstraction?

https://docs.paralleluniverse.co/quasar/

Anyway, yes we need sound abstractions for async IO

By @dwaite - 8 months
I don't quite agree with this piece, as it is comparing apples and oranges.

What you want is patterns for having safety, efficiency and maintainability for concurrent and parallelized processing.

One early pattern for doing that was codified as POSIX threads - continue the blocking processing patterns of POSIX so that you can have multiple parallelizable streams of execution with primitives to protect against simultaneous use of shared resources and data.

io_uring is not such a pattern. It is a kernel API. You can try to use it directly, but you can also use it as one component in userland thread systems, in actor systems, in structured concurrency systems, etc.

So the author is seemingly comparing the shipped pattern (threads) vs direct manipulation, and complaining that the direct manipulation isn't as safe or maintainable. It wasn't meant to be.

By @necovek - 8 months
To put a different spin on what others are saying, asynchronous IO is a different programming model for concurrency that's actually more ergonomic and easier to get right for an average developer (which includes great developers on their not-highly-focused days).

Dealing with raciness, deadlocks and starvation is simply hard, especially when you are focused on solving a different but also hard business problem.

That's also why RDBMSes had and continue to have such a success: they hide this complexity behind a few common patterns and a simple language.

Now, I do agree that languages that suffer from the "color of your functions" problems didn't get it right (Python, for instance). But ultimately, this is an easier mental model, and it's been present since the dawn of purely functional languages (nothing stops a Lisp implementation from doing async IO, and it might only be non-obvious how to do "cancellation" while "gather" is natural too)

By @p1necone - 8 months
Async/await is a language semantics thing. It's not really relevant whether there's a "real" OS thread under the hood, some language-level green thread system, or just the current process blocking on something - the syntax exists because sometimes you don't want to block on things that take a long time, semantically - i.e. you want the next line of code to run immediately.

You could absolutely write a language where the blocking on long running tasks was implicit and instead there was a keyword for when you don't want to block, but the programmer doesn't really need to care about the underlying threading system.
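
A minimal Rust/Tokio sketch of that semantic distinction (the task bodies are placeholders): awaiting blocks the current task at that line, while spawning lets the next line run immediately.

```rust
use tokio::task;

// Made-up unit of work, just to have something to run.
async fn step(name: &'static str) -> usize {
    name.len()
}

#[tokio::main]
async fn main() {
    // Semantically blocking: the next line does not run until `step` finishes.
    let first = step("load config").await;

    // Semantically non-blocking: the task runs in the background and the
    // next line executes immediately.
    let handle = task::spawn(step("warm cache"));

    // ... other work can happen here ...

    let second = handle.await.expect("task panicked");
    println!("{}", first + second);
}
```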

By @jeffreygoesto - 8 months
Some answer was already posted here:

https://utcc.utoronto.ca/~cks/space/blog/tech/OSThreadsAlway...

https://news.ycombinator.com/item?id=41472027

To me the article reads as if the programming language author wants to push a difficult problem out of his language without deeper analysis. As if it would be easier if it was somebody else's problem.

By @torginus - 8 months
Asynchronous IO is just simply how the world works. Instead, the idea that changes happen only during CPU computation is the mistake. Your disk drive/network card exists in parallel to your CPU and can process stuff concurrently. Your CPU very likely has a DMA engine that works in parallel without consuming a hardware thread.
By @sedatk - 8 months
Many synchronous I/O operations under the hood are just async I/O + blocking waits, at least that's the case with Windows. Why? Because all I/O is inherently async. Even polling I/O requires timed waits which also makes it async.

That said, I like the async programming model in general, not just for I/O. It makes it quite easy to model your software as separately flowing operations that need to be synchronized occasionally. Some tasks need to run in parallel? Then you just wait for them later.

I also like the channel concept of Golang and D in the same manner, but I heard it brought up some problems that the async/await model didn't have. Can't remember what it was now. Maybe they are more susceptible to race conditions? Not sure.

By @seanhunter - 8 months
> "Not only would this offer an easier mental model for developers..."

Translation: "I find async i/o confusing and all developers are like me".

This argument has been going on for over 20 years at this point. There are some people who think having pools of threads polling is a natural way of thinking about IO. They keep waiting for the day this becomes an efficient way to do IO.

By @CJefferson - 8 months
I generally agree with this article.

There are programs where async IO is great, but in my experience it stops being useful as your code “does more stuff”.

The few large-scale async systems I've worked with ended up with functions taking too long, so you use the ability to spin functions off into thread pools, then async-wait for their return, at which point you often end up with the worst of both threads and async.
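
That pattern, sketched with Rust and Tokio under the assumption of a spawn_blocking-style thread pool (the workload is invented for the example):

```rust
use tokio::task;

// Stand-in for a function that has grown too slow to run on the async executor.
fn expensive_report() -> String {
    (0..1_000_000u64).map(|i| i.wrapping_mul(i)).sum::<u64>().to_string()
}

#[tokio::main]
async fn main() {
    // Spin the function off onto the blocking thread pool, then async-wait for it:
    // threads and async, stacked on top of each other.
    let report = task::spawn_blocking(expensive_report)
        .await
        .expect("blocking task panicked");
    println!("{report}");
}
```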

By @alexgartrell - 8 months
> File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads.

AIO has existed for a long time. A lot longer than io_uring.

I think the thing that the author misses here is that the majority of IO that happens is actually interrupt driven in the first place, so async io is always going to be the more efficient approach.

The author also misses that scheduling threads efficiently from a kernel context is really hard. Async io also confers a benefit in terms of “data scheduling.” This is more relevant for workloads like memcached.
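
From the application's point of view, the thread-pool fallback mentioned in the quoted passage looks something like this (a Tokio-based sketch; the file path is arbitrary):

```rust
use tokio::fs;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // The call looks asynchronous, but on Linux runtimes such as Tokio typically
    // service it by handing the blocking read to a pool of OS threads.
    let bytes = fs::read("/etc/hostname").await?;
    println!("read {} bytes", bytes.len());
    Ok(())
}
```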

By @csomar - 8 months
> Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads such that one can easily use hundreds of thousands of OS threads without negatively impacting performance

Isn't that why async I/O was created in the first place?

> Just use 100 000 threads and let the OS handle it.

How does the OS handle it? How does the OS know whether to give it CPU time or not?

I was expecting something from the OP (like a new networking or multi-threading primitive) but I have a feeling he lacks an understanding of how networking and async I/O works.

By @Someone - 8 months
FTA: “Need to call a C function that may block the calling thread? Just run it on a separate thread, instead of having to rely on some sort of mechanism provided by the IO runtime/language to deal with blocking C function calls.”

And then? How do you know when your call completed without “some sort of mechanism provided by the IO runtime/language”? Yes, you periodically ask the OS whether that thread completed, but that doesn’t come for free and is far from elegant.

There are solutions. The cheapest, resource-wise, are I/O completion callbacks. That's what "System" had on the original Mac in 1984, and there likely were even smaller systems before that which had them.

Easier for programmers would be something like what we now have with async/await.

It might not be the best option, but AFAICT, this article doesn’t propose a better one. Yes, firing off threads is easy, but getting the parts together the moment they’re all available isn’t.
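
A standard-library Rust sketch of the point: even the "just run it on a separate thread" approach needs a completion mechanism, here a channel the worker signals when it is done (the blocking call is simulated).

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for a C function that blocks the calling thread.
fn blocking_c_call() -> i32 {
    thread::sleep(Duration::from_millis(100));
    7
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Run the blocking call on its own thread and signal completion over the channel.
    thread::spawn(move || {
        let _ = tx.send(blocking_c_call());
    });

    // The caller still needs *some* way to learn that the call finished.
    match rx.recv_timeout(Duration::from_secs(1)) {
        Ok(result) => println!("call completed with {result}"),
        Err(_) => println!("gave up waiting"),
    }
}
```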

By @nullindividual - 8 months
We do live in the universe of high performance threads with asynchronous I/O.

The author is looking for Windows NT.

By @pizza - 8 months
Correct me if I'm wrong, but Microsoft's DirectStorage seems to me something like what the author is writing about. It lets you do, e.g., massively parallel NVMe file IO on lots of small files from the GPU itself. This avoids the delay of the path through the CPU and any extra threads/saturation of the CPU, and even lets you do decompression of game assets on the GPU itself, thereby saving even more CPU. This demo benchmark shows DEFLATE going from 1 GB/s on the CPU to 7 GB/s on the GPU: https://github.com/microsoft/DirectStorage/tree/main/Samples...
By @jiggawatts - 8 months
There are fundamental reasons for OS threads being slow, mostly to do with processor design. Changing the silicon would be hundreds of times more expensive than solving the problem in software in user mode.

This is a billion-dollar solution to a hundred-billion dollar problem.

By @chucke - 8 months
My interpretation of what the author wants is essentially lightweight threads in the kernel, standardised à la POSIX, that every proglang could use as a primitive.

That'd be sweet if this were a well-understood problem. Unfortunately, we're still finding the sweet spot between I/O vs CPU-bound tasks, with "everything is a file" clashing with async network APIs and mostly sync file APIs, and sending that research into the kernel would mean having improvements widely distributed in 5 years or more, and would set back the industry decades, if not centuries. We learned this much already with the history of TCP and the decision to keep QUIC in userspace.

By @_davide_ - 8 months
What about memory? The real price of threads is the stack.

Even when perfectly optimized, it wouldn't be enough to handle serious workloads.

By @pyrolistical - 8 months
What if the author’s proposed solution is the billion dollar mistake?

IMO the best programming paradigms are when the abstractions are close to the hardware.

Instead of pretending to have unlimited cores, what if, as part of the runtime, we are given exactly one thread per core? As programmers, we are responsible for utilizing all the cores and passing data around.

It is then up to the operating system to switch entire sets of cores over different processes.

This removes the footgun of a process overloading a computer with too many threads. Programmers need to consider how to best distribute work over a finite number of cores.
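
A rough Rust sketch of that one-thread-per-core model, using the standard library; the workload (summing a range) is a placeholder.

```rust
use std::thread;

fn main() {
    // Exactly one worker per core; the program, not the OS, decides how work is split.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1) as u64;
    let total: u64 = 1_000_000;
    let chunk = total / cores;

    let handles: Vec<_> = (0..cores)
        .map(|c| {
            let start = c * chunk;
            let end = if c == cores - 1 { total } else { start + chunk };
            // Each worker owns its slice of the work; results are combined at the end.
            thread::spawn(move || (start..end).sum::<u64>())
        })
        .collect();

    let sum: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{sum}");
}
```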

By @Szpadel - 8 months
I think async vs threads is about a completely different trade-off. Nowadays all operating systems do preemptive scheduling, but green threads (and, by extension, all async) use cooperative scheduling. I believe most of the discussion here is actually about the pros and cons of those scheduling models.

One exception is, I think, the cancellation model, but Rust is the only runtime I'm aware of that does it that way; all other runtimes will happily run your green thread until it finishes or cancels by itself, just as you do with synchronous code.

By @orf - 8 months
Computers are inherently asynchronous, we just plaster over that with synchronous interfaces.

We put a lot of effort into maintaining these synchronous facades - from superscalar CPUs translating assembly instructions into “actual” instructions and speculatively executing them in parallel to prevent stalls, to the kernel with preemptive scheduling, threads and their IO interfaces, right up to user-space and the APIs they provide on top of all this.

Surely there has to be a better way? It seems ridiculous.

By @mmis1000 - 8 months
I think comparing a programming pattern (threads) to kernel async APIs is a questionable comparison.

The point of kernel async APIs is not to let programmers write system calls directly. It's to expose the actual asynchronous operations under the hood (it could be the disk, the network, anything outside of the computer case).

Those actions were never meant to be interleaved with CPU computation, because they usually come with millisecond-level delays (which could be millions of CPU ticks). The kernel fakes them into sync calls by pausing everything, but that isn't always the best idea.

Letting the userland program decide what to do with the delay would be a far better idea, even if it eventually just reinvents blocking IO calls. The program can still decide which operations are most relevant to it, instead of letting the kernel guess.

By @bob1029 - 8 months
> the cost to start threads is lower, context switches are cheaper, etc.

Physics would have a word with this one. We are already pushing limits of what is possible with latency between cores vs overall system performance. There isn't an order of magnitude improvement hiding in there anywhere without some FTL communication breakthrough. In theory, yes we could sweep this problem under the rug of magical, almost-free threads. But these don't really exist.

I think the best case for performance is to accumulate mini batches of pending IO somewhere and then handle them all at once. IO models based upon ring buffers are probably getting close to the theoretical ideal when considering how our CPUs work internally (cache coherency, pipelining, etc).
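
The batching idea, reduced to its simplest form in standard-library Rust (no io_uring involved): accumulate pending writes and hand the whole batch to the kernel in one call rather than one syscall each.

```rust
use std::io::{IoSlice, Write};

fn main() -> std::io::Result<()> {
    // A mini batch of pending writes.
    let pending: Vec<Vec<u8>> = (0..64)
        .map(|i| format!("message {i}\n").into_bytes())
        .collect();

    // Submit the batch with a single vectored write. A real implementation would
    // loop, since the kernel may accept fewer bytes than the full batch.
    let slices: Vec<IoSlice> = pending.iter().map(|b| IoSlice::new(b)).collect();
    let mut out = std::io::stdout().lock();
    let written = out.write_vectored(&slices)?;
    out.flush()?;
    drop(out);

    eprintln!("submitted {written} bytes in one batch");
    Ok(())
}
```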

By @josefrichter - 8 months
Isn’t that parallel universe actually the Erlang BEAM?
By @scarnie - 8 months
The author appears to contradict their own argument by presenting languages such as Go, Erlang, or their own toy language. These languages hide the async/await constructs that are explicit in languages like Rust, Swift or TypeScript. The former languages and runtimes have no function-colouring problems when working within their own SDKs. There are trade-offs, and these "async" languages tend to be a bit more awkward when interacting with OS frameworks.
By @GolDDranks - 8 months
Then again, threads are a primitive API for concurrency. I hope that structured concurrency becomes more common. How they are implemented should be more or less an implementation detail. https://vorpus.org/blog/notes-on-structured-concurrency-or-g...
By @weinzierl - 8 months
"Not every IO operation can be performed asynchronously though. File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads."

Can someone explain, why this would be the case?

- Why can't every IO op be async?

- Why is file IO on Linux not async?

- What does iouring have to do with it?

By @sph - 8 months
Async IO is hilarious in a universe where Hewitt's actors or even Hoare's CSP exist.

They are a subpar technique that works well with languages with semantics from the 1970s that do not have communication primitives, in the age of multicore and the Internet.

The saddest thing is the most hyped language of the decade went all in with this miserable idea, and turned me completely off the ecosystem.

By @algobro - 8 months
The 1990s called. They want their threads vs events debates back.

Today's processors are fast enough to serve many useful workloads with a single core. The benefit of the async abstraction outweighs the performance benefit in the majority of cases.

And debugging multithreaded code is way harder than async code, especially if it's the kind of program that needs stepping into.

By @ivanjermakov - 8 months
Async IO is not only about creating sockets and spawning threads. The idea of async IO is that the world is not controlled by your CPU. There are network, storage, and sound devices that might (and will) take time to produce a result, and the CPU has to wait for them.

I feel like there is a big misunderstanding about what async IO is and what problem it solves.

By @dfgdfg34545456 - 8 months
The post seems to assume that multi-threaded code is easy to build and maintain. From my experience it is horrible: every new thread means going from n bugs to n² bugs. As a programmer I prefer async constructs in languages, and do not want to spin up and manage threads and all the state synchronisation that involves.
By @anonymoushn - 8 months
It's not clear that context switches can be made sufficiently cheap on fast CPUs without disabling mitigations for side-channel attacks. So the idea of making OS threads comparably performant to goroutines, rust async, or any implementation of cooperative multithreading seems impractical.
By @nurettin - 8 months
In several projects I switched from async code to threads and mpsc queues. Timers, data streams, external communications all run in their threads. They pass their messages to the main thread's queue. The entire thing suddenly became much easier to reason about and read.
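
A small Rust sketch of that architecture (the event types and payloads are invented): worker threads own their IO and push messages into a single queue drained by the main thread.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Messages the worker threads send to the main thread's queue.
enum Event {
    Tick(u32),
    Data(String),
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // A timer thread.
    let timer_tx = tx.clone();
    thread::spawn(move || {
        for i in 0..3 {
            thread::sleep(Duration::from_millis(100));
            let _ = timer_tx.send(Event::Tick(i));
        }
    });

    // A "data stream" thread.
    thread::spawn(move || {
        for msg in ["hello", "world"] {
            let _ = tx.send(Event::Data(msg.to_string()));
        }
    });

    // The main thread owns all the state and simply drains its queue;
    // the loop ends once every sender has been dropped.
    for event in rx {
        match event {
            Event::Tick(i) => println!("tick {i}"),
            Event::Data(s) => println!("data: {s}"),
        }
    }
}
```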
By @neonsunset - 8 months
No, it isn't; the author is just confused, as is usually the case.

It has worked great in C# since its introduction, for task interleaving, composition, cancellation (worse languages call it structured concurrency) and switching to a privileged context (the UI thread), and it will work even better in .NET 10.

By @praptak - 8 months
There's an effort by Google to address this with userspace threads. I think it has stalled though.

HN discussion thereof: https://news.ycombinator.com/item?id=23964633

By @quercus - 8 months
I thought this article was going to talk about async IO's potential to cause insidious bugs that you don't notice until they're crashing jet planes. That's definitely a topic worthy of a blog post.
By @ruduhudi - 8 months
With async IO you can do stuff concurrently without doing it in parallel, which is a desirable thing for many workloads. With only threads you'd need sync primitives all over the place.
By @Brian_K_White - 8 months
Take the title at face value. If it were a sincere question the sincere answer is simply no.

The world actually is concurrent and asynchronous regardless how inconvenient that is for a programmer.

By @snickerbockers - 8 months
This is extremely myopic. There is not a 1:1 correspondence between using asynchronous IO and using one thread per file. Asynchronous IO lets you dodge thread-safety mechanisms like semaphores.

Not all of us are trying to write a webapp or whatever, some of us just need to load a lot of data from several descriptors without serializing all the blocking operations.

>Not every IO operation can be performed asynchronously though. File IO is perhaps the best example of this (at least on Linux). To handle such cases, languages must provide some sort of alternative strategy such as performing the work in a dedicated pool of OS threads.

Uhhh, this is just wrong; file IO can definitely be done asynchronously, on Linux, and without language support.

By @rurban - 8 months
On the contrary blocking IO (and POSIX) is the old billion dollar mistake.

It's inefficient and blocks concurrency safety.

By @dudeinjapan - 8 months
This article is a billion dollar mistake.
By @NinoScript - 8 months
Interesting take. I’d like to see such an OS, I wonder what requirements/limitations it would have.
By @forty - 8 months
Maybe just use Nodejs? Everything is async by default and the API is not annoying or difficult to use
By @anonymoushn - 8 months
Do Mac and Windows really fail to expose some file I/O operations with kqueue and IOCP?
By @stuaxo - 8 months
20 years ago we had the world of threads and it was very easy to get into a mess.
By @spullara - 8 months
Synchronous IO has always been more efficient. Anyone that thought otherwise doesn't understand how complicated context switches are in CPUs. The benefit of async io has always been handling tons of idle connections.
By @throwaway81523 - 8 months
File io can now be done asynchronously with io_uring.
By @a-dub - 8 months
you should be able to reason synchronously and a good computer system would handle the rest.
By @OutOfHere - 8 months
> Languages such as Go and Erlang bake support for asynchronous IO directly into the language, while others such as Rust rely on third-party libraries such as Tokio.

This is so wrong. Go and Erlang have message passing, not async. Message passing is its own thing; it should not be mixed with threading or async.