Java Virtual Threads: A Case Study
Java Virtual Threads are a new Java concurrency feature. A case study by the Liberty performance engineering team found that they do not outperform Open Liberty's autonomic thread pool on typical cloud-native workloads: virtual threads ramp up faster but deliver lower throughput on CPU-intensive work, and memory-usage reductions are inconsistent. Collaboration with the OpenJDK Community is ongoing.
Java Virtual Threads have been introduced as a significant advancement in Java concurrent programming, aiming to provide a lightweight, scalable, and user-friendly concurrency model. However, a case study conducted by the Liberty performance engineering team found that virtual threads do not offer a clear advantage over Open Liberty's existing autonomic thread pool for typical cloud-native Java workloads. While virtual threads ramp up from idle to maximum throughput more quickly than the thread pool, they deliver lower throughput for CPU-intensive workloads, and their smaller per-thread footprint does not always translate into reduced overall memory usage. Some unexpected performance issues were also identified, prompting collaboration with the OpenJDK Community for further investigation. The study compared Liberty's thread pool against virtual threads on metrics including CPU throughput and ramp-up time across various scenarios. The findings suggest that virtual threads do not necessarily improve performance for CPU-intensive applications on a small number of CPUs, underscoring the importance of evaluating specific use cases before adopting this new Java feature.
Related
Migrating from Java 8 to Java 17 II: Notable API Changes Since Java 8
The article details API changes in Java versions 9 to 17, emphasizing improvements for Java 8 migrations. Changes include null handling, performance enhancements, string improvements, switch expressions, record classes, and utility additions for developer productivity and code readability.
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
Atomicless Per-Core Concurrency
The article explores atomicless concurrency for efficient allocator design, transitioning from per-thread to per-CPU structures on Linux. It details implementing CPU-local data structures using restartable sequences and rseq syscall, addressing challenges in Rust.
Java Structured Concurrency Is More Than ShutdownOnFailure
Java 21 introduces structured concurrency for managing parallel sub-tasks within specific scopes. EnhancedTaskScope offers features like throttling, circuit breakers, default values on failure, and Critical tasks identification. ListTaskScope aids list conversions. Custom features can be added for extended functionality. StructuredTaskScope executes tasks in virtual threads efficiently.
Free-threaded CPython is ready to experiment with
CPython 3.13 introduces free-threading to enhance performance by allowing parallel threads without the GIL. Challenges like thread-safety and ABI compatibility are being addressed for future adoption as the default build.
So it looks like their goal was: try adopting a new technology without changing any of the aspects designed for an old technology and optimised around it.
What "CPU-intensive apps" did they test with? Surely not acmeair-authservice-java. A request does next to nothing: it authenticates a user and generates a token. I thought it would at least connect to some auth provider, but if I understand it correctly, it just uses a test config with a single test user (https://openliberty.io/docs/latest/reference/config/quickSta...), which would not be a blocking call.
If the request tasks don't block, this is not an interesting benchmark. Using virtual threads for non-blocking tasks is not useful.
So, let's hope that some of the tests were with tasks that block. The authors describe that a modest number of concurrent requests (< 10K) didn't show the increase in throughput that virtual threads promise. That's not a lot of concurrent requests, but one would expect an improvement in throughput once the number of concurrent requests exceeds the pool size. Except that may be hard to see because OpenLiberty's default is to keep spawning new threads (https://openliberty.io/blog/2019/04/03/liberty-threadpool-au...). I would imagine that in actual deployments with high concurrency, the pool size will be limited, to prevent the app from running out of memory.
If it never gets to the point where the number of concurrent requests significantly exceeds the pool size, this is not an interesting benchmark either.
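The point about blocking and pool size can be illustrated with a small sketch. This is not the Liberty benchmark; it is a minimal, hypothetical comparison in which every task blocks (simulated with `Thread.sleep` standing in for a slow downstream call). With a fixed platform-thread pool, tasks queue once all workers are blocked; with one virtual thread per task, blocked threads release their carriers, so the blocked tasks overlap and total wall time drops.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class BlockingRampUp {
    // Run `tasks` blocking tasks on the given executor, return wall time in ms.
    static long runTasks(ExecutorService pool, int tasks) throws Exception {
        long start = System.nanoTime();
        var futures = IntStream.range(0, tasks)
                .mapToObj(i -> pool.submit(() -> {
                    Thread.sleep(50); // stand-in for a blocking downstream call
                    return i;
                }))
                .toList();
        for (var f : futures) f.get();
        pool.shutdown();
        return Duration.ofNanos(System.nanoTime() - start).toMillis();
    }

    public static void main(String[] args) throws Exception {
        int tasks = 200;
        // Fixed pool of 10 platform threads: tasks queue once all workers block.
        long pooled = runTasks(Executors.newFixedThreadPool(10), tasks);
        // One virtual thread per task: parked threads free their carriers.
        long virtual = runTasks(Executors.newVirtualThreadPerTaskExecutor(), tasks);
        System.out.println("fixed pool: " + pooled + " ms, virtual: " + virtual + " ms");
    }
}
```

If the concurrent-request count never exceeds the pool size (or the pool simply grows, as Liberty's does by default), the two columns converge and the virtual-thread advantage disappears, which is exactly the concern above.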
A number of years ago I tried to have a sane discussion about “non-blocking”, and I remember saying that “something” will block eventually no matter what… anything from the buffer being full on the NIC to your CPU being at anything less than 100%. Does it shake out to any real advantage?
In one project I had to basically turn a reactive framework into a one thread per request framework, because passing around the MDC (a kv map of extra logging information) was a horrible pain. Getting it to actually jump ship from thread to thread AND deleting it at the correct time was basically impossible.
Has that improved yet?
[1] https://davidvlijmincx.com/posts/virtual-thread-performance-...
It’s a shame this article paints a neutral (or even negative) experience with virtual threads.
We rewrote a boring CRUD app that spent 99% of its time waiting for the database to respond to be async/await from top to bottom. CPU and memory usage went way down on the web server because so many requests could be handled by far fewer threads.
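The thread economy described here (many requests waiting, few OS threads busy) is what virtual threads aim to deliver without the async/await rewrite: the handler stays in plain blocking style, and only the cheap virtual thread is parked while waiting. A hypothetical sketch, with `queryDb` standing in for a synchronous JDBC round-trip:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CrudSketch {
    // Simulated database call: the handler blocks, but only the cheap
    // virtual thread is parked; its carrier thread is freed to run others.
    static String queryDb(int id) throws InterruptedException {
        Thread.sleep(50); // stand-in for JDBC round-trip latency
        return "row-" + id;
    }

    public static void main(String[] args) {
        AtomicInteger handled = new AtomicInteger();
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            // 10,000 concurrent "requests", each written in straight-line
            // blocking style; no callbacks or futures in the handler itself.
            for (int i = 0; i < 10_000; i++) {
                final int id = i;
                exec.submit(() -> {
                    queryDb(id);
                    handled.incrementAndGet();
                    return null;
                });
            }
        } // close() waits for all tasks to finish
        System.out.println("handled " + handled.get() + " requests");
    }
}
```

Whether this beats a tuned async rewrite in practice is exactly what the case study puts in question; the win claimed for virtual threads is keeping the synchronous programming model, not raw throughput.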