Java Virtual Threads: A Case Study
Java Virtual Threads are a new Java concurrency feature. A case study by the Liberty performance engineering team found that they do not outperform Open Liberty's autonomic thread pool on typical cloud-native workloads: virtual threads ramp up faster but deliver lower throughput on CPU-intensive work, and memory-usage reductions are inconsistent. Collaboration with the OpenJDK Community is ongoing.
Java Virtual Threads have been introduced as a significant advancement in Java concurrent programming, aiming to provide a lightweight, scalable, and user-friendly concurrency model. However, a case study conducted by the Liberty performance engineering team found that virtual threads do not offer a clear advantage over Open Liberty's existing autonomic thread pool for typical cloud-native Java workloads. While virtual threads ramp up from idle to maximum throughput more quickly than the thread pool, they deliver lower throughput for CPU-intensive workloads, and their smaller per-thread footprint does not always translate into reduced overall memory usage. Some unexpected performance issues were also identified, prompting collaboration with the OpenJDK Community for further investigation. The study compared Liberty's thread pool against virtual threads on metrics including CPU throughput and ramp-up time across various scenarios. The findings suggest that virtual threads do not necessarily improve performance for CPU-intensive applications on a small number of CPUs, underscoring the importance of evaluating specific use cases before adopting this new Java feature.
Related
Migrating from Java 8 to Java 17 II: Notable API Changes Since Java 8
The article details API changes in Java versions 9 to 17, emphasizing improvements for Java 8 migrations. Changes include null handling, performance enhancements, string improvements, switch expressions, record classes, and utility additions for developer productivity and code readability.
Beating the L1 cache with value speculation (2021)
Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.
Atomicless Per-Core Concurrency
The article explores atomicless concurrency for efficient allocator design, transitioning from per-thread to per-CPU structures on Linux. It details implementing CPU-local data structures using restartable sequences and rseq syscall, addressing challenges in Rust.
Java Structured Concurrency Is More Than ShutdownOnFailure
Java 21 introduces structured concurrency for managing parallel sub-tasks within specific scopes. EnhancedTaskScope offers features like throttling, circuit breakers, default values on failure, and Critical tasks identification. ListTaskScope aids list conversions. Custom features can be added for extended functionality. StructuredTaskScope executes tasks in virtual threads efficiently.
Free-threaded CPython is ready to experiment with
CPython 3.13 introduces free-threading to enhance performance by allowing parallel threads without the GIL. Challenges like thread-safety and ABI compatibility are being addressed for future adoption as the default build.
So it looks like their goal was: try adopting a new technology without changing any of the aspects designed for an old technology and optimised around it.
What "CPU-intensive apps" did they test with? Surely not acmeair-authservice-java. A request does next to nothing: it authenticates a user and generates a token. I thought it would at least connect to some auth provider, but if I understand it correctly, it just uses a test config with a single test user (https://openliberty.io/docs/latest/reference/config/quickSta...), which would not be a blocking call.
If the request tasks don't block, this is not an interesting benchmark. Using virtual threads for non-blocking tasks is not useful.
So, let's hope that some of the tests were with tasks that block. The authors describe that a modest number of concurrent requests (< 10K) didn't show the increase in throughput that virtual threads promise. That's not a lot of concurrent requests, but one would expect an improvement in throughput once the number of concurrent requests exceeds the pool size. Except that may be hard to see because OpenLiberty's default is to keep spawning new threads (https://openliberty.io/blog/2019/04/03/liberty-threadpool-au...). I would imagine that in actual deployments with high concurrency, the pool size will be limited, to prevent the app from running out of memory.
If it never gets to the point where the number of concurrent requests significantly exceeds the pool size, this is not an interesting benchmark either.
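The point about blocking and pool size can be illustrated with a small sketch. This is not the Liberty benchmark; it is a minimal, hypothetical comparison in which every task blocks (simulated with `Thread.sleep` standing in for a slow downstream call). With a fixed platform-thread pool, tasks queue once all workers are blocked; with one virtual thread per task, blocked threads release their carriers, so the blocked tasks overlap and total wall time drops.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class BlockingRampUp {
    // Run `tasks` blocking tasks on the given executor, return wall time in ms.
    static long runTasks(ExecutorService pool, int tasks) throws Exception {
        long start = System.nanoTime();
        var futures = IntStream.range(0, tasks)
                .mapToObj(i -> pool.submit(() -> {
                    Thread.sleep(50); // stand-in for a blocking downstream call
                    return i;
                }))
                .toList();
        for (var f : futures) f.get();
        pool.shutdown();
        return Duration.ofNanos(System.nanoTime() - start).toMillis();
    }

    public static void main(String[] args) throws Exception {
        int tasks = 200;
        // Fixed pool of 10 platform threads: tasks queue once all workers block.
        long pooled = runTasks(Executors.newFixedThreadPool(10), tasks);
        // One virtual thread per task: parked threads free their carriers.
        long virtual = runTasks(Executors.newVirtualThreadPerTaskExecutor(), tasks);
        System.out.println("fixed pool: " + pooled + " ms, virtual: " + virtual + " ms");
    }
}
```

If the concurrent-request count never exceeds the pool size (or the pool simply grows, as Liberty's does by default), the two columns converge and the virtual-thread advantage disappears, which is exactly the concern above.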
A number of years ago I tried to have a sane discussion about “non-blocking”, and I remember saying that “something” will block eventually no matter what… anything from the buffer being full on the NIC to your CPU being at anything less than 100%. Does it shake out to any real advantage?
In one project I had to basically turn a reactive framework into a one thread per request framework, because passing around the MDC (a kv map of extra logging information) was a horrible pain. Getting it to actually jump ship from thread to thread AND deleting it at the correct time was basically impossible.
Has that improved yet?
[1] https://davidvlijmincx.com/posts/virtual-thread-performance-...
It’s a shame this article paints a neutral (or even negative) experience with virtual threads.
We rewrote a boring CRUD app that spent 99% of its time waiting for the database to respond to be async/await from top to bottom. CPU and memory usage went way down on the web server because so many requests could be handled by far fewer threads.
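The thread economy described here (many requests waiting, few OS threads busy) is what virtual threads aim to deliver without the async/await rewrite: the handler stays in plain blocking style, and only the cheap virtual thread is parked while waiting. A hypothetical sketch, with `queryDb` standing in for a synchronous JDBC round-trip:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CrudSketch {
    // Simulated database call: the handler blocks, but only the cheap
    // virtual thread is parked; its carrier thread is freed to run others.
    static String queryDb(int id) throws InterruptedException {
        Thread.sleep(50); // stand-in for JDBC round-trip latency
        return "row-" + id;
    }

    public static void main(String[] args) {
        AtomicInteger handled = new AtomicInteger();
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            // 10,000 concurrent "requests", each written in straight-line
            // blocking style; no callbacks or futures in the handler itself.
            for (int i = 0; i < 10_000; i++) {
                final int id = i;
                exec.submit(() -> {
                    queryDb(id);
                    handled.incrementAndGet();
                    return null;
                });
            }
        } // close() waits for all tasks to finish
        System.out.println("handled " + handled.get() + " requests");
    }
}
```

Whether this beats a tuned async rewrite in practice is exactly what the case study puts in question; the win claimed for virtual threads is keeping the synchronous programming model, not raw throughput.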