March 23rd, 2025

RDNA 4's “Out-of-Order” Memory Accesses

AMD's RDNA 4 architecture adds out-of-order memory accesses and multiple request queues to its memory subsystem, improving performance, especially in ray tracing. The changes are evolutionary, since Intel and Nvidia GPUs already use similar techniques.


AMD's RDNA 4 architecture changes how its memory subsystem orders requests. Requests from different shader waves can now complete independently of one another. On RDNA 3, memory requests were handled in strict order, which created false dependencies between waves: a wave whose data was ready could still be held up behind an unrelated wave's slower request. Testing showed that RDNA 4 eliminates these cross-wave delays, which helps most in workloads like ray tracing. RDNA 4 also has multiple out-of-order queues for memory requests within a wave, so a wave's threads can interleave different types of memory requests, improving overall throughput. The improvements are notable but evolutionary rather than revolutionary, because GPU architectures from Intel and Nvidia already implement similar techniques. Overall, the change is a meaningful step forward for AMD's GPU memory subsystem.

- RDNA 4 allows out-of-order memory accesses, improving performance by eliminating false dependencies.

- The architecture introduces multiple out-of-order queues for memory requests, enhancing efficiency.

- Improvements are particularly beneficial for ray tracing workloads, allowing simultaneous traversal and result handling.

- While significant, RDNA 4's enhancements are seen as evolutionary, with similar features present in other GPU architectures.

- The changes mark the most substantial update to AMD's GPU memory subsystem since the launch of RDNA in 2019.
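
The false dependency described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of AMD's hardware: the function name `completion_times`, the one-request-per-cycle issue rate, and the latencies (500 cycles for a miss, 50 for a hit) are all hypothetical. As a simplification, the relaxed mode still keeps each wave's own requests in order, which is stricter than the multiple per-wave queues the summary describes.

```python
from collections import defaultdict

def completion_times(requests, in_order_across_waves):
    """Toy model: one request is issued per cycle, in list order.

    requests: list of (wave_id, latency_in_cycles) tuples (hypothetical values).
    Returns {request_index: cycle at which data is handed back to the wave}.
    """
    done = {}
    last_global = 0                    # latest return time seen by any wave
    last_per_wave = defaultdict(int)   # latest return time seen by each wave
    for i, (wave, latency) in enumerate(requests):
        ready = i + latency            # request i is issued at cycle i
        if in_order_across_waves:
            # Strict ordering: data cannot return before any earlier request,
            # even one that belongs to a different wave.
            ret = max(ready, last_global)
            last_global = ret
        else:
            # Relaxed ordering: only earlier requests from the same wave
            # can delay the return.
            ret = max(ready, last_per_wave[wave])
            last_per_wave[wave] = ret
        done[i] = ret
    return done

# Wave A misses cache (slow); wave B hits (fast) but is issued after A.
requests = [("A", 500), ("B", 50)]
strict = completion_times(requests, in_order_across_waves=True)
relaxed = completion_times(requests, in_order_across_waves=False)
print(strict)   # {0: 500, 1: 500}
print(relaxed)  # {0: 500, 1: 51}
```

In the strict case wave B waits until cycle 500 even though its data was ready at cycle 51; in the relaxed case it continues at cycle 51. That gap is the kind of cross-wave delay the summary says RDNA 4 removes.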

6 comments
By @jauntywundrkind - 22 days
I've been super curious to see what was at stake here! This sounds better than I'd dared to hope for.

I kind of thought this was just gonna be some kind of deferred texture loading thing, help with streaming assets.

If it actually allows inter-warp sequencing, it sounds like it might possibly solve the chief complaint supreme GUI master Raph Levien recently had in "I want a good parallel computer": even though we can dynamically add shaders & construct a dynamic workgraph (largely thanks to VK_AMDX_shader_enqueue?), there isn't any sequencing/fencing/barrier-ing between the sections. https://raphlinus.github.io/gpu/2025/03/21/good-parallel-com... https://news.ycombinator.com/item?id=43440174

Not applicable to GPUs, but since I ran into it recently, it's interesting to see how io_uring handles sequenced submissions. Here's Lord of io_uring's write-up, https://unixism.net/loti/tutorial/link_liburing.html#link-li...

Edit: having read the article more fully, I'm not sure this is about waves depending on each other. Maybe more about them trying to access memory. Apologies. Hopefully someday!

By @Terr_ - 22 days
At first glance at the title, I thought it was going to be about some twist on DNA 3' and DNA 5' reading frames.

https://en.wikipedia.org/wiki/Reading_frame

By @IshKebab - 21 days
Presumably this didn't matter hugely because the memory access patterns for each wave are going to be extremely similar anyway?

Ah yeah he says that at the end. Doesn't really matter for rasterisation but might make more of a difference for ray tracing.

By @shmerl - 22 days
Does AMD have its own flavor of GPU assembly, and what is it called?