November 24th, 2024

Pushing AMD's Infinity Fabric to Its Limit

AMD's Infinity Fabric architecture shows improved memory bandwidth and latency management in Zen 5 compared to Zen 4, holding up better under load and benefiting from faster DDR5 memory.

AMD's Infinity Fabric architecture has been tested for memory latency and bandwidth across the Zen CPU generations by running latency-sensitive applications alongside bandwidth-hungry threads and observing how they interact under load. The results show that certain AMD chips are sensitive to thread placement, with core affinity choices producing significant latency spikes. The Infinity Fabric connects Core Complex Dies (CCDs) to an I/O die, enabling high core counts and modular system designs, but as bandwidth demands grow, latency can rise sharply, particularly when multiple threads contend for the same resources. Zen 5 manages bandwidth and latency better than Zen 4 under high load, and isolating bandwidth-intensive tasks to one CCD helps maintain lower latency for sensitive applications running on another CCD. Overall, the findings suggest that AMD's architecture has evolved to deliver more memory bandwidth while limiting latency impacts, helped by faster DDR5 memory and improved traffic-management policies in Zen 5.

- AMD's Infinity Fabric architecture allows for high core counts but can lead to latency spikes under heavy load.

- Thread placement and core affinity significantly affect latency performance in AMD chips.

- Zen 5 shows improved bandwidth and latency management compared to Zen 4, especially under high load.

- Isolating bandwidth-heavy tasks to one CCD can help maintain lower latency for sensitive applications (see the affinity sketch after this list).

- Faster DDR5 memory enhances overall performance in the latest AMD architectures.
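To make the CCD-isolation point concrete, the sketch below pins a latency-sensitive thread and a bandwidth-hungry thread to different CCDs using Linux CPU affinity. This is a minimal illustration, not the article's test harness: the core ranges (0-7 for CCD0, 8-15 for CCD1) are an assumption for a 16-core, two-CCD part and should be verified with `lscpu -e` on the actual machine.

```c
/* Minimal affinity sketch (Linux, glibc). Pins a latency-sensitive thread to
 * CCD0 and a bandwidth-hungry thread to CCD1 on an assumed 16-core, two-CCD
 * part where cores 0-7 sit on CCD0 and 8-15 on CCD1. Verify the mapping with
 * `lscpu -e` before relying on it. Build with: gcc -O2 -pthread affinity.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void pin_to_cores(pthread_t thread, int first, int last) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &set);
    int err = pthread_setaffinity_np(thread, sizeof(set), &set);
    if (err)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
}

/* Placeholder workloads: a real test would pointer-chase in one thread
 * and stream through a large buffer in the other. */
static void *latency_sensitive(void *arg) { (void)arg; return NULL; }
static void *bandwidth_hog(void *arg)     { (void)arg; return NULL; }

int main(void) {
    pthread_t lat, bw;
    pthread_create(&lat, NULL, latency_sensitive, NULL);
    pthread_create(&bw,  NULL, bandwidth_hog,     NULL);
    pin_to_cores(lat, 0, 7);   /* keep the latency probe on CCD0      */
    pin_to_cores(bw,  8, 15);  /* confine the streaming load to CCD1  */
    pthread_join(lat, NULL);
    pthread_join(bw,  NULL);
    return 0;
}
```

The same isolation can be done from the shell, e.g. `taskset -c 0-7` for the latency-sensitive process and `taskset -c 8-15` for the bandwidth load.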

5 comments
By @majke - 5 months
This has puzzled me for a while. The cited system has 2x89.6 GB/s bandwidth. But a single CCD can do at most 64GB/s of sequential reads. Are claims like "Apple Silicon having 400GB/s" meaningless? I understand a typical single logical CPU can't do more than 50-70GB/s, and it seems like a group of CPUs typically shares a memory controller which is similarly limited.

To rephrase: is it possible to reach 100% memory bandwidth utilization with only 1 or 2 CPUs doing the work per CCD?
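One way to get a feel for that question is a crude streaming-read probe that reports bandwidth as a function of thread count. The sketch below is only illustrative, not the article's methodology: buffer size, pass count, and the absence of affinity control are simplifying assumptions, and the timed region includes the initial page-touch pass, so treat the number as a lower bound.

```c
/* Crude read-bandwidth probe: ./probe <threads> spawns N threads, each of
 * which streams through its own 512 MiB buffer four times. Buffer size and
 * pass count are arbitrary; no affinity or NUMA control is applied.
 * Build with: gcc -O2 -pthread probe.c -o probe */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (512ULL << 20)   /* 512 MiB per thread */
#define PASSES    4

static void *stream_read(void *arg) {
    (void)arg;
    size_t n = BUF_BYTES / sizeof(uint64_t);
    uint64_t *buf = malloc(BUF_BYTES);
    volatile uint64_t sink = 0;                  /* keep loads from being optimized away */
    for (size_t i = 0; i < n; i++) buf[i] = i;   /* touch pages (included in the timing) */
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < n; i++) sink += buf[i];
    free(buf);
    return NULL;
}

int main(int argc, char **argv) {
    int threads = argc > 1 ? atoi(argv[1]) : 2;
    pthread_t *t = malloc(sizeof(*t) * threads);
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < threads; i++) pthread_create(&t[i], NULL, stream_read, NULL);
    for (int i = 0; i < threads; i++) pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    double gib  = (double)threads * PASSES * BUF_BYTES / (1ULL << 30);
    printf("%d thread(s): ~%.1f GiB/s sequential read (lower bound)\n", threads, gib / secs);
    free(t);
    return 0;
}
```

Running it with 1, 2, 4, ... threads confined to a single CCD (e.g. `taskset -c 0-7 ./probe 2`) would show where per-CCD read bandwidth stops scaling with thread count.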

By @cebert - 5 months
George’s detailed analysis always impresses me. I’m amazed by his attention to detail.
By @Agingcoder - 5 months
Proper thread placement and NUMA handling do have a massive impact on modern AMD CPUs - significantly more so than on Xeon systems. This might be anecdotal, but I’ve seen performance improve by 50% on some real-world workloads.
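For the NUMA side of that point, a small libnuma sketch: allocate a thread's working set on the memory node local to the core it runs on, so accesses stay local rather than crossing to a remote node. This is only relevant on systems that expose multiple NUMA nodes (e.g. multi-socket or NPS2/NPS4-configured Epyc; desktop Ryzen parts report a single node), and it is a hedged example under those assumptions, not something taken from the article.

```c
/* Sketch: NUMA-local allocation with libnuma (compile with -lnuma).
 * Only meaningful on systems exposing more than one NUMA node. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports NUMA is unavailable here\n");
        return 1;
    }
    int cpu  = sched_getcpu();            /* core this thread is currently on    */
    int node = numa_node_of_cpu(cpu);     /* memory node attached to that core   */
    size_t bytes = 64ULL << 20;           /* 64 MiB working set (arbitrary size) */
    void *buf = numa_alloc_onnode(bytes, node);   /* allocate on the local node  */
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    printf("cpu %d -> node %d, local buffer at %p\n", cpu, node, buf);
    /* ... latency-sensitive work against buf would go here ... */
    numa_free(buf, bytes);
    return 0;
}
```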
By @AbuAssar - 5 months
Great deep dive into AMD's Infinity Fabric! The balance between bandwidth, latency, and clock speeds shows both clever engineering and limits under pressure. Makes me wonder how these trade-offs will evolve in future designs. Thoughts?