Zen5's AVX512 Teardown and More
AMD's Zen5 architecture enhances its AVX512 capabilities with a fully native implementation achieving 4 x 512-bit execution throughput, while facing thermal throttling challenges under sustained loads. It shows significant performance gains, especially in high-performance computing.
Zen5, AMD's latest processor architecture, introduces significant advancements in AVX512 capabilities, marking a notable improvement over its predecessor, Zen4. Unlike Zen4, which utilized a "double-pumping" method to handle 512-bit instructions through 256-bit hardware, Zen5 features a fully native implementation with expanded datapaths and execution units capable of 4 x 512-bit execution throughput. This leap allows Zen5 to outperform Intel in SIMD execution, a domain where Intel has historically led. However, the mobile variant, Strix Point, retains a 256-bit throughput, indicating a bifurcation in AMD's architecture. Performance metrics for Zen5 show varied IPC improvements across workloads, with a notable 96-98% increase in AVX512 performance, while other areas, such as 128-bit SSE, experienced regressions. The architecture also faces challenges with thermal throttling under sustained AVX512 workloads, although AMD's approach to throttling differs from Intel's, aiming to minimize negative impacts. Overall, Zen5's enhancements position AMD favorably against Intel, particularly in high-performance computing scenarios.
- Zen5 features a fully native AVX512 implementation, doubling the execution throughput compared to Zen4.
- The architecture bifurcates into different core types, with mobile variants maintaining 256-bit throughput.
- IPC improvements vary significantly across workloads, with AVX512 showing the most substantial gains.
- Zen5 experiences thermal throttling under sustained AVX512 workloads, but AMD's throttling method aims to reduce negative effects.
- AMD's advancements in SIMD execution mark a significant shift in competitive dynamics with Intel.
Related
The AMD Zen 5 Microarchitecture
AMD introduced Zen 5 CPU microarchitecture at Computex 2024, launching Ryzen AI 300 for mobile and Ryzen 9000 for desktops. Zen 5 offers improved IPC, dual-pipe fetch, and advanced branch prediction. Ryzen AI 300 includes XDNA 2 NPU and RDNA 3.5 graphics, while Ryzen 9000 supports up to 16 cores and 5.7 GHz boost clock.
A Video Interview with Mike Clark, Chief Architect of Zen at AMD
The interview with AMD's Chief Architect discussed Zen 5's enhancements like improved branch predictor and schedulers. It optimizes single-threaded and multi-threaded performance, focusing on compute capabilities and efficiency.
The AMD Zen 5 Microarchitecture
AMD revealed Zen 5 microarchitecture at Computex 2024, launching Ryzen AI 300 series for mobile and Ryzen 9000 series for desktop. Zen 5 brings enhanced performance with XDNA 2 NPU, RDNA 3.5 graphics, and 16% better IPC than Zen 4.
An interview with AMD's Mike Clark, 'Zen Daddy' says 3nm Zen 5 is coming fast
AMD's Mike Clark discusses Zen 5 architecture, covering 4nm and 3nm nodes. 4nm chips launch soon, with 3nm to follow. Zen 'c' cores may integrate into desktop processors. Zen 5 enhances Ryzen CPUs with full AVX-512 acceleration, emphasizing design balance for optimal performance.
Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old Idea Allows for New Tricks
AMD's Zen 5 architecture features a new 2-Ahead Branch Predictor Unit, enhancing instruction fetching and execution efficiency, particularly for x86 architectures, and significantly improving single-core performance.
It is an intrinsic problem with SIMD as we know it that you have to recode your application to support new instructions, which is a big hassle. Most people and companies will give up on supporting the latest and greatest and will ship binaries that meet the lowest common denominator. For instance it took forever for Microsoft to rely on instructions that were available almost 15 years ago.
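The usual mitigation, for what it's worth, is runtime dispatch / function multiversioning, so a single binary can use AVX512 where it exists without raising its baseline. Here is a minimal sketch (mine, not from the article) using the GCC/Clang target_clones attribute; the function name and body are purely illustrative:

    /* Build with a recent GCC or Clang. The compiler emits a default build and
       an AVX-512 build of the same function and picks one at load time via an
       ifunc resolver, based on what CPUID reports. */
    #include <stddef.h>

    __attribute__((target_clones("avx512f", "default")))
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];   /* auto-vectorized per clone */
    }

It doesn't solve the deployment problem described above, but it at least avoids shipping a lowest-common-denominator-only binary.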
As Charlie Demerjian has pointed out for years, consumers are waking up to the fact that a 2024 craptop isn't much better than a 2014 crapbook, and there is zero credibility in claims about "Ultrabooks", "AI PCs", etc. What could make a difference is a coordinated end-to-end effort: deploy the latest developments as quickly as possible across as much of the product line as possible, get tooling support for them, and drive developers to adopt them quickly. As it is, Intel will boast about how the national labs are blowing up H-bombs in VR faster than they ever have, and Facebook is profiling users more efficiently than before, without realizing that customers don't believe these advances will make a difference for them. So instead of buying a new PC that might deliver better performance when software (maybe) catches up in 7-8 years, they are going to hold on to their old machines longer.
On a completely different topic: I wasn't expecting the portions redacted due to AMD's embargo to take that much away from the article, but the first half (up until the discussion about AVX512) would clearly be much more interesting with the censored parts included. I guess someone will have to resubmit this come August 14th!
It looks like Zen5's support is essentially the dream: all EUs and load/store are expanded to 512 bits, so you can sustain 2 x 512-bit FMAs and 2 x 512-bit adds every cycle. There also appears to be essentially no transition penalty into the full-power state, which is incredible.
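For what it's worth, here is a rough sketch (my own, not the author's test) of the kind of unrolled kernel one would use to probe that peak rate. The chain counts assume roughly 4-cycle FMA and 2-3 cycle FADD latency, which may not match Zen5 exactly:

    #include <immintrin.h>

    /* Probe kernel: enough independent accumulator chains to hide latency so
       the 2 FMA pipes and 2 FADD pipes can each issue every cycle.
       Build with e.g. -O3 -mavx512f. 8 FMAs + 8 adds per iteration should
       retire in roughly 4 cycles if the 4 x 512-bit rate holds. */
    double throughput_probe(long iters)
    {
        __m512d f[8], a[8];
        for (int i = 0; i < 8; i++) {
            f[i] = _mm512_set1_pd(1.0 + i);
            a[i] = _mm512_set1_pd(2.0 + i);
        }
        const __m512d x = _mm512_set1_pd(1.000001);
        const __m512d y = _mm512_set1_pd(0.999999);

        for (long n = 0; n < iters; n++) {
            for (int i = 0; i < 8; i++) f[i] = _mm512_fmadd_pd(f[i], x, y); /* 8 independent FMAs */
            for (int i = 0; i < 8; i++) a[i] = _mm512_add_pd(a[i], y);      /* 8 independent adds */
        }

        __m512d s = _mm512_setzero_pd();
        for (int i = 0; i < 8; i++) s = _mm512_add_pd(s, _mm512_add_pd(f[i], a[i]));
        return _mm512_reduce_add_pd(s);  /* keep results live so nothing is optimized away */
    }

Timing a large iteration count and dividing should land near 4 cycles per outer iteration if the quoted rate holds.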
The only sad thing here is that all this work to enable full-width AVX512 is going to be mostly wasted, since approximately 0% of client software will get recompiled to an AVX512 baseline for decades, if ever. But if you can compile for your own targets, or JIT for it... it looks really good.
> Hazards Fixed:
> V(P)COMPRESS store to memory is fixed. (3 cycles/store to non-overlapping addresses)
> The super-alignment hazard is fixed.
I initially tried searching for the string, but the () thwarted that.
It used to be 142 cycles/instruction for "vpcompressd [mem]{k}, zmm" in zen4.
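For context, the instruction in question is the masked compress-store, which packs the selected lanes of a register contiguously into memory. Here is a minimal sketch (mine, not from the article) of the kind of filter loop it is used for; the function name and the non-negative predicate are just illustrative:

    #include <immintrin.h>
    #include <stddef.h>

    /* Copy the non-negative ints from src to dst, returning how many were kept.
       _mm512_mask_compressstoreu_epi32() is the intrinsic that emits
       "vpcompressd [mem]{k}, zmm". Assumes n is a multiple of 16 to keep the
       sketch short; build with e.g. -mavx512f. */
    size_t keep_nonnegative(int *dst, const int *src, size_t n)
    {
        size_t out = 0;
        const __m512i zero = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 16) {
            __m512i v = _mm512_loadu_si512(src + i);
            __mmask16 k = _mm512_cmpge_epi32_mask(v, zero);      /* lanes >= 0 */
            _mm512_mask_compressstoreu_epi32(dst + out, k, v);   /* packed store */
            out += (size_t)__builtin_popcount((unsigned)k);
        }
        return out;
    }

At 142 cycles per compress-store this kind of loop was effectively unusable on Zen4; at 3 cycles per store it becomes practical.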
His intelligence and openness, despite no one paying him for it, shines such a bad light on the terrible state of academia. That he was considered a "bad student" is near-proof in and of itself that our system judges people catastrophically poorly.
"Intel added AVX512-VP2INTERSECT to Tiger Lake. But it was really slow. (microcoded ~25 cycles/46 uops) It was so slow that someone found a better way to implement its functionality without using the instruction itself. Intel deprecates the instruction and removes it from all processors after Tiger Lake. (ignoring the fact that early Alder Lake unofficially also had it) AMD adds it to Zen5. So just as Intel kills off VP2INTERSECT, AMD shows up with it. Needless to say, Zen5 had probably already taped out by the time Intel deprecated the instruction. So VP2INTERSECT made it into Zen5's design and wasn't going to be removed.
But how good is AMD's implementation? Let's look at AIDA64's dumps for Granite Ridge:
AVX512_VP2INTERSECT :VP2INTERSECTQ k1+1, zmm, zmm L: [diff. reg. set] T: 0.23ns= 1.00c
Yes, that's right. 1 cycle throughput. ONE cycle. I can't... I just can't...
Intel was so bad at this that they dropped the instruction. And now AMD finally appears and shows them how it's done - 2 years too late."
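For anyone who hasn't used it: VP2INTERSECT takes two vectors of keys and produces two masks marking which lanes of each vector have a match somewhere in the other, which is useful for things like sparse or sorted-set intersection. A tiny self-contained sketch (mine, not from the linked article) of the quadword form, i.e. the VP2INTERSECTQ that AIDA64 reports above:

    #include <immintrin.h>
    #include <stdio.h>

    /* Build with e.g. -mavx512f -mavx512vp2intersect. */
    int main(void)
    {
        __m512i a = _mm512_setr_epi64(1, 2, 3, 4, 5, 6, 7, 8);
        __m512i b = _mm512_setr_epi64(8, 10, 3, 12, 1, 14, 15, 16);

        __mmask8 ka, kb;
        _mm512_2intersect_epi64(a, b, &ka, &kb);  /* vp2intersectq */

        /* ka marks lanes of a found in b (values 1, 3, 8 -> bits 0, 2, 7);
           kb marks lanes of b found in a (values 8, 3, 1 -> bits 0, 2, 4). */
        printf("ka = 0x%02x, kb = 0x%02x\n", (unsigned)ka, (unsigned)kb);
        return 0;
    }

Emulating that pair of masks without the instruction takes a pile of compares and shuffles, which is why a 1-cycle implementation matters.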
While, as he says, Intel deprecated VP2INTERSECT after Tiger Lake, they have since changed their mind and added it back in the Granite Rapids server CPU, which will launch in a few months.
Moreover, the Granite Rapids ISA is considered to be AVX10.1, and all of its instructions, including VP2INTERSECT, will be a mandatory part of the ISA of all future Intel CPUs from 2026 on.
Therefore it is good that AMD has achieved an excellent implementation of VP2INTERSECT, which they will be able to carry into their future designs.
It remains to be seen whether the Intel Granite Rapids implementation of VP2INTERSECT is also good.
> This section has been redacted until August 14.
Could you repost it then?