August 7th, 2024

Zen5's AVX512 Teardown and More

AMD's Zen5 architecture enhances AVX512 capabilities with native implementation, achieving 4 x 512-bit throughput, while facing thermal throttling challenges. It shows significant performance gains, especially in high-performance computing.

Zen5, AMD's latest processor architecture, introduces significant advancements in AVX512 capabilities, marking a notable improvement over its predecessor, Zen4. Unlike Zen4, which utilized a "double-pumping" method to handle 512-bit instructions through 256-bit hardware, Zen5 features a fully native implementation with expanded datapaths and execution units capable of 4 x 512-bit execution throughput. This leap allows Zen5 to outperform Intel in SIMD execution, a domain where Intel has historically led. However, the mobile variant, Strix Point, retains a 256-bit throughput, indicating a bifurcation in AMD's architecture. Performance metrics for Zen5 show varied IPC improvements across workloads, with a notable 96-98% increase in AVX512 performance, while other areas, such as 128-bit SSE, experienced regressions. The architecture also faces challenges with thermal throttling under sustained AVX512 workloads, although AMD's approach to throttling differs from Intel's, aiming to minimize negative impacts. Overall, Zen5's enhancements position AMD favorably against Intel, particularly in high-performance computing scenarios.

- Zen5 features a fully native AVX512 implementation, doubling the execution throughput compared to Zen4.

- The architecture bifurcates into different core types, with mobile variants maintaining 256-bit throughput.

- IPC improvements vary significantly across workloads, with AVX512 showing the most substantial gains.

- Zen5 experiences thermal throttling under sustained AVX512 workloads, but AMD's throttling method aims to reduce negative effects.

- AMD's advancements in SIMD execution mark a significant shift in competitive dynamics with Intel.

Related

The AMD Zen 5 Microarchitecture

AMD introduced Zen 5 CPU microarchitecture at Computex 2024, launching Ryzen AI 300 for mobile and Ryzen 9000 for desktops. Zen 5 offers improved IPC, dual-pipe fetch, and advanced branch prediction. Ryzen AI 300 includes XDNA 2 NPU and RDNA 3.5 graphics, while Ryzen 9000 supports up to 16 cores and 5.7 GHz boost clock.

A Video Interview with Mike Clark, Chief Architect of Zen at AMD

The interview with AMD's Chief Architect discussed Zen 5's enhancements like improved branch predictor and schedulers. It optimizes single-threaded and multi-threaded performance, focusing on compute capabilities and efficiency.

The AMD Zen 5 Microarchitecture

AMD revealed Zen 5 microarchitecture at Computex 2024, launching Ryzen AI 300 series for mobile and Ryzen 9000 series for desktop. Zen 5 brings enhanced performance with XDNA 2 NPU, RDNA 3.5 graphics, and 16% better IPC than Zen 4.

An interview with AMD's Mike Clark, 'Zen Daddy' says 3nm Zen 5 is coming fast

AMD's Mike Clark discusses Zen 5 architecture, covering 4nm and 3nm nodes. 4nm chips launch soon, with 3nm to follow. Zen 'c' cores may integrate into desktop processors. Zen 5 enhances Ryzen CPUs with full AVX-512 acceleration, emphasizing design balance for optimal performance.

Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old Idea Allows for New Tricks

AMD's Zen 5 architecture features a new 2-Ahead Branch Predictor Unit, enhancing instruction fetching and execution efficiency, particularly for x86 architectures, and significantly improving single-core performance.

15 comments
By @PaulHoule - 5 months
Intel's handling of SIMD is representative of Intel's value-subtracting principles that have caused Intel to stagnate in the past 15 years or so.

It is an intrinsic problem with SIMD as we know it that you have to recode your application to support new instructions, which is a big hassle. Most people and companies will give up on supporting the latest and greatest and will ship binaries that meet the lowest common denominator. For instance it took forever for Microsoft to rely on instructions that were available almost 15 years ago.

As Charlie Demerjian has pointed out for years, consumers are waking up to the fact that a 2024 craptop isn't much better than a 2014 crapbook, and there is zero credibility in claims about "Ultrabooks", "AI PCs", etc. What could make a difference is a coordinated end-to-end effort to deploy the latest developments as quickly as possible across as much of the product line as possible, to get tooling support for them, and to drive developers to adopt them as quickly as possible. As it is, Intel will boast about how the national labs are blowing up H-bombs in VR faster than they ever have, and how Facebook is profiling users more efficiently than before, and not realize that customers don't believe these advances are going to make a difference for them. So instead of buying a new PC, which might deliver better performance when software (maybe) catches up in 7-8 years, they are going to hold on to old machines longer.

By @ComputerGuru - 5 months
Great article. It really drives home what a damn shame Intel's persistent mishandling of AVX512 has been ever since its introduction. I don't even know if it has a future outside of extremely niche libraries given how scattered hardware support for it is on Intel's side.

On a completely different topic: I wasn't expecting the portions redacted due to AMD's embargo to take away much from the article, but the first half (up until the discussion about AVX512) would clearly be much more interesting with the censored parts restored. I guess someone will have to resubmit this come August 14th!

By @drewg123 - 5 months
The most interesting bit about this article for me is the "transition time" to get the power needed to use AVX-256 or AVX-512, which is present on Intel but not on AMD Zen4/Zen5. It explains some behavior that I saw years ago when implementing kTLS on FreeBSD, and validates our design of having per-core kTLS crypto worker threads, rather than doing the crypto in the context of sosend() or sendfile's tcp_usr_ready().
By @Remnant44 - 5 months
AMD's AVX512 implementation is just lovely and they seem to be firing on all cylinders for it. Zen4 was already great, 'double pumped' or no.

It looks like Zen5's support is essentially the dream: all EUs and load/store are expanded to 512 bits, so you can sustain two 512-bit FMAs and two 512-bit adds every cycle. There also appears to be essentially no transition penalty to the full-power state, which is incredible.
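To put the "two FMAs and two adds per cycle" figure in concrete terms, here is a small back-of-the-envelope sketch. It assumes an FMA counts as two floating-point operations per lane and a plain add as one; the function name and the 256-bit comparison row are illustrative, not taken from the article.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_pipes, add_pipes):
    """Peak floating-point operations per cycle for one core.

    Assumption: an FMA counts as two operations (multiply + add) per lane,
    and a plain add counts as one.
    """
    lanes = vector_bits // element_bits
    return lanes * (2 * fma_pipes + add_pipes)


# Figures quoted in the comment: two 512-bit FMAs plus two 512-bit adds per cycle.
zen5_fp32 = peak_flops_per_cycle(512, 32, fma_pipes=2, add_pipes=2)
zen5_fp64 = peak_flops_per_cycle(512, 64, fma_pipes=2, add_pipes=2)

# Hypothetical comparison: the same pipe counts on a 256-bit datapath.
half_width_fp32 = peak_flops_per_cycle(256, 32, fma_pipes=2, add_pipes=2)

print(zen5_fp32, zen5_fp64, half_width_fp32)  # 96 48 48
```

Multiplying the per-cycle figure by clock speed and core count gives a theoretical ceiling only; sustained numbers depend on memory bandwidth and the throttling behavior discussed in the article.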

The only sad thing here is that all this work to enable full-width AVX512 is going to be mostly wasted, as approximately 0% of all client software will get recompiled to an AVX512 baseline for decades, if ever. But if you can compile for your own targets, or JIT for it... it looks really good.
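One common way around the baseline problem is runtime dispatch: ship several code paths and pick one based on what the CPU reports. The sketch below shows the idea only. It assumes a Linux-style `/proc/cpuinfo` with a `flags` line containing names such as `avx2` and `avx512f`; the kernel names returned by `pick_kernel` are hypothetical labels.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from Linux /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read flags from the running system; empty set if the file is unavailable."""
    try:
        with open(path) as f:
            return parse_cpu_flags(f.read())
    except OSError:
        return set()


def pick_kernel(flags):
    """Choose the widest code path the CPU reports support for.

    The returned names are hypothetical labels for separately built kernels.
    """
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "baseline"


sample = "model name\t: example\nflags\t\t: fpu sse2 avx avx2 avx512f avx512vl\n"
print(pick_kernel(parse_cpu_flags(sample)))  # avx512
```

Native code would typically do the same check with CPUID or compiler-provided helpers; the design point is that the baseline binary still runs everywhere while newer hardware gets the wider path.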

By @stagger87 - 5 months
Seeing two consumer CPU generations in a row not only support but improve AVX512 capabilities will hopefully go a long way towards regaining the confidence of the developers that use AVX512 in the consumer space. I know I personally have been holding back as I watched Intel fumble AVX512 for the last 10 years. With their even more recent fumbles, there could be a near future where AMD CPUs have majority market share in both desktop and mobile. Great news for developers that can use AVX512.
By @Manabu-eo - 5 months
AMD fixed vpcompressd in Zen5:

> Hazards Fixed:

> V(P)COMPRESS store to memory is fixed. (3 cycles/store to non-overlapping addresses)

> The super-alignment hazard is fixed.

I initially tried searching for the string, but the () thwarted that.

It used to be 142 cycles/instruction for "vpcompressd [mem]{k}, zmm" on Zen4.
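For readers unfamiliar with the instruction, the following is a scalar reference model of what a masked compress-store does, as I understand its documented semantics. It is not an intrinsic or a performance model; `compress_store` and its parameters are names made up for this sketch.

```python
def compress_store(memory, base, src, mask):
    """Scalar model of a masked compress-store such as `vpcompressd [mem]{k}, zmm`.

    Elements of `src` whose mask bit is set are written contiguously to
    `memory` starting at index `base`; no other locations are written.
    Returns the number of elements stored.
    """
    out = base
    for i, value in enumerate(src):
        if (mask >> i) & 1:
            memory[out] = value
            out += 1
    return out - base


# Typical use: filter a stream, keeping only elements that pass a predicate.
src = list(range(16))  # one 512-bit register holds 16 dwords
mask = sum(1 << i for i, v in enumerate(src) if v % 3 == 0)
memory = [-1] * 16
count = compress_store(memory, 0, src, mask)
print(count, memory[:count])  # 6 [0, 3, 6, 9, 12, 15]
```

Because the store advances by a data-dependent amount, it is the natural building block for vectorized filtering, which is why a slow memory-destination form was painful.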

By @kolbe - 5 months
Mysticial (the author) does such fantastic work for the CS community. I really like that guy. A compilation of his Stack Overflow answers on SIMD would be better than any available book.

His intelligence and openness, despite no one paying him for it, shines such a bad light on the terrible state of academia. That he was considered a "bad student" is near-proof in and of itself that our system judges people catastrophically poorly.

By @ipsum2 - 5 months
The part of the article I found most amusing:

"Intel added AVX512-VP2INTERSECT to Tiger Lake. But it was really slow. (microcoded ~25 cycles/46 uops) It was so slow that someone found a better way to implement its functionality without using the instruction itself. Intel deprecates the instruction and removes it from all processors after Tiger Lake. (ignoring the fact that early Alder Lake unofficially also had it) AMD adds it to Zen5. So just as Intel kills off VP2INTERSECT, AMD shows up with it. Needless to say, Zen5 had probably already taped out by the time Intel deprecated the instruction. So VP2INTERSECT made it into Zen5's design and wasn't going to be removed.

But how good is AMD's implementation? Let's look at AIDA64's dumps for Granite Ridge:

AVX512_VP2INTERSECT :VP2INTERSECTQ k1+1, zmm, zmm L: [diff. reg. set] T: 0.23ns= 1.00c

Yes, that's right. 1 cycle throughput. ONE cycle. I can't... I just can't...

Intel was so bad at this that they dropped the instruction. And now AMD finally appears and shows them how it's done - 2 years too late."
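As context for the quote, this is a scalar reference model of what VP2INTERSECT computes, based on my reading of the instruction's documented behavior: it compares every element of one vector against every element of the other and produces a pair of masks. The function name is made up for this sketch, and the model says nothing about how either vendor implements it.

```python
def vp2intersect(a, b):
    """Scalar model of VP2INTERSECT: returns two bitmasks (k_a, k_b).

    Bit i of k_a is set if a[i] equals any element of b;
    bit j of k_b is set if b[j] equals any element of a.
    """
    k_a = k_b = 0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                k_a |= 1 << i
                k_b |= 1 << j
    return k_a, k_b


# Eight qwords per 512-bit register, as in the VP2INTERSECTQ zmm form.
a = [1, 3, 5, 7, 9, 11, 13, 15]
b = [2, 3, 4, 5, 6, 7, 8, 9]
k_a, k_b = vp2intersect(a, b)
print(f"{k_a:08b} {k_b:08b}")  # 00011110 10101010
```

The all-pairs comparison is what makes the instruction attractive for sorted-set intersection, and also what makes a one-cycle throughput notable.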

By @adrian_b - 5 months
I want to add a fact about which the author was not aware.

As he says, Intel deprecated VP2INTERSECT after Tiger Lake. However, they have since changed their mind and added it back in the server CPU Granite Rapids, which will be launched in a few months.

Moreover, the ISA of Granite Rapids is considered to be AVX10.1 and all its instructions, including VP2INTERSECT, will be a mandatory part of the ISA of all future Intel CPUs from 2026 on.

Therefore it is good that AMD has achieved an excellent implementation of VP2INTERSECT, which they will be able to carry into their future designs.

It remains to be seen whether the Intel Granite Rapids implementation of VP2INTERSECT is also good.

By @jiggawatts - 5 months
Something I've wondered about is why CPUs have separate scalar and vector ALUs. Would it be possible to simply have 16x scalar ALUs that can be used either individually or ganged together in groups of 4, 8, or 16 to execute the various vector instructions?
By @pixelpoet - 5 months
What an excellent writeup, thx for sharing
By @Cold_Miserable - 5 months
Does Zen 5 have cldemote or senduipi?
By @andy_xor_andrew - 5 months
Out of curiosity, what applications might I see this used for?
By @altairprime - 5 months
Ugh, half of this article is

> This section has been redacted until August 14.

Could you repost it then?