Zen 5's 2-Ahead Branch Predictor: How a 30 Year Old Idea Allows for New Tricks
AMD's Zen 5 architecture features a new 2-Ahead Branch Predictor Unit, enhancing instruction fetching and execution efficiency, particularly for x86 architectures, and significantly improving single-core performance.
AMD's Zen 5 architecture introduces a significant enhancement with its new 2-Ahead Branch Predictor Unit, which builds on concepts from research dating back 30 years. This redesign aims to improve instruction fetching and execution efficiency in modern microprocessors. The branch predictor addresses the challenge of conditional jumps in program execution, which can stall the pipeline and waste processing time. By predicting instruction sequences, the processor can maintain a filled pipeline, thus enhancing performance.
The 2-Ahead Branch Predictor allows the processor to look ahead in the instruction stream, enabling it to handle two taken branches per cycle. This capability is facilitated by dual-porting the instruction fetch and operation cache, allowing for more efficient data handling. The architecture supports three prediction windows, optimizing instruction decoding and reducing bandwidth hits when branches are taken.
The design is particularly beneficial for x86 architectures, which face unique challenges due to their variable-length instruction sets. The 2-Ahead Branch Predictor's implementation in Zen 5 is expected to significantly enhance single-core performance, a focus that has resurfaced as technology advances. Overall, this innovation positions Zen 5 as a critical step forward for AMD, setting the stage for future developments in the Zen architecture.
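To give a rough intuition, here is a minimal conceptual sketch, not AMD's actual design: the table layout, block addresses, and names below are invented for illustration. The idea is a branch-target table whose entries record the predicted targets of the next two taken branches, so a single lookup per cycle can steer two fetch pipes at once.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Toy model: each entry maps a fetch block's address to the predicted
// targets of the next two taken branches.  A conventional predictor would
// store only `next1`; also storing `next2` lets a (hypothetical) dual-ported
// front end fetch two blocks per cycle from one prediction.
struct TwoAheadEntry {
    uint64_t next1;  // predicted target one taken branch ahead
    uint64_t next2;  // predicted target two taken branches ahead
};

int main() {
    // Hypothetical hot loop whose taken branches walk blocks A -> B -> C -> A ...
    std::unordered_map<uint64_t, TwoAheadEntry> btb = {
        {0x1000, {0x2000, 0x3000}},  // from A, predict B then C
        {0x2000, {0x3000, 0x1000}},  // from B, predict C then A
        {0x3000, {0x1000, 0x2000}},  // from C, predict A then B
    };

    uint64_t pc = 0x1000;
    for (int cycle = 0; cycle < 4; ++cycle) {
        const TwoAheadEntry& e = btb.at(pc);
        // One prediction covers two fetch blocks, so both can be handed to
        // the dual fetch/op-cache pipes in the same cycle.
        std::cout << "cycle " << cycle << ": fetch 0x" << std::hex
                  << e.next1 << " and 0x" << e.next2 << std::dec << "\n";
        pc = e.next2;  // the next prediction starts two blocks ahead
    }
}
```

The real front end works in terms of prediction windows and the dual-ported fetch and op-cache paths described above; the sketch only captures the "one prediction, two fetch blocks per cycle" idea.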
Related
Microbenchmarking Return Address Branch Prediction (2018)
Modern processors use branch predictors like RAS to boost performance by predicting control flow. Microbenchmarking on Intel and AMD processors reveals RAS behavior, accuracy, and limitations, emphasizing accurate branch prediction for high performance.
Beating the L1 cache with value speculation (2021)
Value speculation leverages the branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on a Xeon E5-1650 v3, it boosts throughput from 14 GB/s to 30 GB/s by predicting linked list nodes; a rough sketch of the trick follows this list.
A Video Interview with Mike Clark, Chief Architect of Zen at AMD
The interview with AMD's Chief Architect discussed Zen 5's enhancements like improved branch predictor and schedulers. It optimizes single-threaded and multi-threaded performance, focusing on compute capabilities and efficiency.
The AMD Zen 5 Microarchitecture
AMD revealed Zen 5 microarchitecture at Computex 2024, launching Ryzen AI 300 series for mobile and Ryzen 9000 series for desktop. Zen 5 brings enhanced performance with XDNA 2 NPU, RDNA 3.5 graphics, and 16% better IPC than Zen 4.
An interview with AMD's Mike Clark, 'Zen Daddy' says 3nm Zen 5 is coming fast
AMD's Mike Clark discusses Zen 5 architecture, covering 4nm and 3nm nodes. 4nm chips launch soon, with 3nm to follow. Zen 'c' cores may integrate into desktop processors. Zen 5 enhances Ryzen CPUs with full AVX-512 acceleration, emphasizing design balance for optimal performance.
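For the value-speculation article linked above, here is a rough, hypothetical sketch of the core trick; the node layout, names, and contiguity assumption are illustrative only, and the linked post uses its own hand-tuned benchmark code rather than this.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical node type: a singly linked list whose nodes are usually laid
// out contiguously in memory (e.g., carved out of a pool in order).
struct Node {
    uint64_t value;
    Node* next;
};

// Plain traversal: each iteration waits for the load of node->next, so the
// loop is limited by load-to-use latency rather than by L1 bandwidth.
uint64_t sum_plain(const Node* node) {
    uint64_t sum = 0;
    for (; node; node = node->next) sum += node->value;
    return sum;
}

// Value speculation: guess that the next node is the physically adjacent one.
// The guess breaks the load->load dependency chain, and the check becomes an
// almost-always-correctly-predicted branch, letting the CPU overlap several
// iterations.  (A production version typically needs care, e.g. an inline-asm
// barrier, to keep the compiler from simplifying the guess away.)
uint64_t sum_speculated(const Node* node) {
    uint64_t sum = 0;
    while (node) {
        sum += node->value;
        const Node* guess = node + 1;     // speculated value of node->next
        const Node* actual = node->next;  // the real pointer, verified below
        node = guess;
        if (actual != guess) node = actual;  // rare when nodes are contiguous
    }
    return sum;
}

int main() {
    // Build a list whose nodes are contiguous, so the speculation holds.
    std::vector<Node> pool(1 << 16);
    for (size_t i = 0; i < pool.size(); ++i) {
        pool[i].value = i;
        pool[i].next = (i + 1 < pool.size()) ? &pool[i + 1] : nullptr;
    }
    std::printf("plain=%llu speculated=%llu\n",
                (unsigned long long)sum_plain(&pool[0]),
                (unsigned long long)sum_speculated(&pool[0]));
}
```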
- Several users express confusion about the specifics of the 2-ahead branch predictor and its functionality.
- There are discussions about the potential for increased core counts and the implications for server performance.
- Concerns are raised about the security vulnerabilities associated with speculative predictors.
- Some users share insights and resources on branch prediction, indicating a desire for deeper understanding.
- Others reflect on the evolution of technology and its impact on performance and efficiency.
Whatever web app scaling issues we had in 2014 could now fit into a single server, assuming we somehow manage to cool the thing. Even at 1 RPS per vCPU that is 1000 RPS, excluding cache hits. Even the HN front page doesn't hit a server at 1000 page views per second.
For example, Z-buffers[1], used by 3D video games. When the idea was first published, it wasn't even the main topic of the paper, just a side note, because it required an expensive amount of memory to run.
Turns out megabytes are quite cheap a few decades later, and every realtime 3D renderer ended up using it.
Cold, warm, warmer, and omit hot as it is the default? Sometimes you would set all branches to be cold except one.
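If the hint levels above refer to source-level branch annotations, a minimal sketch of the "everything cold except one path" pattern, assuming C++20's [[likely]]/[[unlikely]] attributes (which only offer two levels, so "warm"/"warmer" have no direct equivalent), might look like this. GCC and Clang also expose function-level __attribute__((hot)) / __attribute__((cold)), which is probably closer to what profile annotations call "cold".

```cpp
#include <cstdio>

// Sketch: mark every path except the common one as unlikely ("cold") and
// leave the hot path unannotated, as in the comment above.
int classify(int status) {
    if (status == 0) {                      // hot path: left as the default
        return 0;
    } else if (status < 0) [[unlikely]] {   // cold: rare error path
        std::printf("error: %d\n", status);
        return -1;
    } else [[unlikely]] {                   // cold: rare warning path
        std::printf("warning: %d\n", status);
        return 1;
    }
}

int main() {
    return classify(0);
}
```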
Why, when we have a conditional branch, can't we just fetch and prepare instructions for both possible paths and then discard the incorrect one?
Is this that much harder, or are there other reasons that make it not worth it?
Anyway, I think the intuition of this approach is to aggressively fetch/decode instructions that might not already be in the L1 instruction cache / micro-op cache. This is important for x86 (and probably RISC-V) because both have variable instruction lengths, and just by looking at an instruction cache block, the core wouldn't know how to decode it. Both ISAs (x86, RISC-V) require knowing the PC of at least one instruction to start decoding an instruction cache block, so knowing where the application can jump to two blocks ahead helps the core fetch/decode further ahead than the current approach.
This approach is comparable to instruction prefetching, though prefetching does not give the core information about the starting point.
(High-performance ARM cores probably don't suffer from the "finding the starting point" problem because every instruction is 32 bits long, so decoding can be done in parallel without knowing a starting point.)
This approach likely benefits front-end-heavy applications (applications with hot code blocks scattered throughout the binary, e.g., cloud workloads). I wonder if there's any performance benefit/hit for other types of applications.
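As a toy illustration of the "finding the starting point" problem described in the comment above, here is a sketch using an invented length-prefixed encoding, not real x86: the same block of bytes decodes into different instruction streams depending on where decoding is assumed to begin.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy variable-length "ISA": the low two bits of the first byte encode the
// instruction length (1..4 bytes).  As with x86, a block of bytes can only be
// decoded once the position of some instruction start is known; assuming the
// wrong starting offset yields a different (bogus) instruction stream.
static void decode(const std::vector<uint8_t>& block, size_t start) {
    std::printf("assumed start %zu:", start);
    for (size_t pc = start; pc < block.size();) {
        size_t len = (block[pc] & 0x3) + 1;   // 1..4 byte instruction
        if (pc + len > block.size()) break;   // truncated at the block end
        std::printf(" [op=%02x len=%zu]", (unsigned)block[pc], len);
        pc += len;
    }
    std::printf("\n");
}

int main() {
    // The same "cache block" produces two different instruction streams
    // depending on the assumed starting offset.  A fixed 32-bit ISA avoids
    // this: every aligned 4-byte slot is an instruction boundary, so all
    // slots can be decoded in parallel with no starting PC.
    std::vector<uint8_t> block = {0x07, 0x05, 0x01, 0x02, 0x04, 0x06, 0x03, 0x00};
    decode(block, 0);  // instructions at offsets 0, 4, 5
    decode(block, 1);  // a different stream: offsets 1, 3, then a truncated one
}
```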
I moved to an M2 Max from a chunky Zen setup and it's a revelation how much the memory bandwidth improvement accelerates intensive data work. Also for heavy-ish multitasking the Zen setup's narrow memory pipe would often choke.
This sounds like a big boost for hyperthreading performance. My Zen 1 gets about 25 percent faster due to HT. Has anyone tested the newer ones in this regard?