June 23rd, 2024

Microbenchmarking Return Address Branch Prediction (2018)

Modern processors use branch predictors such as the return address stack (RAS) to predict control flow and improve performance. Microbenchmarks on Intel and AMD processors reveal the RAS's behavior, accuracy, and limitations, and show why accurate branch prediction matters for high performance.


Modern processors use branch predictors such as the Return Address Stack (RAS) to improve performance by predicting a program's control flow. This article microbenchmarks RAS behavior across several Intel and AMD microarchitectures. It examines how accurately returns are predicted compared with other indirect branches, how many entries the RAS holds, and how it behaves after a pipeline flush. Notably, AMD Bulldozer exhibits peculiar RAS behavior, while Intel's return stack buffers (RSBs) behave like circular arrays. The microbenchmarks also measure the maximum number of speculative branches and calls in flight, revealing limits that differ between architectures.

The article stresses that accurate branch prediction is essential for high performance in deeply pipelined processors, and focuses on the specialized predictor for function return instructions. It classifies branch instructions by whether they are conditional and whether they are direct, and explains why indirect branches such as function returns are hard to predict. It shows how properly matched calls and returns affect prediction accuracy, and what happens with misaligned returns and incorrect return targets. Special cases such as CALL +0 are also examined to show how processors handle them.
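The circular-array behavior mentioned above can be illustrated with a toy model. This is only a sketch, not the article's benchmark code: the class `ReturnAddressStack`, the helper `count_mispredictions`, the capacity of 16 entries, and the addresses are all assumptions chosen for illustration. Real hardware has details this model ignores, such as speculation and recovery after pipeline flushes.

```python
class ReturnAddressStack:
    """Toy RAS modelled as a fixed-size circular array.

    Hypothetical model: pushing past the capacity silently overwrites
    the oldest entry instead of raising an error.
    """

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.top = 0  # logical index of the next slot to write

    def push(self, return_address):
        self.slots[self.top % self.capacity] = return_address
        self.top += 1

    def pop(self):
        self.top -= 1
        return self.slots[self.top % self.capacity]


def count_mispredictions(depth, capacity=16):
    """Nest `depth` calls, then return from all of them.

    Counts how many returns the toy RAS predicts incorrectly when
    compared against the real (architectural) stack.
    """
    ras = ReturnAddressStack(capacity)
    real_stack = []
    for level in range(depth):
        return_address = 0x1000 + level  # made-up address per call site
        real_stack.append(return_address)
        ras.push(return_address)

    mispredicted = 0
    for _ in range(depth):
        actual = real_stack.pop()
        predicted = ras.pop()
        if predicted != actual:
            mispredicted += 1
    return mispredicted


if __name__ == "__main__":
    for depth in (8, 16, 20, 40):
        print(depth, count_mispredictions(depth))
```

In this model, call chains no deeper than the capacity are predicted perfectly, and each extra level of nesting costs one misprediction on the way back out, because the innermost calls have overwritten the entries for the outermost ones.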

Related

Exploring How Cache Memory Works

Cache memory, crucial for programmers, stores data inside the CPU for quick access, bridging the CPU-RAM speed gap. Different cache levels vary in speed and capacity, optimizing performance and efficiency.

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

Memory Model: The Hard Bits

This chapter explores OCaml's memory model, emphasizing relaxed memory aspects, compiler optimizations, weakly consistent memory, and DRF-SC guarantee. It clarifies data races, memory classifications, and simplifies reasoning for programmers. Examples highlight data race scenarios and atomicity.

How to Design an ISA

The article explores designing Instruction Set Architectures (ISAs), focusing on RISC-V's rise. David Chisnall highlights ISA's role as a bridge between compilers and microarchitecture, emphasizing the challenges and importance of a well-designed ISA for optimal performance in various computing environments.

How GCC and Clang handle statically known undefined behaviour

Discussion on compilers handling statically known undefined behavior (UB) in C code reveals insights into optimizations. Compilers like gcc and clang optimize based on undefined language semantics, potentially crashing programs or ignoring problematic code. UB avoidance is crucial for program predictability and security. Compilers differ in handling UB, with gcc and clang showing variations in crash behavior and warnings. LLVM's 'poison' values allow optimizations despite UB, reflecting diverse compiler approaches. Compiler responses to UB are subjective, influenced by developers and user requirements.

2 comments
By @rep_lodsb - 7 months
>Special case: CALL +0 is not a call

There still seems to be a lot of documentation out there that says you should never pop the return address, and should instead call a "proper" function that reads it from the stack before returning normally.

I wonder if now that recent-ish processors treat CALL +0 as a special case, there is instead a performance bug when not popping the return address, with code like this:

    do_something_twice:
        call  do_something_once
        ;fall through to run next piece of code again
    do_something_once:
        ;...
        ret

Both of these uses of CALL would never appear in compiler output, and are probably not common in handwritten assembly either, especially from people who are aware of the "common wisdom" of how to get the instruction pointer. But there must have been at least one high-profile piece of code where this had a real performance impact, or Intel/AMD wouldn't have bothered to optimize this?
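
The concern raised here can be made concrete with a toy model. In the snippet above, `call do_something_once` targets the very next instruction, so it is itself a CALL +0. The sketch below is a model of the hypothesis, not a measurement of any real processor: the function `run_fall_through_pattern`, the flag `call_plus_zero_pushes`, and the label strings are all hypothetical names used for illustration.

```python
def run_fall_through_pattern(call_plus_zero_pushes):
    """Simulate the do_something_twice pattern and count return mispredictions.

    call_plus_zero_pushes: assumption about the predictor. True means
    CALL +0 pushes onto the RAS like any other call; False means the
    predictor treats CALL +0 as "not a call" and skips the push.
    """
    ras = []         # predictor's return address stack (toy, unbounded)
    real_stack = []  # architectural stack holding the true return addresses

    caller_return = "after_call_in_caller"
    fall_through = "do_something_once"

    # The caller executes an ordinary call to do_something_twice.
    real_stack.append(caller_return)
    ras.append(caller_return)

    # do_something_twice executes `call do_something_once`, which is a CALL +0.
    real_stack.append(fall_through)
    if call_plus_zero_pushes:
        ras.append(fall_through)

    mispredictions = 0
    for _ in range(2):  # `ret` executes twice: once per pass through the body
        actual = real_stack.pop()
        predicted = ras.pop() if ras else None  # empty RAS: no prediction
        if predicted != actual:
            mispredictions += 1
    return mispredictions


if __name__ == "__main__":
    print("CALL +0 pushes:", run_fall_through_pattern(True))
    print("CALL +0 skipped:", run_fall_through_pattern(False))
```

Under this model, skipping the push for CALL +0 leaves the RAS one entry short, so both returns are mispredicted. Whether real processors behave this way for a CALL +0 that is paired with a `ret` would need to be measured.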
By @camkego - 7 months
Too bad the source code from the blog post cannot be accessed. It returns “forbidden permission”.