July 3rd, 2024

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

Read original article

The blog post discusses the author's experience with optimizing AArch64 assembly code by reducing the number of jumps in a loop. Initially, the author attempted to optimize the code by eliminating one jump but found that it actually slowed down the program. This unexpected result was attributed to the branch predictor being confused by mismatched bl/ret pairs. By replacing ret with br x30, the performance regression was fixed, resulting in faster execution. The post delves into the technical details of branch prediction and provides insights into optimizing assembly code for speed, including inlining functions and utilizing SIMD instructions. Benchmark results show significant speed improvements with each optimization step, culminating in a final program running approximately 8.8 times faster than the initial version. The post concludes by highlighting the trade-offs of using SIMD instructions for performance and the importance of considering the order of operations in floating-point addition for accurate results.

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

Microbenchmarking Return Address Branch Prediction (2018)

Modern processors use branch predictors like RAS to boost performance by predicting control flow. Microbenchmarking on Intel and AMD processors reveals RAS behavior, accuracy, and limitations, emphasizing accurate branch prediction for high performance.

Spending too much time optimizing for loops

Researcher Octave Larose shared insights on optimizing Rust interpreters, focusing on improving performance for the SOM language. By enhancing loop handling and addressing challenges, significant speedups were achieved, balancing code elegance with efficiency.

Integrated assembler improvements in LLVM 19

LLVM 19 brings significant enhancements to the integrated assembler, focusing on the MC library for assembly, disassembly, and object file formats. Performance improvements include optimized fragment sizes, streamlined symbol handling, and simplified expression evaluation. These changes aim to boost performance, reduce memory usage, and lay the groundwork for future enhancements.

Spending too much time optimizing for loops

Researcher Octave Larose discussed optimizing Rust interpreters, focusing on improving performance for the SOM language. They highlighted enhancing loop efficiency through bytecode and primitives, addressing challenges like Rust limitations and complex designs. Despite performance gains, trade-offs between efficiency and code elegance persist.

21 comments

By @allenrb - 10 months

Was summarizing this article for a group of friends who largely met during the Apple II days and wanted to repost a bit of that here:

The optimized code at the end takes 94 nanoseconds to sum an array of 1024 32-bit floating point numbers.

In 94 nanoseconds, our old friend the 1 MHz 6502, would be just starting to consider signaling the memory chips that maybe they ought to try digging up the first byte of the first instruction in the program.

Worth mentioning, that code is entirely dependent on running in cache. Otherwise even the mighty M1 Max in the post would still be stuck waiting on that first memory fetch. DRAM is slow. :-)

By @AshamedCaptain - 10 months

The same thing was covered by Raymond Chen almost 2 decades ago: https://devblogs.microsoft.com/oldnewthing/20041216-00/?p=36...

By @bee_rider - 10 months

> The SIMD code does come with one asterisk, though: because floating-point addition is not associative, and it performs the summation in a different order, it may not get the same result as straight-line code. In retrospect, this is likely why the compiler doesn't generate SIMD instructions to compute the sum!

> Does this matter for your use case? Only you can know!

Anything is possible of course, and the typical way of writing a loop to sum up an array is literally telling the computer to go element by element and accumulate the values.

But it seems pretty unlikely that doing, say, four accumulations in parallel with simd and then summing them up at the end is any more wrong than going element by element.

Summing floats should by default be taken to have error bounds, and any answer in those bounds is valid. If you know something special about the floats you are inputting, the language should have some means to explicitly encode that. It shouldn’t be the most basic loop, that’s the default case, so it should give the best performance.

By @jcranmer - 10 months

> After checking for loop termination, we fall through into the foo function (without a branch!)

Literally just reading this line made me go "well, there's your problem." I thought this was going to be something deep about fancy branch predictor heuristics, but it turns out to be a violation of basic heuristics.

Folks, don't think you're going to get amazing speedups by using mismatched call/ret instructions. The branch predictor keeping a shadow stack of return addresses is something that's been around for decades.

By @mrinterweb - 10 months

Classic SNL reference: "Do not taunt happy fun ball"

https://www.youtube.com/watch?v=GmqeZl8OI2M

By @croemer - 10 months

Don't miss the 2023! The article really is already a little outdated: since Rust 1.78, the compiler uses some more aggressive loop unrolling (and a little bit of SIMD): https://godbolt.org/z/zhbobW7rr

OP wrote: "Looking at the assembly, it's doing some loop unrolling.", linking to https://godbolt.org/z/Kv77abW6c, which is using "Rust Nightly", a moving target. By now, there's more loop unrolling.

Loop unrolling started with Rust 1.59: https://godbolt.org/z/5PTnWrWf7

Update: The code on Github shows they were using Rust version 1.67.0-nightly, 2022-11-27.

By @CrazyStat - 10 months

(2023)

Discussion at the time: https://news.ycombinator.com/item?id=34520498

By @jcul - 10 months

I'm not super familiar with ARM / ARM64 assembly and was confused as to how x0 was incremented. Was going to ask here, but decided to not be lazy and just look it up.

  const float f = *data++;


  ldr s1, [x0], #4

Turns out this instruction loads and increments x0 by 4 at the same time. It looks like you can use negative values too, so could iterate over something in reverse.

Kind of cool, I don't think x86_64 has a single instruction that can load and increment in one go.

By @monitron - 10 months

This was a lovely article. I only wish the author wouldn’t have kept switching units (between µs and ns) making it hard to scan the tables for comparison.

By @trealira - 10 months

I'm surprised they didn't try something less clever to optimize their code first. The assembly code could be rewritten like this, so the test requires one branch at the bottom of the loop instead of two (one to jump to the top and one to branch conditionally), as well as one instead of two ALU operations per loop for X1 (one subtraction to compare X1 to zero, and one to decrement X1, as opposed to using the result of the latter).

      stp   x29, x30, [sp, #-16]!
      fmov s0, #0.0

      cbnz x1, exit // Skip loop if x1 == 0.
  loop:
      ldr s1, [x0], #4
  
      bl foo
      // Decrement x1. If x1 != 0, repeat the loop.
      subs x1, x1, #1
      b.ne loop
      // Fall through to exit after the loop finishes.
  exit:
      ldp   x29, x30, [sp], #16
      ret
  
  foo:
      // ...
      ret

Going further, you could just inline foo and omit the RET instruction without doing any trickery with mismatched BL and RET instructions.

I haven't benchmarked this, though, so I don't actually know how much this speeds the code up.

By @orbital-decay - 10 months

Tangential question, perhaps a dumb one: why do most hardware schedulers try to predict the load on their own, instead of letting users explicitly signal their intent? It's everywhere, from CPUs to storage.

By @Taniwha - 10 months

I was working on an x86 clone back in the 90s (sadly never saw the light of day), one of the things I did early on was to write code to emulate various potential microarchitectural designs by capturing long streams of real world instructions - our real bugbear was one of the C++ compilers (used to compile a popular web browser from memory) that did virtual method dispatch by pushing an address onto the stack and returning to it - broke return predictors every time

By @samatman - 10 months

Everything I know, or need to know, about branch prediction, I learned from James Mickens.

https://scholar.harvard.edu/files/mickens/files/theslowwinte...

By @michaelcampbell - 10 months

<tangent> IIRC, one of the ways the 6502 would allow jumps past its normal boundaries was to PUSH the out of bounds address you wanted to go to onto whatever register then do the asm "return from subroutine" opcode.

Perhaps that's old hat and not even allowed any more, but I thought it was pretty clever when I first read about it.

By @nayuki - 10 months

> Why do we need a special function return instruction? Functionally, BR LR would do the same job as RET. Using RET tells the processor that this is a function return.

I'm not sure this is a good design. Those two opcodes perform the same logic function but have different branch prediction hints.

Meanwhile, there are other cases where an existing instruction is reinterpreted with new semantics:

* On x86, `XCHG eax, eax` is `NOP`.

* On x86, `XOR reg, reg` is `MOV reg, 0` and breaks dependencies for the purposes of register renaming.

* On x86, various examples of macro-op fusion and micro-op fusion.

* On various RISC architectures, `ADD r1, r0, 1234` is `MOV r1, 1234`.

* On some RISC architectures, conditionally branching if r0 == 0 is the only way to express an unconditional branch.

I see no reason why `BR LR` can't be the standard function return instruction and involve the branch predictor.

By @quotemstr - 10 months

Yeah. This effect is why, also, using a bare CALL without a RET ends up being a slow way to get the program counter on 32-bit x86, which lacks native PC-relative addressing: it confuses the CPU's tracking of calls and returns. Basically, you always want to balance them. No exceptions.

By @michaelcampbell - 10 months

ok, how many people here under 45 years old actually GET the "taunt happy fun" reference?

By @petsounds - 10 months

The title of this article is a reference to "Happy Fun Ball", my favorite SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M

By @38 - 10 months

> const float f = *data++;

cursed code. people might say "oh that's normal idiomatic C", I don't care. I am glad Go does not allow code like this

Do not taunt happy fun branch predictor

Related

Own Constant Folder in C/C++

Microbenchmarking Return Address Branch Prediction (2018)

Spending too much time optimizing for loops

Integrated assembler improvements in LLVM 19

Spending too much time optimizing for loops

Related

Own Constant Folder in C/C++

Microbenchmarking Return Address Branch Prediction (2018)

Spending too much time optimizing for loops

Integrated assembler improvements in LLVM 19

Spending too much time optimizing for loops