Do not taunt happy fun branch predictor
The author walks through optimizing an AArch64 assembly loop by removing jumps. A mismatched bl/ret pair confused the branch predictor; replacing ret with br x30 fixed the regression, and further inlining and SIMD work left the final program about 8.8x faster than the starting point. Considerations around branch prediction and SIMD instructions are discussed.
The blog post discusses the author's experience with optimizing AArch64 assembly code by reducing the number of jumps in a loop. Initially, the author attempted to optimize the code by eliminating one jump but found that it actually slowed down the program. This unexpected result was attributed to the branch predictor being confused by mismatched bl/ret pairs. By replacing ret with br x30, the performance regression was fixed, resulting in faster execution. The post delves into the technical details of branch prediction and provides insights into optimizing assembly code for speed, including inlining functions and utilizing SIMD instructions. Benchmark results show significant speed improvements with each optimization step, culminating in a final program running approximately 8.8 times faster than the initial version. The post concludes by highlighting the trade-offs of using SIMD instructions for performance and the importance of considering the order of operations in floating-point addition for accurate results.
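The crux is that `ret` and `br x30` are architecturally the same indirect jump, but `ret` also hints the predictor to pop its internal return-address stack. A minimal sketch of the mismatch, assuming the shape described above (the labels and loop body are illustrative, not the post's exact code):

    adr x30, next       // aim the "return address" by hand
    b   sum             // plain branch: pushes nothing onto the
                        // return-address stack
sum:
    // ... accumulate ...
    br  x30             // jumps exactly where ret would, but without
                        // popping the (empty) predictor stack
next:

With `ret` in place of `br x30`, the predictor pops a return-address-stack entry that no `bl` ever pushed, and later real returns mispredict.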
Related
Own Constant Folder in C/C++
Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.
Microbenchmarking Return Address Branch Prediction (2018)
Modern processors use branch predictors like RAS to boost performance by predicting control flow. Microbenchmarking on Intel and AMD processors reveals RAS behavior, accuracy, and limitations, emphasizing accurate branch prediction for high performance.
Spending too much time optimizing for loops
Researcher Octave Larose shared insights on optimizing Rust interpreters, focusing on improving performance for the SOM language. By enhancing loop handling and addressing challenges, significant speedups were achieved, balancing code elegance with efficiency.
Integrated assembler improvements in LLVM 19
LLVM 19 brings significant enhancements to the integrated assembler, focusing on the MC library for assembly, disassembly, and object file formats. Performance improvements include optimized fragment sizes, streamlined symbol handling, and simplified expression evaluation. These changes aim to boost performance, reduce memory usage, and lay the groundwork for future enhancements.
The optimized code at the end takes 94 nanoseconds to sum an array of 1024 32-bit floating point numbers.
In 94 nanoseconds, our old friend the 1 MHz 6502 would be just starting to consider signaling the memory chips that maybe they ought to try digging up the first byte of the first instruction in the program. (A single clock cycle at 1 MHz is a full microsecond, more than ten times longer.)
Worth mentioning: that code is entirely dependent on running in cache. Otherwise even the mighty M1 Max in the post would still be stuck waiting on that first memory fetch. DRAM is slow. :-)
> Does this matter for your use case? Only you can know!
Anything is possible of course, and the typical way of writing a loop to sum up an array is literally telling the computer to go element by element and accumulate the values.
But it seems pretty unlikely that doing, say, four accumulations in parallel with simd and then summing them up at the end is any more wrong than going element by element.
Summing floats should by default be taken to have error bounds, and any answer within those bounds is valid. If you know something special about the floats you are feeding in, the language should give you some way to encode that explicitly. The most basic loop shouldn't be how you express that; the basic loop is the default case, so it should give the best performance.
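As a sketch of what that reassociation looks like on AArch64 (four running sums in one vector register, combined only at the end; this assumes the count is a multiple of four and is illustrative, not the post's exact code):

    movi  v0.4s, #0             // four partial sums, all zero
loop:
    ldr   q1, [x0], #16         // load four floats, post-increment pointer
    fadd  v0.4s, v0.4s, v1.4s   // lane-wise accumulate
    subs  x1, x1, #4            // four elements consumed per iteration
    b.ne  loop
    faddp v0.4s, v0.4s, v0.4s   // pairwise add: {0+1, 2+3, 0+1, 2+3}
    faddp s0, v0.2s             // combine the two partials into s0

Each lane accumulates error in a different order than a strict left-to-right pass, which is exactly the reordering being debated.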
Literally just reading this line made me go "well, there's your problem." I thought this was going to be something deep about fancy branch predictor heuristics, but it turns out to be a violation of basic heuristics.
Folks, don't think you're going to get amazing speedups by using mismatched call/ret instructions. The branch predictor keeping a shadow stack of return addresses is something that's been around for decades.
OP wrote "Looking at the assembly, it's doing some loop unrolling", linking to https://godbolt.org/z/Kv77abW6c, which uses "Rust Nightly", a moving target. By now, there's more loop unrolling.
Loop unrolling started with Rust 1.59: https://godbolt.org/z/5PTnWrWf7
Update: The code on GitHub shows they were using Rust version 1.67.0-nightly, 2022-11-27.
const float f = *data++;
ldr s1, [x0], #4
Turns out this instruction loads and increments x0 by 4 at the same time.
It looks like you can use negative values too, so you could iterate over something in reverse. Kind of cool. I don't think x86_64 has a single instruction that can load and increment in one go.
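For instance (a hypothetical snippet, not from the post), a negative immediate steps the pointer backwards:

    ldr s1, [x0], #-4   // load the float at x0, then move x0 back by 4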
    stp  x29, x30, [sp, #-16]!  // save frame pointer and link register
    fmov s0, wzr                // acc = 0.0 (0.0 isn't an encodable fmov immediate)
    cbz  x1, exit               // skip the loop entirely if x1 == 0
loop:
    ldr  s1, [x0], #4           // load next float, post-increment x0 by 4
    bl   foo
    subs x1, x1, #1             // decrement x1; if x1 != 0, repeat the loop
    b.ne loop
    // Fall through to exit after the loop finishes.
exit:
    ldp  x29, x30, [sp], #16    // restore frame pointer and link register
    ret
foo:
    // ...
    ret
Going further, you could just inline foo and omit the RET instruction without doing any trickery with mismatched BL and RET instructions. I haven't benchmarked this, though, so I don't actually know how much it speeds the code up.
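For illustration, if foo's body were a single accumulate (an assumption; its body is elided above), the fully inlined loop would collapse to something like:

loop:
    ldr  s1, [x0], #4   // load next float, post-increment pointer
    fadd s0, s0, s1     // foo's assumed body, inlined: acc += f
    subs x1, x1, #1     // decrement count
    b.ne loop           // repeat until count hits zero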
https://scholar.harvard.edu/files/mickens/files/theslowwinte...
Perhaps that's old hat and not even allowed any more, but I thought it was pretty clever when I first read about it.
I'm not sure this is a good design. Those two opcodes perform the same logic function but have different branch prediction hints.
Meanwhile, there are other cases where an existing instruction is reinterpreted with new semantics:
* On x86, `XCHG eax, eax` is `NOP`.
* On x86, `XOR reg, reg` is `MOV reg, 0` and breaks dependencies for the purposes of register renaming.
* On x86, various examples of macro-op fusion and micro-op fusion.
* On various RISC architectures (where r0 is hardwired to zero), `ADD r1, r0, 1234` is `MOV r1, 1234`.
* On some RISC architectures, conditionally branching if r0 == 0 is the only way to express an unconditional branch.
I see no reason why `BR LR` can't be the standard function return instruction and involve the branch predictor.
Cursed code. People might say "oh, that's normal idiomatic C"; I don't care. I am glad Go does not allow code like this.