July 12th, 2024

Beating the Compiler

The blog post discusses writing interpreters in hand-written assembly to outperform compiler-generated code. By rewriting the Uxn CPU interpreter in assembly, using techniques inspired by LuaJIT, the author achieved a 10-20% speedup over the reference implementation.

Read original article

In the blog post "Beating the compiler," the author explores the idea of outperforming compilers by writing interpreters in assembly language. Specifically, they discuss their experience in creating a fast interpreter for the Uxn CPU, a stack-based architecture with 256 opcodes. By optimizing the interpreter's implementation in assembly, they achieved a 10-20% speedup over the reference version. The post delves into the details of the assembly code generated by the compiler, highlighting possible inefficiencies and proposing optimizations such as storing critical values in registers and using indirect threading to eliminate the dispatch loop. The author draws inspiration from LuaJIT, known for its speed due to similar assembly-level optimizations. The post concludes with insights into register assignment, indirect threading, and the implementation of multiple opcodes in assembly, totaling around 2400 lines. The approach involves a mix of assembly and Rust code, with a focus on performance enhancements through manual optimization techniques.
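
For orientation, here is a minimal Rust sketch (not the author's code; opcode values such as BRK = 0x00, ADD = 0x18, and LIT = 0x80 are shown for illustration only) of the "reference" interpreter shape the compiler starts from: a central loop that fetches an opcode, looks up its handler, and branches back. The assembly version described in the post instead ends every opcode handler with its own fetch-and-indirect-branch sequence (indirect threading).

  struct Vm {
      pc: usize,
      code: Vec<u8>,
      stack: Vec<u8>,
  }

  fn run(vm: &mut Vm) {
      loop {
          let op = vm.code[vm.pc];
          vm.pc += 1;
          match op {
              0x00 => break, // BRK: stop execution
              0x80 => {
                  // LIT: push the byte that follows the opcode
                  let v = vm.code[vm.pc];
                  vm.pc += 1;
                  vm.stack.push(v);
              }
              0x18 => {
                  // ADD: pop two values, push their wrapping sum
                  let b = vm.stack.pop().unwrap();
                  let a = vm.stack.pop().unwrap();
                  vm.stack.push(a.wrapping_add(b));
              }
              _ => unimplemented!("remaining opcodes omitted from this sketch"),
          }
      }
  }

  fn main() {
      let mut vm = Vm {
          pc: 0,
          code: vec![0x80, 2, 0x80, 3, 0x18, 0x00], // LIT 2, LIT 3, ADD, BRK
          stack: Vec::new(),
      };
      run(&mut vm);
      assert_eq!(vm.stack, vec![5]);
  }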

Related

My experience crafting an interpreter with Rust (2021)

Manuel Cerón details creating an interpreter in Rust after transitioning from Clojure. Leveraging Rust's safety features, he worked through challenges with closures and classes, optimizing the code for performance while preserving safety.

Spending too much time optimizing for loops

Researcher Octave Larose shared insights on optimizing Rust interpreters, focusing on improving performance for the SOM language. By enhancing loop handling and addressing challenges, significant speedups were achieved, balancing code elegance with efficiency.

Integrated assembler improvements in LLVM 19

LLVM 19 brings significant enhancements to the integrated assembler, focusing on the MC library for assembly, disassembly, and object file formats. Performance improvements include optimized fragment sizes, streamlined symbol handling, and simplified expression evaluation. These changes aim to boost performance, reduce memory usage, and lay the groundwork for future enhancements.

Spending too much time optimizing for loops

Researcher Octave Larose discussed optimizing Rust interpreters, focusing on improving performance for the SOM language. They highlighted enhancing loop efficiency through bytecode and primitives, addressing challenges like Rust limitations and complex designs. Despite performance gains, trade-offs between efficiency and code elegance persist.

Do not taunt happy fun branch predictor

The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.

20 comments
By @gumby - 3 months
IMHO the best way to think of it (well, it's how I have long thought of it) is two lemmas:

1 - The compiler has more "fingers" than a human does. Back when I wrote programs in assembly I would use a printout to keep track of what I was doing and for debugging, and so would often put a finger on the paper to mark where a jump was and then go to the destination to see if it matched up, etc. This process isn't at all scalable, so the compiler is inevitably going to do better than I ever could on anything more than something trivial.

2 - I know more about the program constraints than I could ever express (much less be bothered expressing) in a HLL. "All the values will be less than 16" or "I can use this memory location for something else after I've read from it" or who knows what. So sometimes I can chop stuff out or even rewrite the compiler output locally to speed it up.

Also, I can only do this for a few architectures... and with modern x86 there are so many variants with all sorts of micro-optimization affordances that I can rarely beat the compiler (this is a corollary of lemma 1).

By @phire - 3 months
> The dispatch loop takes a single indirect branch to the opcode-specific implementation. This means that the branch will be nigh unpredictable!

Modern branch predictors can actually predict indirect branches with multiple destinations, because they hash recent branch history into the prediction. The exact same indirect branch will end up with multiple BTB entries, based on previous control flow.

I was curious where the threaded code speedup is actually coming from. It's possible much of the speedup is coming from eliminating the extra branch back to the dispatch loop. Or maybe the history tracking in the M1's branch predictor doesn't work well for this type of control flow.

So I checked out this code and modified the next macro, adding a dummy branch over a nop, to roughly isolate this factor.

On my M1, the extra unconditional branch benchmarks at 1.08 sec for fib, and 1.19 sec for Mandelbrot (all other times were the same).

Looks like the speedup is a mixture of the two. Eliminating that extra branch is responsible for about 20-30% of the speedup, and improving prediction on indirect branches is responsible for the other 70-80%.

By @teo_zero - 3 months
I was intrigued by this paragraph:

> Making all of the opcode implementations the same size (padding to the size of the largest opcode implementation with .balign 256), then removing the jump table entirely. This was also slower, also probably because of cache friendliness: the opcode implementations go from 16.6 KiB total to 64 KiB.

Normalizing the opcode implementations' length to the maximum is probably not optimal. A possible experiment would be to normalize to the modal length. I would expect most opcodes to be arithmetic or logic operations and to share the same implementation length. The (hopefully few) more complex ones would need a trampoline.

Is this a crazy idea?

By @interpunct - 3 months
> I've proven to my satisfaction that writing an interpreter in assembly is both fun and performant!

Fun, maybe/maybe not, but define "performant". For a non-toy project I might drop into assembly for a speedup of one or two orders of magnitude. Even then, what are my requirements? What is the bus factor of LuaJIT again? Try getting support for s390x, or getting your patch accepted.

Look at the speedup of Lua v. LuaJIT (C language VM v. C/Lua VM with JIT code gen):

https://web.archive.org/web/20180430211146/https://luajit.or...

By @Validark - 3 months
I wish you wouldn't broadcast the sentiment contained in the first paragraph. Compilers lack the ability to consistently perform many basic optimizations to an embarrassing extent. Including even the ones you would think would be the first optimizations you'd implement when writing a compiler. Open up Godbolt and tell me if you still think the compiler knows best. I try to submit at least one issue to LLVM every time I look at the assembly it gives me. I say "try to" because sometimes it's too much to deal with.

And by the way, probably the absolute worst things the compiler does are the "smart" things. I write my code knowing what the emit should look like, and the compiler sometimes thinks it has a better idea than how I wrote the code and makes the emit a lot more convoluted and slower than it would be if it just did what I said. Actually, I do know the machine decently well, thank you.

Saying "centuries of engineering" is misleading. A lot of those centuries are people arguing about theoretical optimizations of dubious value while we still haven't even finished on the most basic optimizations that are obviously beneficial.

The second paragraph, making it sound all mysterious that compilers are even more awful at interpreters than at normal code, really speaks to a complete lack of exposure to the subject matter. This stuff is only esoteric because you couldn't be bothered to look at godbolt.org. Which is totally fine by the way, just don't go telling me how high quality compilers are if you've never even looked.

That would be like me telling web developers that Dreamweaver produces phenomenal HTML and CSS, without ever looking at what it emits.

Sorry for the rant but it just bothers me how much praise is heaped upon compilers like they are some kind of gift from God, forged by monks, wizards, and unrivaled geniuses. A compiler is just a tool. There is no magic.

By @201984 - 3 months
Very cool! I enjoy seeing people writing asm, and letting us get the most out of our CPUs.

I see you already tried what I thought of, which is getting rid of the jump table and making each instruction handler the same size. Do you think that could still work if you limited each instruction handler to 64 or 32 bytes instead of 256, and then for longer handlers jumped to a larger body of code somewhere else?

By @jll29 - 3 months
This was a good read, thanks.

Instead of using unsafe Rust to optimize the interpreter loop, I would prefer to write a transpiler that compiles UXN binaries to native binaries for the local CPU. That would not require making the code unsafe or less readable, and it would permit further speedups by getting rid of the interpreter loop altogether.
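
Roughly, the idea is to translate each instruction into straight-line host code ahead of time, so no dispatch remains at run time. A toy sketch in Rust (nothing like a complete Uxn transpiler; the "native code" here is just generated Rust source, and the opcode values are illustrative):

  fn transpile(code: &[u8]) -> String {
      let mut out = String::from("fn program(stack: &mut Vec<u8>) {\n");
      let mut i = 0;
      while i < code.len() {
          match code[i] {
              0x80 => {
                  // LIT: becomes a constant push in the generated code
                  out.push_str(&format!("    stack.push({});\n", code[i + 1]));
                  i += 2;
              }
              0x18 => {
                  // ADD: becomes an inline pop/pop/push sequence
                  out.push_str(concat!(
                      "    { let b = stack.pop().unwrap();",
                      " let a = stack.pop().unwrap();",
                      " stack.push(a.wrapping_add(b)); }\n",
                  ));
                  i += 1;
              }
              0x00 => break, // BRK: end of the program
              op => panic!("opcode {op:#04x} not covered by this sketch"),
          }
      }
      out.push_str("}\n");
      out
  }

  fn main() {
      // LIT 2, LIT 3, ADD, BRK
      println!("{}", transpile(&[0x80, 2, 0x80, 3, 0x18, 0x00]));
  }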

By @JackYoustra - 3 months
Did you try PGO? This post seems like the thing it was built for.
By @o11c - 3 months
> Making all of the opcode implementations the same size (padding to the size of the largest opcode implementation with .balign 256), then removing the jump table entirely.

Another idea would be to sort the opcodes by size (32-byte, 64-byte, 128-byte, 256-byte - or perhaps +1 to avoid the set-associativity problem below) rather than simply indexing.

> This was also slower, also probably because of cache friendliness: the opcode implementations go from 16.6 KiB total to 64 KiB.

This is probably cache-unfriendly due to alignment too (associativity limits mean more conflict misses for highly-aligned memory), not just size. Most documentation only talks about this problem regarding the data cache though ...

By @the8472 - 3 months

      0x100002ec0: b 0x100002d1c       ; jump back to the dispatch loop
I'd expect tail-call-based control flow - i.e. computing the jump destination at the end of each opcode function - to be nicer to the branch predictor because it could track targets separately.

Currently that's hard to achieve in Rust. There's an experiment in progress to see if guaranteed tail calls can be added to the language:

https://github.com/rust-lang/rust/issues/112788
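
For illustration, a minimal sketch (hypothetical names and opcodes, not the article's code) of the tail-call shape: every handler ends by fetching the next opcode and calling its handler, so each call site carries its own branch-predictor history. Without guaranteed tail calls the recursion can grow the stack on long programs; the experiment linked above would turn these calls into plain jumps.

  struct Vm {
      pc: usize,
      code: Vec<u8>,
      acc: u64,
  }

  type Handler = fn(&mut Vm);

  // Opcode 0 halts, 1 increments the accumulator, 2 doubles it.
  const TABLE: [Handler; 3] = [op_halt, op_inc, op_dbl];

  fn dispatch(vm: &mut Vm) {
      let op = vm.code[vm.pc] as usize;
      vm.pc += 1;
      TABLE[op](vm); // tail position: ideally a single indirect jump
  }

  fn op_halt(_vm: &mut Vm) {}

  fn op_inc(vm: &mut Vm) {
      vm.acc += 1;
      dispatch(vm); // re-dispatch here instead of returning to a central loop
  }

  fn op_dbl(vm: &mut Vm) {
      vm.acc *= 2;
      dispatch(vm);
  }

  fn main() {
      let mut vm = Vm { pc: 0, code: vec![1, 1, 2, 0], acc: 0 };
      dispatch(&mut vm);
      assert_eq!(vm.acc, 4); // (0 + 1 + 1) * 2
  }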

By @rerdavies - 3 months
It would be interesting to see what happens if you turn on profile-guided optimization. This seems like a very good application for PGO.

The 518ms profile result for the branch-table load is very peculiar. To my eyes, it suggests that the branch-table load is incurring cache misses at a furious rate. But I can't honestly think of why that would be. On a Pi-4 everything should fit in L2 comfortably. Are you using a low-performance ARM processor like an A53?

By @kosolam - 3 months
I would love to understand this uxn thing better. Is it some academic experiment, or does it have real-world applications? What’s the deal with the additional return stack?
By @256_ - 3 months
Why does making the FETCH part of the cycle a macro make it faster? Surely the branch predictor is fine with unconditional immediate branches. What am I missing here?

Also, it has jump-to-register immediately after the instruction that sets that register. Wouldn't it be faster if it went like:

  get jmp address
  execute VM opcode
  jmp to next instruction
So the pipeline can fetch it ahead of time?
By @projektfu - 3 months
Good article; it brings the data to back it up.

Unfortunately, it was hard to read with the monokai.css theme because comments were nearly invisible, and a lot of your information was in comments. Changing the color from #75715e to #95917e did the trick. I guess Monokai is for programmers who never read comments.

By @tempodox - 3 months
I find this strange:

  &mut h as *mut _ as *mut _
What is going on here?
By @0x3444ac53 - 3 months
I was overjoyed to open this and find a reference to 100r and UXN
By @lilyball - 3 months
Does the compiler do anything differently if you stick to Rust but change the dispatch implementation to be an #[inline] fn next() that you put at the end of each opcode?
By @ajbt200128 - 3 months
> // SAFETY: do you trust me?

No.

Have you seen the safer-ffi crate? Then you won't have to commit the deadly sin of writing unsafe Rust.