July 1st, 2024

Spending too much time optimizing for loops

Researcher Octave Larose discussed optimizing interpreters written in Rust, focusing on improving performance for the SOM language. They described speeding up loops through specialized bytecode and primitives while contending with challenges like Rust's limitations and complex interpreter designs. Despite the performance gains, trade-offs between efficiency and code elegance persist.

Octave Larose, a researcher, shared insights on optimizing programming language interpreters written in Rust. Comparing AST and bytecode interpreters, they found that AST interpreters performed surprisingly well, contrary to common belief. Larose focused on improving the performance of Rust interpreters for the SOM language, in particular by optimizing loops such as the 'to:do:' method with specialized bytecode or primitives. Despite obstacles like Rust's limitations and complex interpreter designs, optimizing loop handling yielded significant performance improvements. However, some optimizations introduced problems such as stack errors, which required workarounds like adding flags to frames. While these workarounds improved performance, they also added complexity and caused slowdowns in some cases. Larose acknowledged the trade-offs between performance gains and code elegance, emphasizing that efforts to improve interpreter efficiency are ongoing.
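To make the idea concrete, here is a minimal sketch of what a specialized `to:do:` bytecode might look like in a Rust bytecode interpreter. All names here (`Bytecode`, `Value`, `run_block`) are invented for illustration and do not reflect the actual interpreter's internals; the point is only that the counter loop runs as a native Rust `for`, instead of activating the loop body through generic message dispatch on every iteration.

```rust
// Hypothetical sketch of a specialized `to:do:` bytecode; all names
// are invented for illustration and do not match the real interpreter.

enum Bytecode {
    PushInt(i64),
    // Specialized instruction: pops `end` and `start`, then runs the
    // loop-body block once per integer in native Rust, rather than
    // sending `to:do:` and dispatching a block activation message
    // on every iteration.
    ToDoLoop { body_block: usize },
    Halt,
}

#[derive(Clone, Copy, Debug)]
enum Value {
    Int(i64),
}

fn run(code: &[Bytecode], blocks: &[Vec<Bytecode>]) {
    let mut stack: Vec<Value> = Vec::new();
    for instr in code {
        match instr {
            Bytecode::PushInt(n) => stack.push(Value::Int(*n)),
            Bytecode::ToDoLoop { body_block } => {
                let Some(Value::Int(end)) = stack.pop() else {
                    panic!("to:do: expects an integer upper bound")
                };
                let Some(Value::Int(start)) = stack.pop() else {
                    panic!("to:do: expects an integer lower bound")
                };
                for i in start..=end {
                    // A real interpreter would push a frame here and
                    // bind `i` as the block's argument.
                    run_block(&blocks[*body_block], i);
                }
            }
            Bytecode::Halt => return,
        }
    }
}

fn run_block(block: &[Bytecode], arg: i64) {
    // Placeholder for executing the block's own bytecode with `arg`
    // bound as its first argument.
    let _ = (block, arg);
}
```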

Related

My experience crafting an interpreter with Rust (2021)

Manuel Cerón details creating an interpreter with Rust, transitioning from Clojure. Leveraging Rust's safety features, he faced challenges with closures and classes, optimizing code for performance while balancing safety.

Optimizing the Roc parser/compiler with data-oriented design

The blog post explores optimizing a parser/compiler with data-oriented design (DoD), comparing Array of Structs and Struct of Arrays for improved performance through memory efficiency and cache utilization. Restructuring data in the Roc compiler showcases enhanced efficiency and performance gains.
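As a generic illustration of the AoS-versus-SoA distinction (these types are invented and are not the Roc compiler's actual data structures):

```rust
// Array-of-Structs: each node's fields are interleaved in memory, so a
// pass that only reads `kind` still drags the other fields through the
// cache.
struct NodeAos {
    kind: u8,
    start: u32,
    len: u32,
}

// Struct-of-Arrays: each field lives in its own contiguous array, so a
// pass that scans only `kinds` touches a dense, cache-friendly buffer.
struct NodesSoa {
    kinds: Vec<u8>,
    starts: Vec<u32>,
    lens: Vec<u32>,
}

fn count_kind_aos(nodes: &[NodeAos], kind: u8) -> usize {
    nodes.iter().filter(|n| n.kind == kind).count()
}

fn count_kind_soa(nodes: &NodesSoa, kind: u8) -> usize {
    nodes.kinds.iter().filter(|&&k| k == kind).count()
}
```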

Spending too much time optimizing for loops

Researcher Octave Larose shared insights on optimizing Rust interpreters, focusing on improving performance for the SOM language. By enhancing loop handling and addressing challenges, significant speedups were achieved, balancing code elegance with efficiency.

Using SIMD for Parallel Processing in Rust

SIMD is vital for performance in Rust. Options include auto-vectorization, platform-specific intrinsics, and std::simd module. Balancing performance, portability, and ease of use is key. Leveraging auto-vectorization and intrinsics optimizes Rust projects for high-performance computing, multimedia, systems programming, and cryptography.
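A rough sketch of the first two options (function names are illustrative; note that `std::simd` is still nightly-only at the time of writing, behind the `portable_simd` feature):

```rust
// 1) Auto-vectorization: a plain elementwise loop over slices, which
//    LLVM can usually vectorize on its own with optimizations enabled.
pub fn add_auto(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

// 2) Explicit portable SIMD via std::simd (nightly toolchain only):
//
//    #![feature(portable_simd)]
//    use std::simd::f32x4;
//
//    pub fn add_simd(a: f32x4, b: f32x4) -> f32x4 {
//        a + b // one vector add across four lanes
//    }
```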

I Probably Hate Writing Code in Your Favorite Language

The author critiques popular programming languages like Python and Java, favoring Elixir and Haskell for immutability and functional programming benefits. They emphasize personal language preferences for hobby projects, not sparking conflict.

8 comments
By @gizmo - 4 months
What this blog post gets at, indirectly, is optimizing "zero-cost abstractions". In C++ there are many constructs that, translated directly to machine code, would incur a severe performance penalty, but clever compilers can remove this overhead. (In simple programs, anyway; when zero-cost abstractions get stacked on top of each other, compilers give up and performance craters.)

In languages like Smalltalk, Everything Is An Object and instead of function calls, messages are sent between objects. It's very elegant. It's up to the compiler to figure out that "1 + 1" doesn't actually require object allocations and a `+` message dispatch; instead, all these abstractions can be stripped away so you end up with a single machine instruction for the addition. In practice this is absolutely hopeless, and the compiler will not be able to transform code that is dynamic in theory but static in practice into correct and efficient machine code.
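In interpreter terms, the gap being described looks roughly like this (a generic Rust sketch with invented types, not any particular Smalltalk VM): the fast path adds two unboxed integers directly, while the semantics nominally require the slow path's full method lookup every time.

```rust
#[derive(Clone, Copy)]
enum Value {
    Int(i64),
    Object(u32), // handle into an object heap; details elided
}

fn send_plus(recv: Value, arg: Value) -> Value {
    match (recv, arg) {
        // Fast path: "1 + 1" never touches the message machinery.
        // This is the stripped-away form a compiler hopes to prove safe.
        (Value::Int(a), Value::Int(b)) => Value::Int(a.wrapping_add(b)),
        // Slow path: full method lookup and dispatch on the receiver,
        // which is what the semantics nominally require every time.
        _ => dispatch_message(recv, "+", arg),
    }
}

fn dispatch_message(recv: Value, selector: &str, arg: Value) -> Value {
    // Placeholder for class lookup, method resolution, frame setup...
    let _ = (recv, selector, arg);
    Value::Int(0)
}
```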

This is for two reasons.

1) Language semantics. If you have multiple threads running then one thread can rewrite the code running on another thread. This means the compiler will be hamstrung in the assumptions it can make about even the most simple for loop. Either you must actually send messages for every integer addition because another thread might be listening to those messages or you disallow multi-threading. In either case your performance will be abysmal.

2) Wide instructions. Modern CPUs can do much work in parallel thanks to SIMD. In a world where every integer is a heap object you have no chance of SIMD optimization.
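A small Rust illustration of that second point (hypothetical functions): the unboxed sum runs over contiguous integers and is a candidate for vector instructions, while the boxed version forces a pointer dereference per element.

```rust
// Contiguous, unboxed integers: LLVM can turn this reduction into
// SIMD adds.
fn sum_unboxed(xs: &[i64]) -> i64 {
    xs.iter().sum()
}

// One heap allocation per integer: every element is a pointer chase,
// so there is nothing for the vectorizer to work with.
fn sum_boxed(xs: &[Box<i64>]) -> i64 {
    xs.iter().map(|x| **x).sum()
}
```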

It's instructive to look at JavaScript and Python. JavaScript engines are much faster than Python engines because JavaScript doesn't do threads. This means the compiler can do tons of static code analysis. No such luck with Python. If you want to go fast with Python you use numpy, which just calls out to C. Here too the optimizations become possible because the language semantics that are holding optimization efforts back are disregarded once the Python code calls into C.

Most of the code we write in practice is pretty static. Dynamic features like virtual function calls, type introspection, and run-time code generation can be added to static languages without much difficulty. On the other hand, it's extremely hard (if not impossible) to get even 10% of the performance your computer is capable of using a highly dynamic language. In practice you don't even get 1%. I know this is an unpopular message but many people have worked hard at making slow languages fast for decades now and it's just not happening.

By @bearjaws - 4 months
Semi-related: I recently started doing a leetcode.com challenge every morning with my coffee.

I am appalled that the top 150 interview questions are 80% iteration avoidance (if you are aiming for good timing).

My experience has always been that the problem is how you handle API calls, organize queries, etc... I have maybe 5 times in my career been like 'oh wow, we need to avoid this for loop'. Then of course there is managing complexity...

By @brabel - 4 months
You can get a PhD by doing stuff like this?? I do that kind of thing for fun after work on my hobby projects, and at work I've spent weeks speeding up a single HTTP request path.
By @Frieren - 4 months
It seems I am the only one bothered by the title. "Optimizing for X" is a very common expression, so I read it as "optimizing for loops", as in adding more loops, instead of the correct "optimizing for-loops".

Or am I the only one?

By @larsga - 4 months
FWIW, my experience with JSLT https://github.com/schibsted/jslt/ was that I first wrote an AST interpreter. Then I and another person independently wrote bytecode interpreters, but the AST interpreter remains faster than both.

It may be that the JVM is simply better at optimizing one type of code than the other.
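For readers unfamiliar with the distinction, the two interpreter shapes being compared look roughly like this (a generic Rust sketch; JSLT itself runs on the JVM, where the JIT can often inline the recursive AST walk very effectively):

```rust
// AST walking: recursive evaluation directly over the tree.
enum Ast {
    Lit(i64),
    Add(Box<Ast>, Box<Ast>),
}

fn eval(node: &Ast) -> i64 {
    match node {
        Ast::Lit(n) => *n,
        Ast::Add(a, b) => eval(a) + eval(b),
    }
}

// Bytecode: a flat instruction stream plus an explicit operand stack.
enum Op {
    Push(i64),
    Add,
}

fn exec(code: &[Op]) -> i64 {
    let mut stack = Vec::new();
    for op in code {
        match op {
            Op::Push(n) => stack.push(*n),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
        }
    }
    stack.pop().unwrap()
}
```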

By @Thorrez - 4 months
Is this pushing a stack element for every loop iteration before starting the loop execution? If there are a ton of iterations, won't the stack overflow?
By @o11c - 4 months
I'm not sure how useful this is for interpreters in general, given the frankly bizarre design decisions. My normal rant about bad interpreter design assumes things not seen here.
By @diffxx - 4 months
Premature optimization is the root of all evil. Instead of optimizing the performance of the interpreter, they should optimize the language semantics to be fast by default.