Parsing protobuf at 2+GB/s: how I learned to love tail calls in C (2021)
The Clang compiler's `musttail` attribute ensures tail call optimization, enhancing performance in C-based interpreters and parsers, particularly improving Protocol Buffers parsing speed to over 2GB/s.
The article discusses the introduction of the `musttail` attribute in the Clang compiler, which guarantees tail call optimization in C, C++, and Objective-C. This guarantee enables significant performance improvements, notably parsing Protocol Buffers (protobuf) at speeds exceeding 2GB/s, more than double previous benchmarks. The author emphasizes that while tail calls are a crucial component of this speedup, they are not the sole factor: tail calls reduce stack memory usage and eliminate the overhead of traditional function calls, enabling more efficient algorithm designs.

The article also highlights the challenges C compilers face in optimizing large interpreter loops, which can lead to suboptimal performance. By adopting a tail call-oriented design, the authors were able to create a more efficient protobuf parser that maintains high performance without resorting to assembly language. The design breaks the parsing process into smaller functions, each responsible for a specific operation, which allows better control over register allocation and code optimization. The article concludes by noting that while the `musttail` feature is not yet widely supported, its availability in Clang represents a significant step toward enhancing performance in C-based interpreters and parsers.
- The `musttail` attribute in Clang guarantees tail call optimization for better performance.
- Tail calls can significantly improve the speed of protobuf parsing, achieving over 2GB/s.
- A tail call-oriented design allows for efficient parsing without using assembly language.
- Smaller, specialized functions improve register allocation and code optimization.
- The `musttail` feature is not yet widely supported, limiting its immediate applicability.
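To make the design concrete, here is a minimal sketch (Clang-flavored C) of the tail-call dispatch pattern the article describes. All names are illustrative placeholders rather than the parser's real API, and the sketch simply skips over varint-encoded fields:

    // Sketch of tail-call dispatch: every handler shares one signature,
    // so the hot arguments stay pinned in the same registers across calls.
    typedef struct {
        const char *end;  // end of the input buffer
    } parse_ctx;

    static const char *dispatch(parse_ctx *ctx, const char *ptr);

    static const char *parse_varint(parse_ctx *ctx, const char *ptr) {
        // Skip the continuation bytes, then the final byte, of one varint.
        while (ptr < ctx->end && ((unsigned char)*ptr & 0x80))
            ptr++;
        if (ptr < ctx->end)
            ptr++;
        __attribute__((musttail)) return dispatch(ctx, ptr);  // guaranteed tail call
    }

    static const char *dispatch(parse_ctx *ctx, const char *ptr) {
        if (ptr >= ctx->end)
            return ptr;  // end of input: done
        // A real parser would read the field tag here and index a table
        // of handlers; this sketch sends every field to one handler.
        __attribute__((musttail)) return parse_varint(ctx, ptr);
    }

Because each return is a guaranteed tail call, the chain of handlers compiles to jumps rather than calls, and the stack does not grow while parsing.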
- There is interest in standardizing tail call optimization across various compilers, with ongoing discussions about its implementation in GCC.
- Comments highlight the performance benefits of tail call optimization in interpreters and the potential for improved parsing speeds, particularly in Protocol Buffers.
- Some users express curiosity about the interaction of `musttail` with other language features and optimizations, such as cleanup attributes and computed gotos.
- There are references to similar proposals in other languages, like Rust, indicating a broader interest in tail call optimization beyond C.
- Several commenters emphasize the importance of understanding how to write code that facilitates compiler optimizations, regardless of specific language features.
What I like about it, compared with standardizing [[musttail]], is that the lifetimes of local objects are guaranteed to end before the tail call. This makes it possible to implement without extensive escape analysis.
[0] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3266.htm#5...
[0]: https://github.com/rust-lang/rfcs/pull/1888
[1]: https://github.com/rust-lang/rfcs/pull/3407
Also, the main reason interpreters get faster with either the computed-goto style or the tail-call style, versus a classic switch loop, is reduced pressure on the branch predictor: there is statically one indirect branch per opcode, rather than a single indirect branch shared by every opcode.
See the LuaJIT Remake blog for an exhaustive analysis, and for an alternative that uses an intermediate compiler: https://sillycross.github.io/2022/11/22/2022-11-22/
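For illustration, here is a toy computed-goto interpreter (a GNU C extension supported by GCC and Clang) showing those per-opcode indirect branches; the opcodes are made up:

    #include <stdio.h>

    enum { HALT, INC, DEC };  // made-up opcodes

    static int run(const unsigned char *pc) {
        // GNU C label-address array: the dispatch table for the opcodes.
        static void *dispatch[] = { &&do_halt, &&do_inc, &&do_dec };
        int acc = 0;

        goto *dispatch[*pc++];      // initial dispatch
    do_inc:
        acc++;
        goto *dispatch[*pc++];      // per-opcode indirect branch
    do_dec:
        acc--;
        goto *dispatch[*pc++];
    do_halt:
        return acc;
    }

    int main(void) {
        const unsigned char prog[] = { INC, INC, DEC, HALT };
        printf("%d\n", run(prog));  // prints 1
        return 0;
    }

A classic while (1) switch (*pc++) loop funnels every opcode transition through the single branch at the top of the loop; here each handler's final goto is its own branch, which the predictor can learn separately.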
AFAIK, the musttail attribute is in the process of being added to GCC (the patch is under review) with semantics compatible with Clang's.
    struct SomeClassWithADestructor { ~SomeClassWithADestructor(); };
    int bar();

    int foo()
    {
        auto a = SomeClassWithADestructor();
        return bar(); // Not a tail call: the destruction of a happens *after* the call to bar()
    }
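One way to make the call eligible again, using the same hypothetical types, is to end the object's lifetime before the call:

    int foo_tail()
    {
        {
            auto a = SomeClassWithADestructor();
            // ... use a ...
        }             // a is destroyed here, before the call
        return bar(); // now eligible for tail-call optimization
    }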
Also, if the error-handling function is unlikely to be called, you wouldn't care too much about how fast it is to call it?
To me it seems like a structure of the form
    <read data>
    if (unlikely(malformed))
        return error();
    <prep more stuff>
    switch (data_type) {
    case x:
        return handle_x();
    case y:
        return handle_y();
    }
generates a nice jump table quite reliably.
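A compilable rendering of that shape might look like this; all names are made up, and whether the switch actually lowers to a jump table depends on the optimizer and the density of the cases:

    #define unlikely(x) __builtin_expect(!!(x), 0)

    enum data_type { TYPE_A, TYPE_B, TYPE_C, TYPE_D };

    static int error(void)    { return -1; }
    static int handle_a(void) { return 1; }
    static int handle_b(void) { return 2; }
    static int handle_c(void) { return 3; }
    static int handle_d(void) { return 4; }

    int step(int malformed, enum data_type t)
    {
        if (unlikely(malformed))
            return error();          // cold path stays out of the hot code
        switch (t) {                 // dense cases can lower to a jump table
        case TYPE_A: return handle_a();
        case TYPE_B: return handle_b();
        case TYPE_C: return handle_c();
        case TYPE_D: return handle_d();
        }
        return error();
    }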
* Since this library is only offered as a binding to some managed languages, there are extra caveats that, in terms of performance, overshadow everything else. I cannot speak to Ruby or PHP, but with Python I saw a dramatic speed increase when not using enums: if you translate Protobuf's enums into Python's enums, any possible gains in your C code will be trampled by the creation times of the various Python objects. The difference is many orders of magnitude. Similarly, you could go even further and implement all the supporting data structures in C, exposing only a minimal interface to them in Python. How fair such a comparison would be against code that uses Python's built-in structures is a question I cannot answer.
* Google's Protobuf parser for Python would still be "faster" than 2+GB/s because... it doesn't parse anything besides the top-level message; the internal structure of the message is parsed on demand. This will probably be slower than 2+GB/s if your code immediately reads everything that was parsed, but the obvious question is: how can you compare these two approaches in practical terms? Again, there isn't a clear answer, because the practical results depend on the nature of your application.
* In the general case, Protobuf parsing cannot be streamed (because of the handling of duplicates). This means that, in practice, code that parses Protobuf content will be bottlenecked on I/O: it needs to wait for the end of the message before any parsing starts. Independently, depending on the typical Protobuf message in your application, it may be possible to parallelize parsing, which will most likely outrun any single-threaded parser but, like the examples above, cannot be called a winning strategy in general.
* It's usually a lot more efficient to combine parsing with the creation of domain objects, a step an application will almost always have to take anyway. How this functionality is exposed by the parser will in many cases determine which parser wins the race.
----
Bottom line: Protobuf (or maybe parsers in general) is just a bad candidate for measuring and comparing speeds. It's too low-level and too poorly designed to be a good measuring stick for performance benchmarks.
My understanding is that musttail is useful because it prevents stack frame construction and register spilling; the calling function basically says "you can re-use the existing stack and not worry about my local variables."
But doesn't the stack frame construction overhead / register spilling only occur when the fallback path is actually invoked? If it is unlikely, and you care less about performance on this slow path, does it matter whether you wrap the fallback path in musttail?
(Is this purely a branch prediction effect, where the possibility of extra work needing to be done slows down the fast path even when the work does not need to be done?)
I think the work on C23 (which would still have been C2x when this article was written) means people are starting to want to see innovation in the language again. I don't think that would have happened without Go and Rust showing some great thinking, but it's interesting that it's happening at all.
Tail calls are an interesting idea, and now I want to know more about both this extension, and also how to write my code in such a way as to help a compiler optimize for tail calls even when this extension isn't present. This somehow feels more fun to me than writing more Python or Ruby...
Related
Some Tricks from the Scrapscript Compiler
The Scrapscript compiler implements optimization tricks like immediate objects, small strings, and variants for better performance. It introduces immediate variants and const heap to enhance efficiency without complexity, seeking suggestions for future improvements.
Tail Recursion for Macros in C
The blog discusses tail recursion for macros in C, introducing __VA_TAIL__() to enable real recursion, overcoming inefficiencies of traditional recursive macro calls. It showcases how tail recursion improves processing efficiency in macros.
How Clang compiles a function (2018)
The article explains how Clang compiles a simple C++ function into LLVM IR, detailing memory allocation, control flow, and the representation of loops, with a focus on unoptimized output.
Better Firmware with LLVM/Clang
LLVM and Clang are gaining traction in embedded software development, particularly for ARM Cortex-M devices. The article advocates integrating Clang for better static analysis, error detection, and dual compiler usage.
CPU Dispatching: Make your code both portable and fast (2020)
CPU dispatching improves software performance and portability by allowing binaries to select code versions based on CPU features at runtime, with manual and compiler-assisted approaches enhancing efficiency, especially using SIMD instructions.