Parsing protobuf at 2+GB/s: how I learned to love tail calls in C (2021)
The Clang compiler's `musttail` attribute ensures tail call optimization, enhancing performance in C-based interpreters and parsers, particularly improving Protocol Buffers parsing speed to over 2GB/s.
The article discusses the introduction of the `musttail` attribute in the Clang compiler, which guarantees tail call optimization in C, C++, and Objective-C. This guarantee enables significant performance improvements, notably parsing Protocol Buffers (protobuf) at speeds exceeding 2GB/s, more than double previous benchmarks. The author emphasizes that while tail calls are a crucial component of this speedup, they are not the sole factor: tail calls reduce stack memory usage and eliminate the overhead of traditional function calls, enabling more efficient algorithm designs.

The article also highlights the challenges C compilers face in optimizing large interpreter loops, which can lead to suboptimal performance. By adopting a tail call-oriented design, the authors were able to create a more efficient protobuf parser that maintains high performance without resorting to assembly language. The design breaks the parsing process into smaller functions, each responsible for a specific operation, which allows better control over register allocation and code optimization. The article concludes by noting that while the `musttail` feature is not yet widely supported, its availability in Clang represents a significant step toward enhancing performance in C-based interpreters and parsers.
- The `musttail` attribute in Clang guarantees tail call optimization for better performance.
- Tail calls can significantly improve the speed of protobuf parsing, achieving over 2GB/s.
- A tail call-oriented design allows for efficient parsing without using assembly language.
- Smaller, specialized functions improve register allocation and code optimization.
- The `musttail` feature is not yet widely supported, limiting its immediate applicability.
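To make the design concrete, here is a minimal sketch (Clang-flavored C) of the tail-call dispatch pattern the article describes. All names are illustrative placeholders rather than the parser's real API, and the sketch simply skips over varint-encoded fields:

    // Sketch of tail-call dispatch: every handler shares one signature,
    // so the hot arguments stay pinned in the same registers across calls.
    typedef struct {
        const char *end;  // end of the input buffer
    } parse_ctx;

    static const char *dispatch(parse_ctx *ctx, const char *ptr);

    static const char *parse_varint(parse_ctx *ctx, const char *ptr) {
        // Skip the continuation bytes, then the final byte, of one varint.
        while (ptr < ctx->end && ((unsigned char)*ptr & 0x80))
            ptr++;
        if (ptr < ctx->end)
            ptr++;
        __attribute__((musttail)) return dispatch(ctx, ptr);  // guaranteed tail call
    }

    static const char *dispatch(parse_ctx *ctx, const char *ptr) {
        if (ptr >= ctx->end)
            return ptr;  // end of input: done
        // A real parser would read the field tag here and index a table
        // of handlers; this sketch sends every field to one handler.
        __attribute__((musttail)) return parse_varint(ctx, ptr);
    }

Because each return is a guaranteed tail call, the chain of handlers compiles to jumps rather than calls, and the stack does not grow while parsing.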
- There is interest in standardizing tail call optimization across various compilers, with ongoing discussions about its implementation in GCC.
- Comments highlight the performance benefits of tail call optimization in interpreters and the potential for improved parsing speeds, particularly in Protocol Buffers.
- Some users express curiosity about the interaction of `musttail` with other language features and optimizations, such as cleanup attributes and computed gotos.
- There are references to similar proposals in other languages, like Rust, indicating a broader interest in tail call optimization beyond C.
- Several commenters emphasize the importance of understanding how to write code that facilitates compiler optimizations, regardless of specific language features.
What I like about it, compared with standardizing [[musttail]], is that the lifetimes of local objects are guaranteed to end before the tail call. This makes it possible to implement without extensive escape analysis.
[0] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3266.htm#5...
[0]: https://github.com/rust-lang/rfcs/pull/1888
[1]: https://github.com/rust-lang/rfcs/pull/3407
Also, the main reason interpreters get faster with either the computed-goto style or the tail-call style, versus a classic switch loop, is reduced pressure on the branch predictor: there is statically one indirect branch per opcode, rather than a single indirect branch shared by every opcode.
See the LuaJIT Remake blog for an exhaustive analysis, and for an alternative that uses an intermediate compiler: https://sillycross.github.io/2022/11/22/2022-11-22/
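For illustration, here is a toy computed-goto interpreter (a GNU C extension supported by GCC and Clang) showing those per-opcode indirect branches; the opcodes are made up:

    #include <stdio.h>

    enum { HALT, INC, DEC };  // made-up opcodes

    static int run(const unsigned char *pc) {
        // GNU C label-address array: the dispatch table for the opcodes.
        static void *dispatch[] = { &&do_halt, &&do_inc, &&do_dec };
        int acc = 0;

        goto *dispatch[*pc++];      // initial dispatch
    do_inc:
        acc++;
        goto *dispatch[*pc++];      // per-opcode indirect branch
    do_dec:
        acc--;
        goto *dispatch[*pc++];
    do_halt:
        return acc;
    }

    int main(void) {
        const unsigned char prog[] = { INC, INC, DEC, HALT };
        printf("%d\n", run(prog));  // prints 1
        return 0;
    }

A classic while (1) switch (*pc++) loop funnels every opcode transition through the single branch at the top of the loop; here each handler's final goto is its own branch, which the predictor can learn separately.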
AFAIK, the musttail attribute is in the process of being added to GCC (the patch is under review) with semantics compatible with Clang's.
    struct SomeClassWithADestructor { ~SomeClassWithADestructor(); };
    int bar();

    int foo()
    {
        auto a = SomeClassWithADestructor();
        return bar(); // Not a tail call: the destruction of a happens *after* the call to bar()
    }
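One way to make the call eligible again, using the same hypothetical types, is to end the object's lifetime before the call:

    int foo_tail()
    {
        {
            auto a = SomeClassWithADestructor();
            // ... use a ...
        }             // a is destroyed here, before the call
        return bar(); // now eligible for tail-call optimization
    }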
Also, if the error-handling function is unlikely to be called, you wouldn't care too much about how fast it is to call it?
To me it seems like a structure of the form
    <read data>
    if (unlikely(malformed))
        return error();
    <prep more stuff>
    switch (data_type) {
    case x:
        return handle_x();
    case y:
        return handle_y();
    }
generates a nice jump table quite reliably.
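A compilable rendering of that shape might look like this; all names are made up, and whether the switch actually lowers to a jump table depends on the optimizer and the density of the cases:

    #define unlikely(x) __builtin_expect(!!(x), 0)

    enum data_type { TYPE_A, TYPE_B, TYPE_C, TYPE_D };

    static int error(void)    { return -1; }
    static int handle_a(void) { return 1; }
    static int handle_b(void) { return 2; }
    static int handle_c(void) { return 3; }
    static int handle_d(void) { return 4; }

    int step(int malformed, enum data_type t)
    {
        if (unlikely(malformed))
            return error();          // cold path stays out of the hot code
        switch (t) {                 // dense cases can lower to a jump table
        case TYPE_A: return handle_a();
        case TYPE_B: return handle_b();
        case TYPE_C: return handle_c();
        case TYPE_D: return handle_d();
        }
        return error();
    }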
* Since this library is only offered as a binding to some managed languages, there are extra caveats that, in terms of performance, overshadow everything else. I cannot speak to Ruby or PHP, but with Python I saw a dramatic speed increase when not using enums: if you translate Protobuf's enums into Python's enums, any possible gains in your C code will be trampled by the creation times of the various Python objects. The difference is many orders of magnitude. Similarly, you could go even further and implement all the supporting data structures in C, exposing only a minimal interface to them in Python. How fair such a comparison would be against code that uses Python's built-in structures is a question I cannot answer.
* Google's Protobuf parser for Python would still be "faster" than 2+GB/s because... it doesn't parse anything besides the top-level message; the internal structure of the message is parsed on demand. This will probably be slower than 2+GB/s if your code immediately reads everything that was parsed, but the obvious question is: how can you compare these two approaches in practical terms? Again, there isn't a clear answer, because the practical results depend on the nature of your application.
* In the general case, Protobuf parsing cannot be streamed (because of the handling of duplicates). This means that, in practice, code that parses Protobuf content will be bottlenecked on I/O: it needs to wait for the end of the message before any parsing starts. Independently, depending on the typical Protobuf message in your application, it may be possible to parallelize parsing, which will most likely outrun any single-threaded parser but, like the examples above, cannot be called a winning strategy in general.
* It's usually a lot more efficient to combine parsing with the creation of domain objects, a step an application will almost always have to take anyway. How this functionality is exposed by the parser will in many cases determine which parser wins the race.
----
Bottom line: Protobuf (or maybe parsers in general) is just a bad candidate for measuring and comparing speeds. It's too low-level and too poorly designed to be a good measuring stick for performance benchmarks.
My understanding is that musttail is useful because it prevents stack frame construction and register spilling; the calling function basically says "you can re-use the existing stack and not worry about my local variables."
But doesn't the stack frame construction overhead / register spilling only occur when the fallback path is actually invoked? If it is unlikely, and you care less about performance on this slow path, does it matter whether you wrap the fallback path in musttail?
(Is this purely a branch prediction effect, where the possibility of extra work needing to be done slows down the fast path even when the work does not need to be done?)
I think the work on C23 (which would still have been C2x when this article was written) means people are starting to want to see innovation in the language again. I don't think that would have happened without Go and Rust showing some great thinking, but it's interesting that it's happening at all.
Tail calls are an interesting idea, and now I want to know more about both this extension, and also how to write my code in such a way as to help a compiler optimize for tail calls even when this extension isn't present. This somehow feels more fun to me than writing more Python or Ruby...
Related
Some Tricks from the Scrapscript Compiler
The Scrapscript compiler implements optimization tricks like immediate objects, small strings, and variants for better performance. It introduces immediate variants and const heap to enhance efficiency without complexity, seeking suggestions for future improvements.
Tail Recursion for Macros in C
The blog discusses tail recursion for macros in C, introducing __VA_TAIL__() to enable real recursion, overcoming inefficiencies of traditional recursive macro calls. It showcases how tail recursion improves processing efficiency in macros.
How Clang compiles a function (2018)
The article explains how Clang compiles a simple C++ function into LLVM IR, detailing memory allocation, control flow, and the representation of loops, with a focus on unoptimized output.
Better Firmware with LLVM/Clang
LLVM and Clang are gaining traction in embedded software development, particularly for ARM Cortex-M devices. The article advocates integrating Clang for better static analysis, error detection, and dual compiler usage.
CPU Dispatching: Make your code both portable and fast (2020)
CPU dispatching improves software performance and portability by allowing binaries to select code versions based on CPU features at runtime, with manual and compiler-assisted approaches enhancing efficiency, especially using SIMD instructions.