Own Constant Folder in C/C++
Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.
Neil Henning discusses a quirk in clang related to using the sqrtps intrinsic in C/C++ code. When compiling with -ffast-math, clang can produce a faster but less precise instruction sequence, with precision that differs between Intel and AMD processors. To guarantee the desired instruction selection, Henning suggests inline assembly. However, he highlights a drawback: constant folding no longer occurs when the function is inlined with constant arguments. To address this, he introduces a workaround that uses __builtin_constant_p to detect constant inputs and fall back to code the compiler can fold. Because of a limitation in GCC's behavior, the check has to be applied to individual vector elements rather than to the vector as a whole. With this approach the code folds constants as expected even when inlined. Henning concludes by suggesting potential improvements to __builtin_constant_p for better compatibility across compilers.
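The pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not Henning's actual code: the function name is invented, and the element-wise __builtin_constant_p checks mirror the per-element workaround the summary mentions. x86 with GCC/clang is assumed.

```cpp
#include <immintrin.h>

// Hypothetical helper following the article's pattern: force the exact
// sqrtps instruction via inline asm, but fall back to the plain intrinsic
// when every element is a compile-time constant so the compiler can fold it.
static inline __m128 my_sqrtps(__m128 x) {
  // Checking elements individually works around GCC, which does not treat
  // a whole vector as a constant for __builtin_constant_p.
  if (__builtin_constant_p(x[0]) && __builtin_constant_p(x[1]) &&
      __builtin_constant_p(x[2]) && __builtin_constant_p(x[3])) {
    return _mm_sqrt_ps(x);  // constant operand: let the compiler fold it
  }
  __asm__("sqrtps %0, %0" : "+x"(x));  // otherwise pin the instruction
  return x;
}
```

At -O0 the builtin simply evaluates to false and the asm path is taken, so behavior is correct either way; the folding only kicks in when the function is inlined with constants under optimization.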
Related
As you learn Forth, it learns from you (1981)
The Forth programming language is highlighted for its unique features like extensibility, speed, and efficiency. Contrasted with Basic, Forth's threaded code system and data handling methods make it versatile.
My experience crafting an interpreter with Rust (2021)
Manuel Cerón details creating an interpreter with Rust, transitioning from Clojure. Leveraging Rust's safety features, he faced challenges with closures and classes, optimizing code for performance while balancing safety.
Identifying Leap Years (2020)
David Turner explores optimizing leap year calculations for performance gains by using bitwise operations and integer bounds. He presents efficient methods, mathematical proofs, and considerations for signed integers, highlighting limitations pre-Gregorian calendar.
Finnish startup says it can speed up any CPU by 100x
A Finnish startup, Flow Computing, introduces the Parallel Processing Unit (PPU) chip promising 100x CPU performance boost for AI and autonomous vehicles. Despite skepticism, CEO Timo Valtonen is optimistic about partnerships and industry adoption.
Optimizing the Roc parser/compiler with data-oriented design
The blog post explores optimizing a parser/compiler with data-oriented design (DoD), comparing Array of Structs and Struct of Arrays for improved performance through memory efficiency and cache utilization. Restructuring data in the Roc compiler showcases enhanced efficiency and performance gains.
The problem is that you end up promoting y from an integer to an f64 and get a much slower operation. I ended up writing my own `rem_i64(self: &f64, divisor: i64) -> f64` routine that was roughly 35x faster (a huge win when crunching massive arrays), but since there are range limitations (f64::MAX > i64::MAX) you can't naively replace all call sites based on the type signatures. However, with some support from the compiler it would be completely doable whenever the compiler can infer an upper/lower bound on the f64 dividend, when the result of the operation is coerced to an integer afterwards, or when the dividend is a constant value that doesn't exceed that range.
So now I copy that function around from ML project to ML project, because what else can I do?
(A “workaround” was to use a slower-but-still-faster `rem_i128(self: &f64, divisor: i128) -> f64` to raise the functional limits of the operation, but you're never going to match the range of a 64-bit floating point value until you use 512-bit integral math!)
[0]: https://github.com/rust-lang/rust/issues/83973
Godbolt link: https://godbolt.org/z/EqrEqExnc
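For readers coming from C/C++, the idea behind the comment's `rem_i64` can be rendered roughly like this. This is an illustrative translation, not the commenter's actual Rust code, and it carries exactly the precondition the comment describes: the dividend must fit in an int64_t.

```cpp
#include <cstdint>

// Remainder of an f64 by an integer divisor without the slow fmod path:
// compute the truncated quotient in integer math, then subtract.
// Precondition: |x| must be representable as int64_t (the comment's
// range limitation, since f64::MAX > i64::MAX).
static inline double rem_i64(double x, int64_t divisor) {
  int64_t q = static_cast<int64_t>(x) / divisor;  // trunc(x) / d == trunc(x / d)
  return x - static_cast<double>(q * divisor);
}
```

Within range this matches fmod's truncation-toward-zero semantics, e.g. rem_i64(-7.5, 2) gives -1.5 just as fmod(-7.5, 2.0) does.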
With "own constant folder" I expected a GCC/clang plugin that does tree/control flow analysis and some fancy logic in order to determine when, where and how to constant fold…
How is a non-expert in the language supposed to learn tricks/... things like this? I'm asking as a C++ developer of 6+ years in high performance settings, most of this article is esoteric to me.
Clang transforms sqrtps(x) to x * rsqrtps(x) when -ffast-math is set because it's often faster (See [1] section 15.12). It isn't faster for some architectures, but if you tell clang what architecture you're targeting (with -mtune), it appears to make the right choice for the architecture[2].
[1]: https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
[2]: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
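Spelled out with intrinsics, the two forms look roughly like this. This is a sketch, not clang's exact codegen; the Newton-Raphson refinement step mirrors what compilers typically emit to recover precision, since rsqrtps alone is only accurate to about 12 bits.

```cpp
#include <immintrin.h>

// Exact form: one correctly rounded hardware square root.
static inline __m128 sqrt_exact(__m128 x) { return _mm_sqrt_ps(x); }

// Approximate fast-math form: sqrt(x) = x * rsqrt(x), refined with one
// Newton-Raphson step. Note x == 0 yields NaN here (0 * inf), one reason
// this rewrite is only legal under -ffast-math.
static inline __m128 sqrt_fast(__m128 x) {
  __m128 r = _mm_rsqrt_ps(x);  // ~12-bit approximation of 1/sqrt(x)
  // NR iteration: r = r * (1.5 - 0.5 * x * r * r)
  __m128 half_x_rr = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x),
                                _mm_mul_ps(r, r));
  r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f), half_x_rr));
  return _mm_mul_ps(x, r);  // sqrt(x) = x * (1/sqrt(x))
}
```

Whether the second form wins depends on the microarchitecture's sqrtps latency/throughput, which is why -mtune matters.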
Back in the days the answer was `__builtin_constant_p`.
But with C++20 it is possible to use std::is_constant_evaluated, or `if consteval` with C++23.
But this is for the scenario where you still want to keep the code quality high (say, when you are writing a multiprecision math library, not when you are hacking around a compiler flag), which inline assembly violates for several reasons:
1) major issue: instead of dealing with `-ffast-math` via inline asm, just remove `-ffast-math`
2) randomly slapped inline asm inside normal fp32 computations breaks autovectorization
3) the randomly slapped inline asm in the example uses non-VEX encoding, so you are likely to forget to call vzeroupper on the transition. And in general, this limits the code to x86 (forget about x86-64/ARM)
4) the provided example (best_always_inline) does not work in GCC as expected
The square root and div instructions used to be a lot slower than they are now.
If you really want to relax the correctness of your code to get some potential speedup, either do the optimizations yourself instead of letting the compiler do them, or locally enable fast-math equivalents.
As for the is-constant-expression GCC extensions, that functionality is natively available in standard C++ nowadays.
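"Locally enable fast-math equivalents" can be done without touching the global flags, for example like this (a sketch; `dot` is just an illustrative function):

```cpp
// Restrict the fast-math relaxation to one function instead of compiling
// the whole translation unit with -ffast-math.
#if defined(__clang__)
float dot(const float* a, const float* b, int n) {
  #pragma clang fp reassociate(on)  // allow FP reassociation in this block
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];  // can now be vectorized
  return s;
}
#elif defined(__GNUC__)
__attribute__((optimize("fast-math")))  // per-function fast-math in GCC
float dot(const float* a, const float* b, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];
  return s;
}
#else
float dot(const float* a, const float* b, int n) {  // strict fallback
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];
  return s;
}
#endif
```

This keeps the relaxed semantics visible at the point of use instead of silently changing every floating-point operation in the program.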
Compile-time execution emulates the target architecture, which has pros and cons. The most notable con is that inline assembly isn't supported (yet?). The complete solution to this particular problem then additionally requires `return if (isComptime()) normal_code() else assembly_magic();` (which has no runtime cost because that branch is constant-folded). Given that inline assembly usually also branches on target architecture, that winds up not adding much complexity -- especially given that you probably had exactly that same code as a fallback for architectures you didn't explicitly handle.
https://www.intel.com/content/www/us/en/docs/cpp-compiler/de...
I never got that impression in my perusal of LLVM bug reports and patches. I wonder if there is an open issue for this specific case.
https://godbolt.org/z/1543TYszP
(intel c++ compiler)
That option is terribly named. It should be -fincorrect_math_that_might_sometimes_be_faster.