Own Constant Folder in C/C++
Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.
Neil Henning discusses a quirk in clang related to using the sqrtps intrinsic in C/C++ code. When compiling with -ffast-math, clang can produce a faster but less precise instruction sequence, with precision that differs between Intel and AMD processors. To guarantee the desired instruction selection, Henning suggests inline assembly. However, he highlights a drawback: constant folding no longer occurs when the function is inlined with constant arguments. To address this, he introduces a workaround that uses __builtin_constant_p to detect constant inputs and fall back to code the compiler can fold. Because of a limitation in GCC's behavior, the check has to be applied to individual vector elements rather than to the vector as a whole. With this approach the code folds constants as expected even when inlined. Henning concludes by suggesting potential improvements to __builtin_constant_p for better compatibility across compilers.
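The pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not Henning's actual code: the function name is invented, and the element-wise __builtin_constant_p checks mirror the per-element workaround the summary mentions. x86 with GCC/clang is assumed.

```cpp
#include <immintrin.h>

// Hypothetical helper following the article's pattern: force the exact
// sqrtps instruction via inline asm, but fall back to the plain intrinsic
// when every element is a compile-time constant so the compiler can fold it.
static inline __m128 my_sqrtps(__m128 x) {
  // Checking elements individually works around GCC, which does not treat
  // a whole vector as a constant for __builtin_constant_p.
  if (__builtin_constant_p(x[0]) && __builtin_constant_p(x[1]) &&
      __builtin_constant_p(x[2]) && __builtin_constant_p(x[3])) {
    return _mm_sqrt_ps(x);  // constant operand: let the compiler fold it
  }
  __asm__("sqrtps %0, %0" : "+x"(x));  // otherwise pin the instruction
  return x;
}
```

At -O0 the builtin simply evaluates to false and the asm path is taken, so behavior is correct either way; the folding only kicks in when the function is inlined with constants under optimization.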
Related
As you learn Forth, it learns from you (1981)
The Forth programming language is highlighted for its unique features like extensibility, speed, and efficiency. Contrasted with Basic, Forth's threaded code system and data handling methods make it versatile.
My experience crafting an interpreter with Rust (2021)
Manuel Cerón details creating an interpreter with Rust, transitioning from Clojure. Leveraging Rust's safety features, he faced challenges with closures and classes, optimizing code for performance while balancing safety.
Identifying Leap Years (2020)
David Turner explores optimizing leap year calculations for performance gains by using bitwise operations and integer bounds. He presents efficient methods, mathematical proofs, and considerations for signed integers, highlighting limitations pre-Gregorian calendar.
Finnish startup says it can speed up any CPU by 100x
A Finnish startup, Flow Computing, introduces the Parallel Processing Unit (PPU) chip promising 100x CPU performance boost for AI and autonomous vehicles. Despite skepticism, CEO Timo Valtonen is optimistic about partnerships and industry adoption.
Optimizing the Roc parser/compiler with data-oriented design
The blog post explores optimizing a parser/compiler with data-oriented design (DoD), comparing Array of Structs and Struct of Arrays for improved performance through memory efficiency and cache utilization. Restructuring data in the Roc compiler showcases enhanced efficiency and performance gains.
The problem is that you end up promoting y from an integer to an f64 and get a much slower operation. I ended up writing my own `rem_i64(self: &f64, divisor: i64) -> f64` routine that was roughly 35x faster (a huge win when crunching massive arrays), but since there are range limitations (f64::MAX > i64::MAX) you can't naively replace all call sites based on the type signatures. However, with some support from the compiler it would be completely doable whenever the compiler can infer an upper/lower bound on the f64 dividend, when the result of the operation is coerced to an integer afterwards, or when the dividend is a constant value that doesn't exceed that range.
So now I copy that function around from ML project to ML project, because what else can I do?
(A “workaround” was to use a slower-but-still-faster `rem_i128(self: &f64, divisor: i128) -> f64` to raise the functional limits of the operation, but you're never going to match the range of a 64-bit floating point value until you use 512-bit integral math!)
[0]: https://github.com/rust-lang/rust/issues/83973
Godbolt link: https://godbolt.org/z/EqrEqExnc
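For readers coming from C/C++, the idea behind the comment's `rem_i64` can be rendered roughly like this. This is an illustrative translation, not the commenter's actual Rust code, and it carries exactly the precondition the comment describes: the dividend must fit in an int64_t.

```cpp
#include <cstdint>

// Remainder of an f64 by an integer divisor without the slow fmod path:
// compute the truncated quotient in integer math, then subtract.
// Precondition: |x| must be representable as int64_t (the comment's
// range limitation, since f64::MAX > i64::MAX).
static inline double rem_i64(double x, int64_t divisor) {
  int64_t q = static_cast<int64_t>(x) / divisor;  // trunc(x) / d == trunc(x / d)
  return x - static_cast<double>(q * divisor);
}
```

Within range this matches fmod's truncation-toward-zero semantics, e.g. rem_i64(-7.5, 2) gives -1.5 just as fmod(-7.5, 2.0) does.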
With "own constant folder" I expected a GCC/clang plugin that does tree/control flow analysis and some fancy logic in order to determine when, where and how to constant fold…
How is a non-expert in the language supposed to learn tricks/... things like this? I'm asking as a C++ developer of 6+ years in high performance settings, most of this article is esoteric to me.
Clang transforms sqrtps(x) to x * rsqrtps(x) when -ffast-math is set because it's often faster (See [1] section 15.12). It isn't faster for some architectures, but if you tell clang what architecture you're targeting (with -mtune), it appears to make the right choice for the architecture[2].
[1]: https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...
[2]: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
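Spelled out with intrinsics, the two forms look roughly like this. This is a sketch, not clang's exact codegen; the Newton-Raphson refinement step mirrors what compilers typically emit to recover precision, since rsqrtps alone is only accurate to about 12 bits.

```cpp
#include <immintrin.h>

// Exact form: one correctly rounded hardware square root.
static inline __m128 sqrt_exact(__m128 x) { return _mm_sqrt_ps(x); }

// Approximate fast-math form: sqrt(x) = x * rsqrt(x), refined with one
// Newton-Raphson step. Note x == 0 yields NaN here (0 * inf), one reason
// this rewrite is only legal under -ffast-math.
static inline __m128 sqrt_fast(__m128 x) {
  __m128 r = _mm_rsqrt_ps(x);  // ~12-bit approximation of 1/sqrt(x)
  // NR iteration: r = r * (1.5 - 0.5 * x * r * r)
  __m128 half_x_rr = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x),
                                _mm_mul_ps(r, r));
  r = _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(1.5f), half_x_rr));
  return _mm_mul_ps(x, r);  // sqrt(x) = x * (1/sqrt(x))
}
```

Whether the second form wins depends on the microarchitecture's sqrtps latency/throughput, which is why -mtune matters.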
Back in the days the answer was `__builtin_constant_p`.
But with C++20 it is possible to use std::is_constant_evaluated, or `if consteval` with C++23.
But this is for the scenario where you still want to keep the code quality high (say, when you are writing a multiprecision math library, not when you are hacking around a compiler flag), which inline assembly violates for several reasons:
1) major issue: instead of dealing with `-ffast-math` via inline asm, just remove `-ffast-math`
2) randomly slapped inline asm inside normal fp32 computations breaks autovectorization
3) the randomly slapped inline asm in the example uses non-VEX encoding, so you are likely to forget to call vzeroupper on the transition. And in general, this limits the code to x86 (forget about x86-64/ARM)
4) the provided example (best_always_inline) does not work in GCC as expected
The square root and div instructions used to be a lot slower than they are now.
If you really want to relax the correctness of your code to get some potential speedup, either do the optimizations yourself instead of letting the compiler do them, or locally enable fast-math equivalents.
As for the is-constant-expression GCC extensions, that functionality is natively available in standard C++ nowadays.
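"Locally enable fast-math equivalents" can be done without touching the global flags, for example like this (a sketch; `dot` is just an illustrative function):

```cpp
// Restrict the fast-math relaxation to one function instead of compiling
// the whole translation unit with -ffast-math.
#if defined(__clang__)
float dot(const float* a, const float* b, int n) {
  #pragma clang fp reassociate(on)  // allow FP reassociation in this block
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];  // can now be vectorized
  return s;
}
#elif defined(__GNUC__)
__attribute__((optimize("fast-math")))  // per-function fast-math in GCC
float dot(const float* a, const float* b, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];
  return s;
}
#else
float dot(const float* a, const float* b, int n) {  // strict fallback
  float s = 0.0f;
  for (int i = 0; i < n; ++i) s += a[i] * b[i];
  return s;
}
#endif
```

This keeps the relaxed semantics visible at the point of use instead of silently changing every floating-point operation in the program.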
Compile-time execution emulates the target architecture, which has pros and cons. The most notable con is that inline assembly isn't supported (yet?). The complete solution to this particular problem then additionally requires `return if (isComptime()) normal_code() else assembly_magic();` (which has no runtime cost because that branch is constant-folded). Given that inline assembly usually also branches on target architecture, that winds up not adding much complexity -- especially given that you probably had exactly that same code as a fallback for architectures you didn't explicitly handle.
https://www.intel.com/content/www/us/en/docs/cpp-compiler/de...
I never got that impression in my perusal of LLVM bug reports and patches. I wonder if there is an open issue for this specific case.
https://godbolt.org/z/1543TYszP
(intel c++ compiler)
That option is terribly named. It should be -fincorrect_math_that_might_sometimes_be_faster.