Creating invariant floating-point accumulators
The blog addresses challenges in creating invariant floating-point accumulators for the astcenc codec, emphasizing the need for consistency across SIMD instruction sets and the importance of adhering to IEEE754 rules.
The blog discusses the challenges of creating invariant floating-point accumulators for the astcenc codec, which aims to produce identical output across different instruction sets such as NEON, SSE4.1, and AVX2. The need for invariance arises because floating-point arithmetic is not associative: each operation rounds its result, so the order of additions can change the final value. The author highlights several problems encountered during the implementation, including differences in accumulation order between scalar and vectorized code, variable-width accumulators, and loop tail handling. To address these, the author standardized on fixed-width vector accumulators and adjusted loop tail processing so the same association order is used at every SIMD width. The blog also stresses adhering to IEEE754 rules and avoiding compiler optimizations that introduce variability, and warns against fast approximations and fused operations, which compromise invariance. The author concludes that while achieving invariance may seem complex, careful attention to detail leads to stable outputs.
- The astcenc codec aims for consistent output across various SIMD instruction sets.
- Floating-point arithmetic can introduce variability due to precision and order of operations.
- Standardizing on fixed-width accumulators helps maintain invariance in vectorized code (see the sketch after this list).
- Compiler settings and optimizations can significantly impact floating-point determinism.
- Fast approximations and fused operations should be avoided to ensure consistent results.
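The approach summarized above pins the accumulation order to a fixed vector width regardless of the hardware's native width. Below is a minimal, illustrative sketch of that idea in plain C++; it is not astcenc's actual code, and the 4-wide width and function name are assumptions chosen for the example.

```c++
#include <array>
#include <cstddef>
#include <vector>

// Minimal sketch (not astcenc's actual code): accumulate into a fixed
// 4-wide accumulator so that scalar, NEON, SSE4.1, and AVX2 builds all
// perform the additions in the same order. The tail adds only the values
// that exist, which is equivalent to padding the last vector with zeros.
float invariant_sum(const std::vector<float>& data)
{
    std::array<float, 4> acc { 0.0f, 0.0f, 0.0f, 0.0f };

    std::size_t i = 0;
    std::size_t full = data.size() - (data.size() % 4);
    for (; i < full; i += 4)
    {
        for (std::size_t lane = 0; lane < 4; lane++)
        {
            acc[lane] += data[i + lane];
        }
    }

    // Loop tail: the remaining 0-3 values land in the low lanes, exactly as
    // if the input had been zero-padded to a multiple of the vector width.
    for (std::size_t lane = 0; i + lane < data.size(); lane++)
    {
        acc[lane] += data[i + lane];
    }

    // Horizontal reduction in a fixed order, independent of hardware width.
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Every backend has to use the same reduction order, and the build must not undo it: invariance also depends on compiling without -ffast-math and with floating-point contraction disabled (for example -ffp-contract=off on GCC and Clang), since a contracted fused multiply-add rounds once where a separate multiply and add round twice.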
Related
Own Constant Folder in C/C++
Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
tolower() with AVX-512
Tony Finch's blog post details the implementation of the tolower() function using AVX-512BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.
An SVE backend for astcenc (Adaptive Scalable Texture Compression Encoder)
The implementation of a 256-bit SVE backend for astcenc shows performance improvements of 14% to 63%, utilizing predicated operations and scatter/gather instructions, with future work planned for SVE2.
The question here with the association order for summation is what you want to match. OP chose to match the scalar for-loop equivalent. You could just as easily define an 8-wide or 16-wide "virtual vector" and match that instead.
I suspect an 8-wide virtual vector is the right default currently: Intel systems since Haswell support it, as do all recent AMD parts, and if you're vectorizing anyway you can afford some overhead on Arm from emulating a double-width virtual vector. You don't often gain enough from AVX-512 to make the default 16-wide, but if you wanted to focus on Skylake+ (really Cascade Lake+) or Genoa+ systems, it would be a fine choice.
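One way to picture the "virtual vector" suggestion is to fix the reference accumulation order at eight lanes and have every backend reproduce it regardless of register width. The sketch below shows how a 4-wide SSE backend could do that; the function name, tail handling, and reduction order are illustrative assumptions, not code from the article or the comment.

```c++
#include <immintrin.h>
#include <cstddef>

// Illustrative sketch: a 4-wide SSE backend reproducing an 8-wide
// "virtual vector" accumulation order by holding the virtual accumulator
// in two __m128 registers. An AVX2 backend would use one 8-wide register,
// and a 16-wide backend half a register, all matching the same order.
float virtual_vector_sum_sse(const float* data, std::size_t count)
{
    __m128 acc_lo = _mm_setzero_ps(); // virtual lanes 0-3
    __m128 acc_hi = _mm_setzero_ps(); // virtual lanes 4-7

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        acc_lo = _mm_add_ps(acc_lo, _mm_loadu_ps(data + i));
        acc_hi = _mm_add_ps(acc_hi, _mm_loadu_ps(data + i + 4));
    }

    // Loop tail: pad the remaining 0-7 values with zeros so the final
    // partial vector uses the same association order as a full one.
    if (i < count)
    {
        float tail[8] = {};
        for (std::size_t lane = 0; i + lane < count; lane++)
        {
            tail[lane] = data[i + lane];
        }
        acc_lo = _mm_add_ps(acc_lo, _mm_loadu_ps(tail));
        acc_hi = _mm_add_ps(acc_hi, _mm_loadu_ps(tail + 4));
    }

    // Fixed-order horizontal reduction over the eight virtual lanes.
    float lanes[8];
    _mm_storeu_ps(lanes, acc_lo);
    _mm_storeu_ps(lanes + 4, acc_hi);

    float s01 = lanes[0] + lanes[1];
    float s23 = lanes[2] + lanes[3];
    float s45 = lanes[4] + lanes[5];
    float s67 = lanes[6] + lanes[7];
    return (s01 + s23) + (s45 + s67);
}
```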
It seems to me that non-associative floating-point operations force us into a local maximum. The operations themselves might be efficient on modern machines, but could their lack of associativity be preventing us from applying other important high-level optimizations to our programs? A richer algebraic structure should always be amenable to a richer set of potential optimizations.
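For a concrete sense of the non-associativity being discussed, the two groupings below produce different double-precision results; this is a standalone illustration, not from the comment.

```c++
#include <cstdio>

int main()
{
    double a = 0.1, b = 0.2, c = 0.3;

    // The two groupings round differently, so the results differ in the
    // last bit of the double.
    std::printf("%.17g\n", (a + b) + c); // prints 0.60000000000000009
    std::printf("%.17g\n", a + (b + c)); // prints 0.59999999999999998
    return 0;
}
```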
---
I've asked a question that is very much related to that topic on the programming language subreddit:
"Could numerical operations be optimized by using algebraic properties that are not present in floating point operations but in numbers that have infinite precision?"
https://www.reddit.com/r/ProgrammingLanguages/comments/145kp...
The responses there might be interesting to some people here.