Creating invariant floating-point accumulators
The blog addresses challenges in creating invariant floating-point accumulators for the astcenc codec, emphasizing the need for consistency across SIMD instruction sets and the importance of adhering to IEEE754 rules.
The blog discusses the challenges of creating invariant floating-point accumulators for the astcenc codec, which aims to produce identical output across different instruction sets such as NEON, SSE4.1, and AVX2. The need for invariance arises because floating-point arithmetic is not associative: each operation rounds its result, so the order of additions can change the final value. The author highlights several problems encountered during the implementation, including differences in accumulation order between scalar and vectorized code, variable-width accumulators, and loop tail handling. To address these, the author standardized on fixed-width vector accumulators and adjusted loop tail processing so the same association order is used at every SIMD width. The blog also stresses adhering to IEEE754 rules and avoiding compiler optimizations that introduce variability, and warns against fast approximations and fused operations, which compromise invariance. The author concludes that while achieving invariance may seem complex, careful attention to detail leads to stable outputs.
- The astcenc codec aims for consistent output across various SIMD instruction sets.
- Floating-point arithmetic can introduce variability due to precision and order of operations.
- Standardizing on fixed-width accumulators helps maintain invariance in vectorized code (see the sketch after this list).
- Compiler settings and optimizations can significantly impact floating-point determinism.
- Fast approximations and fused operations should be avoided to ensure consistent results.
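The approach summarized above pins the accumulation order to a fixed vector width regardless of the hardware's native width. Below is a minimal, illustrative sketch of that idea in plain C++; it is not astcenc's actual code, and the 4-wide width and function name are assumptions chosen for the example.

```c++
#include <array>
#include <cstddef>
#include <vector>

// Minimal sketch (not astcenc's actual code): accumulate into a fixed
// 4-wide accumulator so that scalar, NEON, SSE4.1, and AVX2 builds all
// perform the additions in the same order. The tail adds only the values
// that exist, which is equivalent to padding the last vector with zeros.
float invariant_sum(const std::vector<float>& data)
{
    std::array<float, 4> acc { 0.0f, 0.0f, 0.0f, 0.0f };

    std::size_t i = 0;
    std::size_t full = data.size() - (data.size() % 4);
    for (; i < full; i += 4)
    {
        for (std::size_t lane = 0; lane < 4; lane++)
        {
            acc[lane] += data[i + lane];
        }
    }

    // Loop tail: the remaining 0-3 values land in the low lanes, exactly as
    // if the input had been zero-padded to a multiple of the vector width.
    for (std::size_t lane = 0; i + lane < data.size(); lane++)
    {
        acc[lane] += data[i + lane];
    }

    // Horizontal reduction in a fixed order, independent of hardware width.
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```

Every backend has to use the same reduction order, and the build must not undo it: invariance also depends on compiling without -ffast-math and with floating-point contraction disabled (for example -ffp-contract=off on GCC and Clang), since a contracted fused multiply-add rounds once where a separate multiply and add round twice.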
Related
Own Constant Folder in C/C++
Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Summing ASCII encoded integers on Haswell at almost the speed of memcpy
Matt Stuchlik presents a high-performance algorithm for summing ASCII-encoded integers on Haswell systems. It utilizes SIMD instructions, lookup tables, and efficient operations to achieve speed enhancements, showcasing innovative approaches in integer sum calculations.
tolower() with AVX-512
Tony Finch's blog post details the implementation of the tolower() function using AVX-512BW SIMD instructions, optimizing string processing and outperforming standard methods, particularly for short strings.
An SVE backend for astcenc (Adaptive Scalable Texture Compression Encoder)
The implementation of a 256-bit SVE backend for astcenc shows performance improvements of 14% to 63%, utilizing predicated operations and scatter/gather instructions, with future work planned for SVE2.
The question here with the association order for summation is what you want to match. OP chose to match the scalar for-loop equivalent. You could just as easily define an 8-wide or 16-wide "virtual vector" and match that instead.
I suspect an 8-wide virtual vector is the right default currently: Intel systems since Haswell support it, as do all recent AMD parts, and if you're vectorizing anyway you can afford some overhead on Arm from emulating a double-width virtual vector. You don't often gain enough from AVX-512 to make the default 16-wide, but if you wanted to focus on Skylake+ (really Cascade Lake+) or Genoa+ systems, it would be a fine choice.
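One way to picture the "virtual vector" suggestion is to fix the reference accumulation order at eight lanes and have every backend reproduce it regardless of register width. The sketch below shows how a 4-wide SSE backend could do that; the function name, tail handling, and reduction order are illustrative assumptions, not code from the article or the comment.

```c++
#include <immintrin.h>
#include <cstddef>

// Illustrative sketch: a 4-wide SSE backend reproducing an 8-wide
// "virtual vector" accumulation order by holding the virtual accumulator
// in two __m128 registers. An AVX2 backend would use one 8-wide register,
// and a 16-wide backend half a register, all matching the same order.
float virtual_vector_sum_sse(const float* data, std::size_t count)
{
    __m128 acc_lo = _mm_setzero_ps(); // virtual lanes 0-3
    __m128 acc_hi = _mm_setzero_ps(); // virtual lanes 4-7

    std::size_t i = 0;
    for (; i + 8 <= count; i += 8)
    {
        acc_lo = _mm_add_ps(acc_lo, _mm_loadu_ps(data + i));
        acc_hi = _mm_add_ps(acc_hi, _mm_loadu_ps(data + i + 4));
    }

    // Loop tail: pad the remaining 0-7 values with zeros so the final
    // partial vector uses the same association order as a full one.
    if (i < count)
    {
        float tail[8] = {};
        for (std::size_t lane = 0; i + lane < count; lane++)
        {
            tail[lane] = data[i + lane];
        }
        acc_lo = _mm_add_ps(acc_lo, _mm_loadu_ps(tail));
        acc_hi = _mm_add_ps(acc_hi, _mm_loadu_ps(tail + 4));
    }

    // Fixed-order horizontal reduction over the eight virtual lanes.
    float lanes[8];
    _mm_storeu_ps(lanes, acc_lo);
    _mm_storeu_ps(lanes + 4, acc_hi);

    float s01 = lanes[0] + lanes[1];
    float s23 = lanes[2] + lanes[3];
    float s45 = lanes[4] + lanes[5];
    float s67 = lanes[6] + lanes[7];
    return (s01 + s23) + (s45 + s67);
}
```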
It seems to me that non-associative floating-point operations force us into a local maximum. The operations themselves might be efficient on modern machines, but could their lack of associativity be preventing us from applying other important high-level optimizations to our programs? A richer algebraic structure should always be amenable to a richer set of potential optimizations.
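For a concrete sense of the non-associativity being discussed, the two groupings below produce different double-precision results; this is a standalone illustration, not from the comment.

```c++
#include <cstdio>

int main()
{
    double a = 0.1, b = 0.2, c = 0.3;

    // The two groupings round differently, so the results differ in the
    // last bit of the double.
    std::printf("%.17g\n", (a + b) + c); // prints 0.60000000000000009
    std::printf("%.17g\n", a + (b + c)); // prints 0.59999999999999998
    return 0;
}
```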
---
I've asked a question that is very much related to that topic on the programming language subreddit:
"Could numerical operations be optimized by using algebraic properties that are not present in floating point operations but in numbers that have infinite precision?"
https://www.reddit.com/r/ProgrammingLanguages/comments/145kp...
The responses there might be interesting to some people here.