Do low-level optimizations matter? Faster quicksort with cmov (2020)
The article emphasizes the importance of low-level optimizations in sorting algorithms, highlighting how modern CPU features and minimizing conditional branches can enhance performance, particularly with the new `swap_if` primitive.
The article discusses the relevance of low-level optimizations in sorting algorithms in the context of modern computing hardware. It begins with the historical context of sorting research and the evolution of sorting algorithms. The author emphasizes that while sorting is a fundamental operation, the efficiency of sorting algorithms can be significantly affected by the underlying hardware architecture, which has changed dramatically over the years. The article presents various sorting implementations, including standard library sorts and radix sorts, comparing their execution times. It notes that modern CPUs have advanced features such as branch prediction and speculative execution, which can affect the performance of sorting algorithms, especially on random data. The author suggests that to optimize sorting algorithms for contemporary CPUs, it is essential to minimize conditional branches that depend on input data. This leads to the introduction of a new primitive, `swap_if`, which performs a conditional swap without branching, thereby improving performance. The article concludes that understanding and adapting sorting algorithms to leverage modern CPU capabilities can lead to significant performance gains.
- Low-level optimizations in sorting algorithms can significantly impact performance.
- Modern CPUs have advanced features that affect how sorting algorithms perform.
- Minimizing conditional branches in sorting algorithms can enhance efficiency.
- The introduction of primitives like `swap_if` can help optimize sorting operations.
- Historical sorting research remains relevant as hardware continues to evolve.
Related
Atomicless Per-Core Concurrency
The article explores atomicless concurrency for efficient allocator design, transitioning from per-thread to per-CPU structures on Linux. It details implementing CPU-local data structures using restartable sequences and rseq syscall, addressing challenges in Rust.
C++ Design Patterns for Low-Latency Applications
The article delves into C++ design patterns for low-latency applications, emphasizing optimizations for high-frequency trading. Techniques include cache prewarming, constexpr usage, loop unrolling, and hotpath/coldpath separation. It also covers comparisons, datatypes, lock-free programming, and memory access optimizations. Importance of code optimization is underscored.
How Clang compiles a function (2018)
The article explains how Clang compiles a simple C++ function into LLVM IR, detailing memory allocation, control flow, and the representation of loops, with a focus on unoptimized output.
Clang vs. Clang
The blog post critiques compiler optimizations in Clang, arguing they often introduce bugs and security vulnerabilities, diminish performance gains, and create timing channels, urging a reevaluation of current practices.
CPU Dispatching: Make your code both portable and fast (2020)
CPU dispatching improves software performance and portability by allowing binaries to select code versions based on CPU features at runtime, with manual and compiler-assisted approaches enhancing efficiency, especially using SIMD instructions.
- There are corrections regarding the history of the CMOV instruction, clarifying its introduction by Intel in 1995 and later by AMD.
- Several commenters emphasize the limitations of the article's conclusions, particularly regarding the use of random data in performance tests.
- Discussions on the effectiveness of different sorting algorithms, including criticisms of the Lomuto partitioning method used in quicksort.
- Suggestions for alternative functions like std::select to improve performance without branches.
- Concerns about the relevance of the optimizations discussed for real-world applications and the importance of benchmarking.
CMOV was added by Intel in 1995, with the Pentium Pro. It was the most important addition the Pentium Pro made to the x86 ISA. So CMOV was supported by the Pentium Pro and its successors, like the Pentium II or Pentium III, but it was not supported by the Pentium, the Pentium with MMX, or AMD K6 CPUs. AMD added CMOV starting with the Athlon, in 1999.
The Pentium Pro was the first Intel CPU with out-of-order execution, and it performed much more speculative execution than its predecessors. It therefore needed CMOV to limit the performance loss caused by branch mispredictions.
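To make the connection concrete: a conditional move replaces a jump with a data dependency, which is roughly the difference between these two equivalent functions (illustrative C++, not from the article; whether a compiler actually emits `cmov` depends on its own cost model):

```cpp
int branchless_min(int a, int b) {
    // A data-dependent select like this is the pattern compilers can
    // lower to CMOV on x86; nothing in standard C++ forces the
    // instruction, it is purely an optimizer decision.
    return a < b ? a : b;
}

int branchy_min(int a, int b) {
    // The same logic expressed as a branch, which instead leans on
    // the branch predictor discussed above.
    if (a < b) return a;
    return b;
}
```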
Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.
Linus famously ranted about CMOV in https://yarchive.net/comp/linux/cmov.html (2007, so more modern architectures may handle some of this better), where he says: "if you KNOW the branch is totally unpredictable, cmov is often good for performance. But a compiler almost never knows that."
As usual with optimizations like this, you have to benchmark, and even then, if your sample isn't representative it might not mean much.
Wasn't "cmov" one of the things added for the Pentium Pro? So it wasn't instruction-compatible, hence the "i686" prefix on a lot of compiler triples?
Something similar is coming to Rust as well: https://doc.rust-lang.org/nightly/core/intrinsics/fn.select_...
The interesting part is how that was done:
> A New Primitive, swap_if
> How can we use this method in our sort? First, let us make a swap_if:

```cpp
inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c; // false -> 0, true -> 111..111
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}
```

> In our partition function, then, we can transform

```cpp
if (*right <= pivot) {
    int tmp = *left; *left = *right; *right = tmp;
    ++left;
}
```

> into just

```cpp
left += swap_if(*right <= pivot, *left, *right);
```
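As a rough, self-contained sketch of how that transformation fits into a Lomuto-style partition loop (`partition_branchless` and its pointer interface are illustrative, not the article's actual code):

```cpp
#include <cassert>
#include <vector>

// swap_if as given in the article: branchless conditional swap via masks.
inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c;              // false -> 0, true -> all ones
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}

// Hypothetical partition built on swap_if: moves elements <= pivot to
// the front and returns the split point. Note the loop body contains no
// data-dependent branch, only the comparison fed into swap_if.
int* partition_branchless(int* left, int* last, int pivot) {
    for (int* right = left; right != last; ++right)
        left += swap_if(*right <= pivot, *left, *right);
    return left;
}
```

The self-swap case (`left == right` with a true condition) is safe here because `ta` is saved before either reference is written.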
The Lomuto partitioning is a naive vandalism of Hoare's original, which causes quicksort to have quadratic behavior on sorted input.
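For contrast, a sketch of the scheme the comment prefers: Hoare's original partition, in a generic textbook form (not code from the article). Two indices move toward each other and only out-of-place pairs are swapped, which does fewer swaps on average than Lomuto and avoids the pathological behavior on sorted input when the pivot is taken from the middle:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hoare partition: returns j such that v[lo..j] <= pivot <= v[j+1..hi].
// Unlike Lomuto, the pivot does not end up at a fixed final position.
int hoare_partition(std::vector<int>& v, int lo, int hi) {
    int pivot = v[lo + (hi - lo) / 2];  // middle element as pivot
    int i = lo - 1, j = hi + 1;
    while (true) {
        do { ++i; } while (v[i] < pivot);
        do { --j; } while (v[j] > pivot);
        if (i >= j) return j;
        std::swap(v[i], v[j]);
    }
}
```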
simdjson is another example that comes to mind. The conceit of C is that you don't have control over the underlying machine instructions without inlining them yourself. So how do people write C code with CPU optimizations like this?
Rethinking Binary Search: Improving on a Classic with AI Assistance: https://www.youtube.com/watch?v=FAGf5Xr8HZU
The gist is rather simple: assuming that a substantial amount of your searches come up empty, you can bias the binary search to jump farther than just half, improving runtime noticeably.
If your primary audience is other developers, then it absolutely does not matter. In fact it's a way to expose deep anger: all that matters is convenience, and measurements just piss people off, because measuring anything is just too hard.
If your primary audience is business, especially transactional finance, then absolutely yes. Squeeze absolutely every millisecond out, measure absolutely everything in terms of transaction quantity, focus on concurrent operations.
So the baseline radix sort can be even faster.