August 21st, 2024

Do low-level optimizations matter? Faster quicksort with cmov (2020)

The article emphasizes the importance of low-level optimizations in sorting algorithms, highlighting how modern CPU features reward minimizing data-dependent conditional branches, particularly via the new `swap_if` primitive.

The article discusses the relevance of low-level optimizations in sorting algorithms on modern hardware. It opens with the historical context of sorting research and the evolution of sorting algorithms, and emphasizes that while sorting is a fundamental operation, an algorithm's efficiency depends heavily on the underlying hardware architecture, which has changed dramatically over the years. The article compares the execution times of several implementations, including standard library sorts and radix sorts, and notes that modern CPU features such as branch prediction and speculative execution strongly affect sorting performance, especially on random data. The author argues that optimizing sorting for contemporary CPUs means minimizing conditional branches that depend on the input data, and introduces a new primitive, `swap_if`, which performs a conditional swap without branching. The article concludes that understanding and adapting sorting algorithms to leverage modern CPU capabilities can yield significant performance gains.

- Low-level optimizations in sorting algorithms can significantly impact performance.

- Modern CPUs have advanced features that affect how sorting algorithms perform.

- Minimizing conditional branches in sorting algorithms can enhance efficiency.

- The introduction of primitives like `swap_if` can help optimize sorting operations (see the sketch after this list).

- Historical sorting research remains relevant as hardware continues to evolve.
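
A minimal sketch of the branchless idea the article builds on (an illustration, not the article's code): replacing a data-dependent branch with a mask-based select gives the compiler something it can lower to a conditional move (cmov) rather than a branch that must be predicted.

  #include <cstdint>

  // Branchy version: the CPU must predict (a < b), and on random data the
  // predictor is wrong roughly half the time.
  inline uint32_t min_branchy(uint32_t a, uint32_t b) {
    if (a < b) return a;
    return b;
  }

  // Branchless version: build an all-ones or all-zeros mask from the
  // comparison and select with bitwise ops, so there is no branch to
  // mispredict.
  inline uint32_t min_branchless(uint32_t a, uint32_t b) {
    uint32_t mask = 0u - static_cast<uint32_t>(a < b);  // a < b ? all ones : 0
    return (a & mask) | (b & ~mask);
  }

The article's `swap_if`, quoted in the comments below, applies the same masking trick to a conditional swap inside the quicksort partition loop.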

AI: What people are saying
The comments on the article reflect a mix of technical insights and critiques regarding sorting algorithms and low-level optimizations.
  • There are corrections regarding the history of the CMOV instruction, clarifying its introduction by Intel in 1995 and later by AMD.
  • Several commenters emphasize the limitations of the article's conclusions, particularly regarding the use of random data in performance tests.
  • Discussions on the effectiveness of different sorting algorithms, including criticisms of the Lomuto partitioning method used in quicksort.
  • Suggestions for alternative functions like std::select to improve performance without branches.
  • Concerns about the relevance of the optimizations discussed for real-world applications and the importance of benchmarking.
15 comments
By @adrian_b - 3 months
A correction to the history mentioned in the article: CMOV was not added by AMD around 2000.

CMOV was added by Intel in 1995, with the Pentium Pro; it was the most important addition that the Pentium Pro made to the x86 ISA. So CMOV was supported by the Pentium Pro and its successors, such as the Pentium II and Pentium III, but not by the Pentium, the Pentium with MMX, or the AMD K6 CPUs. AMD added CMOV starting with the Athlon, in 1999.

The Pentium Pro was the first Intel CPU with out-of-order execution, and it did far more speculative execution than its predecessors. It therefore needed CMOV to limit the performance loss caused by branch mispredictions.

By @dgl - 3 months
The most important bit of this is in the conclusion:

  Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.
Linus famously ranted about CMOV in https://yarchive.net/comp/linux/cmov.html (2007, so potentially more modern architectures are better at some of this) and he says:

  if you KNOW the branch is totally unpredictable, cmov is often good for
  performance. But a compiler almost never knows that.
As usual with optimizations like this, you have to benchmark, and even then, if your sample isn't representative, it might not mean much.
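
A toy illustration of that point (not from the linked rant or the article): the same comparison is cheap when the branch is predictable and expensive when it is not, which is exactly where a branchless/cmov form tends to win, and why the data you benchmark on matters.

  #include <cstdint>
  #include <vector>

  // Branchy: fast on sorted input (predictable), slow on random input
  // (roughly 50% mispredictions).
  uint64_t sum_below_branchy(const std::vector<uint32_t>& v, uint32_t limit) {
    uint64_t sum = 0;
    for (uint32_t x : v)
      if (x < limit) sum += x;
    return sum;
  }

  // Branchless: always adds either x or 0, so there is nothing to predict;
  // compilers typically lower the ternary to a conditional move.
  uint64_t sum_below_branchless(const std::vector<uint32_t>& v, uint32_t limit) {
    uint64_t sum = 0;
    for (uint32_t x : v)
      sum += (x < limit) ? x : 0u;
    return sum;
  }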
By @kimixa - 3 months
> Back in 2000, AMD included cmov in its 64-bit x86 ISA extensions. Then, Intel had to adopt them when Itanium flopped.

Wasn't "cmov" one of the things added for the pentium pro? So it wasn't instruction compatible - hence the "i686" prefix to a lot of compiler triples?

By @orlp - 3 months
Rather than std::swap_if, which is rather niche, I would prefer to see std::select(cond, if_true, if_false), with the guarantee that, unlike the ternary operator, it eagerly evaluates both arguments and selects between them branchlessly if possible.

Something similar is coming to Rust as well: https://doc.rust-lang.org/nightly/core/intrinsics/fn.select_...
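
A rough sketch of what such a select could look like for integers (the name and eager-evaluation semantics follow the suggestion above; this is not an existing standard library facility, and a real version would presumably be generic rather than fixed to one type):

  #include <cstdint>

  // Hypothetical select(): both arguments are evaluated by the caller before
  // the call (eager evaluation), and the choice itself is made with a mask
  // rather than a branch.
  inline int64_t select(bool cond, int64_t if_true, int64_t if_false) {
    int64_t mask = -static_cast<int64_t>(cond);  // cond ? all ones : all zeros
    return (if_true & mask) | (if_false & ~mask);
  }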

By @karmakaze - 3 months
I would expect eliminating branches in a busy inner loop to matter.

The interesting part is how that was done:

> A New Primitive, swap_if

> How can we use this method in our sort? First, let us make a swap_if:

  inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c;  // false -> 0, true -> 111..111
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
  }
> In our partition function, then, we can transform

    if (*right <= pivot) {
      int tmp = *left; *left = *right, *right = tmp;
      ++left;
    }
> into just

    left += swap_if(*right <= pivot, *left, *right);
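
Putting the quoted pieces together, a Lomuto-style partition built on swap_if might look roughly like this (a sketch assembled from the snippets above, not the article's exact code; the last-element pivot choice is an assumption):

  #include <utility>

  inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c;              // false -> 0, true -> 111..111
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
  }

  // Partitions the non-empty range [first, last) around its last element and
  // returns the pivot's final position.
  int* partition(int* first, int* last) {
    int pivot = *(last - 1);
    int* left = first;
    for (int* right = first; right != last - 1; ++right)
      left += swap_if(*right <= pivot, *left, *right);
    std::swap(*left, *(last - 1));      // place the pivot between the halves
    return left;
  }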
By @kazinator - 3 months
> I call this bog-standard, but the quicksorts most easily found online are always oddly pessimized, both more complicated and slower. This one uses the “Lomuto partition”, which is simpler than Hoare’s.

The Lomuto partitioning is a naive vandalism of Hoare's original, which causes quicksort to have quadratic behavior on sorted input.

By @memset - 3 months
Fascinating! How does one learn to write C code that will make use of specific asm instructions?

simdjson is another example that comes to mind. The conceit of C is that you don't have control over the underlying machine instructions without inlining it yourself. So how do people write C code with CPU optimizations like this?

By @lynx23 - 3 months
Heh, nice! This article reminded me of a great talk I recently noticed:

Rethinking Binary Search: Improving on a Classic with AI Assistance: https://www.youtube.com/watch?v=FAGf5Xr8HZU

The gist is rather simple: assuming that a substantial share of your searches come up empty, you can bias the binary search to jump farther than just half, improving runtime noticeably.
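
A minimal sketch of the biasing idea (a paraphrase, not the talk's code): any probe point strictly inside the current range keeps binary search correct, so the split can be skewed away from the midpoint when lookups that find nothing dominate.

  #include <cstddef>
  #include <vector>

  // Returns true if key is present in the sorted vector v. bias = 0.5 is the
  // classic midpoint split; other values in (0, 1) skew where we probe first.
  bool biased_search(const std::vector<int>& v, int key, double bias = 0.75) {
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
      std::size_t mid = lo + static_cast<std::size_t>((hi - lo) * bias);
      if (mid >= hi) mid = hi - 1;      // keep the probe inside [lo, hi)
      if (v[mid] == key) return true;
      if (v[mid] < key) lo = mid + 1;
      else hi = mid;
    }
    return false;
  }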

By @austin-cheney - 3 months
Matter to whom?

If your primary audience is other developers, then it absolutely does not matter. In fact, it's a way to expose deep anger. All that matters is convenience, and measurements just piss people off, because measuring anything is just too hard.

If your primary audience is business, especially transactional finance, then absolutely yes. Squeeze absolutely every millisecond out, measure absolutely everything in terms of transaction quantity, focus on concurrent operations.

By @pcwalton - 3 months
Note that random data is not a common case for sorting algorithms. It'd be interesting to see how the numbers change on partially-, mostly-, and fully-sorted data.
By @MatthiasPortzel - 3 months
It's weird to admit that this is an arbitrary dataset that doesn't mimic real-world use cases, then admit that radix sort is far faster for this situation, and then conclude that low-level optimizations are important and a cmov wrapper should be added to the standard library to make programs faster. It's a great post; I just disagree with the conclusion.
By @pronoiac - 3 months
Here's the article linked as https; I'll email the mods.

https://cantrip.org/sortfast.html

By @xiaodai - 3 months
The radix sort implementation is not optimal. Instead of sorting one digit at a time, it should sort 11 bits at a time to saturate the cache.

So the baseline radix sort can be even faster.
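
For concreteness, here is what the wider-digit version might look like, assuming 11-bit digits are meant (three counting passes over 32-bit keys with 2,048 buckets instead of many narrow passes); a sketch, not the benchmarked implementation:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // LSD radix sort on unsigned 32-bit keys, 11 bits per pass.
  void radix_sort_11(std::vector<uint32_t>& v) {
    constexpr int kBits = 11;
    constexpr std::size_t kRadix = std::size_t{1} << kBits;   // 2048 buckets
    std::vector<uint32_t> buf(v.size());
    for (int shift = 0; shift < 32; shift += kBits) {         // 3 passes
      std::size_t count[kRadix] = {};
      for (uint32_t x : v) ++count[(x >> shift) & (kRadix - 1)];
      std::size_t sum = 0;                            // prefix sum -> offsets
      for (std::size_t& c : count) { std::size_t t = c; c = sum; sum += t; }
      for (uint32_t x : v) buf[count[(x >> shift) & (kRadix - 1)]++] = x;
      v.swap(buf);                                    // output becomes input
    }
  }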

By @mbroncano - 3 months
(2020)