Do low-level optimizations matter? Faster quicksort with cmov (2020)
The article emphasizes the importance of low-level optimizations in sorting algorithms, highlighting how modern CPU features and minimizing conditional branches can enhance performance, particularly with the new `swap_if` primitive.
The article discusses the relevance of low-level optimizations in sorting algorithms in the context of modern computing hardware. It begins with the historical context of sorting research and the evolution of sorting algorithms. The author emphasizes that while sorting is a fundamental operation, the efficiency of sorting algorithms can be significantly affected by the underlying hardware architecture, which has changed dramatically over the years. The article presents various sorting implementations, including standard library sorts and radix sorts, comparing their execution times. It notes that modern CPUs have advanced features such as branch prediction and speculative execution, which can affect the performance of sorting algorithms, especially on random data. The author suggests that to optimize sorting algorithms for contemporary CPUs, it is essential to minimize conditional branches that depend on input data. This leads to the introduction of a new primitive, `swap_if`, which performs a conditional swap without branching, thereby improving performance. The article concludes that understanding and adapting sorting algorithms to leverage modern CPU capabilities can lead to significant performance gains.
- Low-level optimizations in sorting algorithms can significantly impact performance.
- Modern CPUs have advanced features that affect how sorting algorithms perform.
- Minimizing conditional branches in sorting algorithms can enhance efficiency.
- The introduction of primitives like `swap_if` can help optimize sorting operations.
- Historical sorting research remains relevant as hardware continues to evolve.
Related
Atomicless Per-Core Concurrency
The article explores atomicless concurrency for efficient allocator design, transitioning from per-thread to per-CPU structures on Linux. It details implementing CPU-local data structures using restartable sequences and rseq syscall, addressing challenges in Rust.
C++ Design Patterns for Low-Latency Applications
The article delves into C++ design patterns for low-latency applications, emphasizing optimizations for high-frequency trading. Techniques include cache prewarming, constexpr usage, loop unrolling, and hotpath/coldpath separation. It also covers comparisons, datatypes, lock-free programming, and memory access optimizations. Importance of code optimization is underscored.
How Clang compiles a function (2018)
The article explains how Clang compiles a simple C++ function into LLVM IR, detailing memory allocation, control flow, and the representation of loops, with a focus on unoptimized output.
Clang vs. Clang
The blog post critiques compiler optimizations in Clang, arguing they often introduce bugs and security vulnerabilities, diminish performance gains, and create timing channels, urging a reevaluation of current practices.
CPU Dispatching: Make your code both portable and fast (2020)
CPU dispatching improves software performance and portability by allowing binaries to select code versions based on CPU features at runtime, with manual and compiler-assisted approaches enhancing efficiency, especially using SIMD instructions.
- There are corrections regarding the history of the CMOV instruction, clarifying its introduction by Intel in 1995 and later by AMD.
- Several commenters emphasize the limitations of the article's conclusions, particularly regarding the use of random data in performance tests.
- Discussions on the effectiveness of different sorting algorithms, including criticisms of the Lomuto partitioning method used in quicksort.
- Suggestions for alternative functions like std::select to improve performance without branches.
- Concerns about the relevance of the optimizations discussed for real-world applications and the importance of benchmarking.
CMOV was added by Intel in 1995, with the Pentium Pro. It was the most important addition the Pentium Pro made to the x86 ISA. So CMOV was supported by the Pentium Pro and its successors, like the Pentium II or Pentium III, but it was not supported by the Pentium, the Pentium with MMX, or AMD K6 CPUs. AMD added CMOV starting with the Athlon, in 1999.
The Pentium Pro was the first Intel CPU with out-of-order execution, and it performed much more speculative execution than its predecessors. It therefore needed CMOV to limit the performance loss caused by branch mispredictions.
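To make the connection concrete: a conditional move replaces a jump with a data dependency, which is roughly the difference between these two equivalent functions (illustrative C++, not from the article; whether a compiler actually emits `cmov` depends on its own cost model):

```cpp
int branchless_min(int a, int b) {
    // A data-dependent select like this is the pattern compilers can
    // lower to CMOV on x86; nothing in standard C++ forces the
    // instruction, it is purely an optimizer decision.
    return a < b ? a : b;
}

int branchy_min(int a, int b) {
    // The same logic expressed as a branch, which instead leans on
    // the branch predictor discussed above.
    if (a < b) return a;
    return b;
}
```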
Before we conclude anything, we should remind ourselves of its limitations. The tests run were on completely random data. Truly random data seldom occurs in real life.
Linus famously ranted about CMOV in https://yarchive.net/comp/linux/cmov.html (2007, so more modern architectures may handle some of this better), where he says: "if you KNOW the branch is totally unpredictable, cmov is often good for performance. But a compiler almost never knows that."
As usual with optimizations like this, you have to benchmark, and even then, if your sample isn't representative it might not mean much.
Wasn't "cmov" one of the things added for the Pentium Pro? So it wasn't instruction-compatible, hence the "i686" prefix on a lot of compiler triples?
Something similar is coming to Rust as well: https://doc.rust-lang.org/nightly/core/intrinsics/fn.select_...
The interesting part is how that was done:
> A New Primitive, swap_if
> How can we use this method in our sort? First, let us make a swap_if:

```cpp
inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c; // false -> 0, true -> 111..111
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}
```

> In our partition function, then, we can transform

```cpp
if (*right <= pivot) {
    int tmp = *left; *left = *right; *right = tmp;
    ++left;
}
```

> into just

```cpp
left += swap_if(*right <= pivot, *left, *right);
```
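As a rough, self-contained sketch of how that transformation fits into a Lomuto-style partition loop (`partition_branchless` and its pointer interface are illustrative, not the article's actual code):

```cpp
#include <cassert>
#include <vector>

// swap_if as given in the article: branchless conditional swap via masks.
inline bool swap_if(bool c, int& a, int& b) {
    int ta = a, mask = -c;              // false -> 0, true -> all ones
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}

// Hypothetical partition built on swap_if: moves elements <= pivot to
// the front and returns the split point. Note the loop body contains no
// data-dependent branch, only the comparison fed into swap_if.
int* partition_branchless(int* left, int* last, int pivot) {
    for (int* right = left; right != last; ++right)
        left += swap_if(*right <= pivot, *left, *right);
    return left;
}
```

The self-swap case (`left == right` with a true condition) is safe here because `ta` is saved before either reference is written.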
The Lomuto partitioning is a naive vandalism of Hoare's original, which causes quicksort to have quadratic behavior on sorted input.
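For contrast, a sketch of the scheme the comment prefers: Hoare's original partition, in a generic textbook form (not code from the article). Two indices move toward each other and only out-of-place pairs are swapped, which does fewer swaps on average than Lomuto and avoids the pathological behavior on sorted input when the pivot is taken from the middle:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hoare partition: returns j such that v[lo..j] <= pivot <= v[j+1..hi].
// Unlike Lomuto, the pivot does not end up at a fixed final position.
int hoare_partition(std::vector<int>& v, int lo, int hi) {
    int pivot = v[lo + (hi - lo) / 2];  // middle element as pivot
    int i = lo - 1, j = hi + 1;
    while (true) {
        do { ++i; } while (v[i] < pivot);
        do { --j; } while (v[j] > pivot);
        if (i >= j) return j;
        std::swap(v[i], v[j]);
    }
}
```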
simdjson is another example that comes to mind. The conceit of C is that you don't have control over the underlying machine instructions without inlining them yourself. So how do people write C code with CPU optimizations like this?
Rethinking Binary Search: Improving on a Classic with AI Assistance: https://www.youtube.com/watch?v=FAGf5Xr8HZU
The gist is rather simple: assuming that a substantial amount of your searches come up empty, you can bias the binary search to jump farther than just half, improving runtime noticeably.
If your primary audience is other developers, then it absolutely does not matter. In fact it's a way to expose deep anger: all that matters is convenience, and measurements just piss people off, because measuring anything is just too hard.
If your primary audience is business, especially transactional finance, then absolutely yes. Squeeze absolutely every millisecond out, measure absolutely everything in terms of transaction quantity, focus on concurrent operations.
So the baseline radix sort can be even faster.