July 11th, 2024

Beating the L1 cache with value speculation (2021)

Value speculation leverages branch predictor to guess values, enhancing instruction parallelism and L1 cache efficiency. Demonstrated on Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting linked list nodes.

Read original articleLink Icon
Beating the L1 cache with value speculation (2021)

Value speculation is a technique that leverages the branch predictor to guess values, enabling more instruction parallelism and removing bottlenecks on the L1 cache. By guessing the next node in a linked list, the CPU can avoid waiting for data dependencies, significantly improving performance. This technique is demonstrated through a real-world example on a Xeon E5-1650 v3 processor, showing a throughput increase from 14GB/s to 30GB/s for certain datasets. The trick exploits the fact that nodes in a linked list are usually contiguous in memory, allowing for speculative value guessing. By reducing the impact of data dependencies and utilizing the branch predictor, the CPU can achieve higher throughput rates by avoiding delays related to reading from the L1 cache. This optimization showcases the intricate workings of modern CPUs in handling instructions in parallel and highlights the importance of understanding branch prediction and CPU cache behavior for performance enhancements in low-level code execution.

Link Icon 24 comments
By @Remnant44 - 6 months
It's rare to need to work at this level of optimization, but this is a really neat trick!

Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if its wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!

Clever.

By @pkhuong - 6 months
I see a lot of people asking for a real use case. If you follow the reference chain in the first aside, you'll find this blog post of mine https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-poin.... where we use value speculation to keep MOVS out of the critical path in an interrupt-atomic read sequence for hazard pointers.
By @mistercow - 6 months
It’s interesting to me that the final assembly-trick-free version almost no longer looks like a hack.

If you commented the inner loop with something like “// Linear scan for adjacent nodes”, the reader gets an OK, if incomplete, intuition for why it’s faster. Even if you don’t know the exact CPU details, if you’re aware that flat arrays usually loop faster than contiguous linked lists, the nested loop immediately reads as a kind of “hybrid mode”.

By @metadat - 6 months
In case your knowledge of the mechanics of `struct' vs `typedef struct' in C are rusty like mine, here are nice refreshers:

https://stackoverflow.com/a/23660072

https://stackoverflow.com/a/1675446

By @kristianp - 6 months
Per Vognsen (referenced in this blog) is now found on Mastodon at : https://mastodon.social/@pervognsen

He's just published "Finding Simple Rewrite Rules for the JIT with Z3":

https://www.pypy.org/posts/2024/07/finding-simple-rewrite-ru...

https://news.ycombinator.com/item?id=40951900

By @12_throw_away - 6 months
This got me wondering - it's said that C is based on the lie that all computers have the same architecture as a PDP-11. (At least, I'm pretty sure I remember people saying that).

So, are there any programming languages that have updated architectural models, something that takes into account branch prediction, CPU caches, etc?

By @queuebert - 6 months
I appreciate the elegant blog design. Reminds me of Edward Tufte's books.
By @mwkaufma - 6 months
The optimization is the linear memory layout of the nodes -- value speculation is decoration.
By @WantonQuantum - 6 months
This is great! Mostly when I think about branch prediction, I'm thinking about the end of a loop so this was a great read.

There have been a lot of comments about the example presented being quite artificial and I agree but it is simple enough to help the reader understand what's happening and why it's faster.

In fact, it would be fairly common for the nodes in linked lists to be sequential in ram anyway. For example this code shows that the next node is easy to guess. The nodes do end up exactly in sequence in memory:

  #include <stdlib.h>
  #include <stdio.h>

  typedef struct Node {
    int value;
    struct Node *next;
  } Node;

  Node *head = NULL;
  Node *tail = NULL;

  int main(int argc, char **argv) {

    // Allocate some nodes
    for (int i = 0; i < 100; i++) {
      Node *new_node = malloc(sizeof(Node));
      new_node->value = rand();
      new_node->next = NULL;
      if (tail == NULL) {
        head = tail = new_node;
      } else {
        tail->next = new_node;
        tail = new_node;
      }
    }

    // Print their locations in memory
    for (Node *current = head; current->next != NULL; current = current-> next) {
      printf("%p\n", current);
    }
  }
By @zogrodea - 6 months
Value speculation is a neat trick. I was also surprised a low-level hack like this worked in a high-level language like OCaml.

https://news.ycombinator.com/item?id=35844078

By @PoignardAzur - 6 months
It's a neat trick, but I think a linked list (with the very specific layout where nodes are allocated in order) is the only situation where this trick could possibly be useful?

And I think it only works if Spectre mitigations are disabled anyway?

What the trick does is replace sequential fetches (where each fetch address depends on the result of the previous fetch because, well, linked lists) with parallel fetches. It takes the minimum fetch-to-fetch latency from a L1 cache hit (roughly 3 cycles IIRC) to a cycle or less (most CPUs can do multiple parallel fetches per cycle).

If your data is stored in a vector or a B-tree, accesses are already parallel by default and you'll never need this trick.

By @bee_rider - 6 months
Hmm. What do we think about the alternative guess that the address of the next node’s value field is the next address after our current node’s value field memory address? This is, I guess, essentially a guess that we’re pointing at sequential elements of a big array, which sort of begs the question “why not just use an array?” But I’m wonder if a slightly permuted array is not so unusual, or at least might come up occasionally.
By @BobbyTables2 - 6 months
The example is slightly contrived (the entire linked list was preallocated), but seems like the same technique could be a useful performance optimization, such as if successive calls to malloc commonly return pointers with a particular stride.
By @alfiedotwtf - 6 months
Ok… all the comments are pretty nitty gritty obscure. So unless you’re a compiler hacker or HFT assembly dev, where can someone like me learn all this stuff from (besides Intel/Arm manuals, even though the i386 manuals were nice)
By @oersted - 6 months
I enjoyed the read and it taught me new things, I just wish that the reference example would have some minimal practical value.

I don’t think there is any reasonable scenario where you would be using a linked list but the memory is contiguous most of the time.

By @flysand7 - 6 months
I may have misread the graphs, but I didn't see the article feature the comparison between the throughput when going over a fully-contiguous linked list vs. randomized linked list?
By @IshKebab - 6 months
Neat trick. Though it seems unlikely to be very useful in practice. How often are you going to know the probably value of a pointer without knowing the actual value? I would guess it's pretty rare. Interesting anyway!
By @mnw21cam - 6 months
The article states that the CPU has a limit of 4 instructions per cycle, but the sum2 method issues 5 instructions per cycle. Presumably one of them (maybe the increment) is trivial enough to be executed as a fifth instruction.
By @gpderetta - 6 months
Nice article!

Incidentally, value speculation (or prediction) is a way to break causality in concurrent memory models.

By @bobmcnamara - 6 months
The nodes are adjacent.
By @Atharshah - 6 months
I will change my game level