Beating the L1 cache with value speculation (2021)
Value speculation leverages the branch predictor to guess values, enabling more instruction-level parallelism and easing pressure on the L1 cache. Demonstrated on a Xeon E5-1650 v3, it boosts throughput from 14GB/s to 30GB/s by predicting the next node in a linked list.
Value speculation is a technique that leverages the branch predictor to guess values, enabling more instruction parallelism and removing bottlenecks on the L1 cache. By guessing the next node in a linked list, the CPU can avoid waiting on data dependencies and start the next fetch early, which significantly improves performance. The technique is demonstrated with a real-world example on a Xeon E5-1650 v3 processor, where throughput rises from 14GB/s to 30GB/s for certain datasets. The trick exploits the fact that nodes in a linked list are often contiguous in memory, which makes the next address easy to guess. By breaking the serial data dependency and letting the branch predictor absorb the occasional wrong guess, the CPU is no longer limited by the latency of each L1 read. The example illustrates how modern CPUs execute instructions in parallel and why understanding branch prediction and cache behavior matters for performance work in low-level code.
Related
Exploring How Cache Memory Works
Cache memory, crucial for programmers, stores data inside the CPU for quick access, bridging the CPU-RAM speed gap. Different cache levels vary in speed and capacity, optimizing performance and efficiency.
Microbenchmarking Return Address Branch Prediction (2018)
Modern processors use branch predictors like RAS to boost performance by predicting control flow. Microbenchmarking on Intel and AMD processors reveals RAS behavior, accuracy, and limitations, emphasizing accurate branch prediction for high performance.
Finnish startup says it can speed up any CPU by 100x
A Finnish startup, Flow Computing, introduces the Parallel Processing Unit (PPU) chip promising 100x CPU performance boost for AI and autonomous vehicles. Despite skepticism, CEO Timo Valtonen is optimistic about partnerships and industry adoption.
Do not taunt happy fun branch predictor
The author shares insights on optimizing AArch64 assembly code by reducing jumps in loops. Replacing ret with br x30 improved performance, leading to an 8.8x speed increase. Considerations on branch prediction and SIMD instructions are discussed.
Weird things I learned while writing an x86 emulator
The article explores writing an x86 and amd64 emulator for Time Travel Debugging, emphasizing x86 encoding, prefixes, flag behaviors, shift instructions, segment overrides, FS and GS segments, TEB structures, CPU configuration, and segment handling nuances in 32-bit and 64-bit modes.
Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop-carried dependencies and execute multiple accumulations each cycle.
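For a plain summation, that multiple-accumulator trick looks roughly like the sketch below (my illustration, not from the article; doubles are used because the compiler cannot reassociate floating-point additions on its own, so the single-accumulator loop really is latency bound):

    #include <stddef.h>

    // One accumulator: every addition waits on the previous one, so the
    // loop is bound by the floating-point add latency.
    double sum1(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    // Four accumulators: the four chains are independent, so a wide core
    // can keep several additions in flight each cycle; combine at the end.
    double sum4(const double *a, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)   // leftover elements
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }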
This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if it's wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!
Clever.
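A minimal sketch of that guess-and-correct pattern for the linked-list case (my own reconstruction, assuming nodes are allocated back to back so that node + 1 is usually the next node; it is not the article's exact code):

    #include <stdint.h>

    typedef struct Node {
        uint64_t value;
        struct Node *next;
    } Node;

    // Plain traversal: each iteration's fetch address depends on the
    // previous iteration's load, so the fetches are fully serialized.
    uint64_t sum_plain(Node *node) {
        uint64_t sum = 0;
        while (node) {
            sum += node->value;
            node = node->next;
        }
        return sum;
    }

    // Value speculation: guess that the next node sits right after the
    // current one, and let a usually-correctly-predicted branch fix the
    // rare wrong guess.
    uint64_t sum_speculative(Node *node) {
        uint64_t sum = 0;
        while (node) {
            sum += node->value;
            Node *guess = node + 1;      // speculated next node
            if (node->next != guess)     // almost never taken
                guess = node->next;      // correct a wrong guess
            node = guess;
        }
        return sum;
    }

One caveat: after the branch, guess always equals node->next, so an optimizing compiler is free to simplify the speculation away; in practice you may need to check the generated assembly to confirm the trick survived.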
If you commented the inner loop with something like “// Linear scan for adjacent nodes”, the reader gets an OK, if incomplete, intuition for why it’s faster. Even if you don’t know the exact CPU details, if you’re aware that flat arrays usually loop faster than contiguous linked lists, the nested loop immediately reads as a kind of “hybrid mode”.
He's just published "Finding Simple Rewrite Rules for the JIT with Z3":
https://www.pypy.org/posts/2024/07/finding-simple-rewrite-ru...
So, are there any programming languages that have updated architectural models, something that takes into account branch prediction, CPU caches, etc?
There have been a lot of comments about the example presented being quite artificial, and I agree, but it is simple enough to help the reader understand what's happening and why it's faster.
In fact, it would be fairly common for the nodes in a linked list to be sequential in RAM anyway. For example, this code shows that the next node is easy to guess. The nodes do end up exactly in sequence in memory:
#include <stdlib.h>
#include <stdio.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

Node *head = NULL;
Node *tail = NULL;

int main(void) {
    // Allocate some nodes
    for (int i = 0; i < 100; i++) {
        Node *new_node = malloc(sizeof(Node));
        new_node->value = rand();
        new_node->next = NULL;
        if (tail == NULL) {
            head = tail = new_node;
        } else {
            tail->next = new_node;
            tail = new_node;
        }
    }
    // Print their locations in memory
    for (Node *current = head; current != NULL; current = current->next) {
        printf("%p\n", (void *)current);
    }
    return 0;
}
And I think it only works if Spectre mitigations are disabled anyway?
What the trick does is replace sequential fetches (where each fetch address depends on the result of the previous fetch because, well, linked lists) with parallel fetches. It takes the minimum fetch-to-fetch latency from an L1 cache hit (roughly 3 cycles IIRC) to a cycle or less (most CPUs can do multiple parallel fetches per cycle); some rough numbers below.
If your data is stored in a vector or a B-tree, accesses are already parallel by default and you'll never need this trick.
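To put rough numbers on the latency point above (my arithmetic, assuming 16-byte nodes and a clock around 3.5 GHz): 14 GB/s works out to roughly 0.9 billion nodes per second, or about 4 cycles per node, which is what you would expect when each iteration has to wait out an L1 load before the next address is known; 30 GB/s is roughly 1.9 billion nodes per second, i.e. under 2 cycles per node, which is only reachable once several fetches are in flight at the same time.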
I don’t think there is any reasonable scenario where you would be using a linked list but the memory is contiguous most of the time.
Incidentally, value speculation (or prediction) is a way to break causality in concurrent memory models.