October 7th, 2024

Smolderingly Fast B-Trees

The article compares hashmaps and b-trees in scripting languages, highlighting hashmaps' vulnerabilities and b-trees' stability. Benchmarks show b-trees can compete under certain conditions, influenced by key types and optimizations.

Read original article

CuriosityAppreciationSkepticism

The article discusses the performance comparison between hashmaps and b-trees, particularly in the context of scripting languages like Rust and Zig. It highlights the limitations of hashmaps, such as vulnerability to hash flooding, non-deterministic iteration order, and performance issues during rehashing. In contrast, b-trees are presented as a more stable alternative, though they are generally slower. The author conducts various benchmarks to measure lookup times for both data structures under different conditions, revealing that while b-trees may appear slower in some tests, they can perform comparably in others, especially when considering the impact of speculative execution on hashmaps. The article also notes that the performance of b-trees can be affected by the type of keys used, particularly with string keys, where cache lookups for comparisons can become costly. The author suggests optimizations for b-trees, such as adjusting node sizes and layouts, to improve performance. Overall, the findings indicate that while hashmaps are often faster, b-trees offer advantages in certain scenarios, particularly in terms of stability and predictability.

- B-trees provide a stable alternative to hashmaps, avoiding issues like hash flooding and non-deterministic iteration.

- Performance benchmarks show that b-trees can be competitive with hashmaps under specific conditions.

- The type of keys used significantly affects the performance of b-trees, especially with string keys.

- Optimizations in b-tree implementation, such as node size and layout adjustments, can enhance performance.

- Speculative execution benefits hashmaps more than b-trees, impacting their relative performance in lookups.

Ask HN: Fast data structures for disjoint intervals?

The author seeks recommendations for innovative data structures to improve read speeds for managing disjoint time intervals, noting that existing solutions often do not outperform simple ordered maps.

An extensive benchmark of C and C++ hash tables

The article introduces a benchmarking suite for C and C++ hash-table libraries, evaluating performance across various configurations and operations, while noting limitations in memory usage metrics and lower load factors.

AI: What people are saying

The comments on the article reveal a variety of perspectives on the comparison between hashmaps and b-trees.

Several commenters suggest exploring alternative data structures, such as skiplists, crit-bit trees, and adaptive radix trees, which may offer better performance than b-trees or hashmaps.
There is a discussion about the impact of memory usage and the efficiency of different data structures in various scenarios, particularly when data fits in core memory.
Some commenters express interest in the methodology of the benchmarks, questioning the choice of hash functions and suggesting additional optimizations for b-trees.
Concerns are raised about the article's conclusions, with some feeling that the performance differences highlighted may not be significant enough to favor b-trees over hashmaps.
There is a call for more comprehensive benchmarks and real-world comparisons to better understand the performance of these data structures in practical applications.

19 comments

By @BeeOnRope - 3 months

To answer a question implied in the article, per-lookup timing with rdtscp hurts the hash more than the btree for the same reason the hash is hurt by the data-depending chaining: rdtscp is an execution barrier which prevents successive lookups from overlapping. rdtsc (no p) isn't, and would probably produce quite different timings.

That the btree doesn't benefit from overlapping adjacent lookup/inserts is intereting.

I suppose it is because btree access (here) involves data-dependent branches, and so with random access you'll get about L mispredicts per lookup in an L-deep tree, so adjacent lookups are separated by at least one mispredict: so adjacent lookup can overlap execution, but the overlapping is useless since everything beyond the next mispredict is useless as it is on the bad path.

That's probably at least true for the small map regime. For the larger maps, the next iteration is actually very useful, even if on a mispredicted path, because the date accesses are at the right location so it serves to bring in all the nodes for the next iteration. This matters a lot outside of L2. At 5 instructions per comparison and 32-element nodes, however, there are just so many instructions in the window for 1 lookup it's hard to make it to the next iteration.

So b-trees benefit a lot from a tight linear seach (e.g. 2 instructions per check, macro-fused to 1 op), or a branch-free linear search, or far better than those for big nodes, a vectorized branch-free search.

By @scythmic_waves - 3 months

I appreciate write-ups of failed experiments like this. They're sorta like null results in science, but for engineering. And they can help others from needlessly walking down the same path.

If everyone only wrote about their successes, we'd all have to independently rediscover failures behind closed doors.

By @aidenn0 - 3 months

It's been a while since I last tried things, but I found crit-bit trees[1] to be much faster than b-trees. Hash array-mapped tries are also good if you don't need the things that trees give you (e.g. in-order traversal, get all values in a certain range).

1: https://cr.yp.to/critbit.html

By @josephg - 3 months

I'd be curious to see how performance would change from storing b-tree entries in a semi-sorted array, and applying various other optimizations from here:

https://en.algorithmica.org/hpc/data-structures/b-tree/

The aggregate performance improvements Sergey Slotin gets from applying various "tricks" is insane.

By @ur-whale - 3 months

Unless I'm missing something, title of the article doesn't really correlate with its conclusion.

By @kibo_money - 3 months

Very interesting ! You mentioned the memory usage at the end, BTreeMaps are actually better than HashMaps most of the time, at least for Rust

Here's a good break down: https://ntietz.com/blog/rust-hashmap-overhead/

By @cmrdporcupine - 3 months

Appreciate the attention to detail on the microbenchmarks.

Skimming through, need to read in more detail later, but what I would love to see is a real world comparison against just linear search in a vector. Either of associated pairs, or two vectors (one for key, one for value, with matching offsets).

My hunch is that people in general are more often than not reaching for hash-tables (and sometimes trees) for the API convenience of an associative structure -- but that on modern architectures with decent cache sizes and for small(ish) data sets they can be outperformed by a simple O(N) lookup.

For example, it would be an interesting experiment to take something like the Python runtime (or the JDK etc) and replace its dictionary type with vectors -- at least for small dictionaries -- and then run through some common applications and frameworks and see what impact this has.

By @ww520 - 3 months

Adaptive radix tree is pretty good as well, with support for in order listing and range query. It can beat b-tree and come closely behind hashmap.

By @orlp - 3 months

Why was Rust's hashmap only tested with SipHash? It's known to be pretty bad for performance.

I'm biased as the author of course, but try adding a benchmark with the Rust hasher + foldhash as well: https://github.com/orlp/foldhash.

By @pjdesno - 3 months

It would be interesting to compare the Python sortedcontainers algorithm - I've been using a C++ version of it that works quite well.

Note also that nodes in B-trees (and other splitting-based data structures) have a mean load factor more like 75% - 50% is the minimum load for non-root nodes, right after splitting, and 100% is the max right before splitting.

By @evanmoran - 2 months

This post reminded me of this killer btree implementation for swift. The analysis on GitHub is fantastic and reads like a thriller novel :)

https://github.com/attaswift/BTree#why

By @helltone - 3 months

Possibly off topic, but I was wondering: what are the most comprehensive data structure benchmarks out there?

By @waynecochran - 3 months

If you can keep all the data in core memory why not just use you favorite balanced binary search tree (BST) like a red-black tree. B-Trees only help performance, at least in common understanding, when all the data can not be in core memory (which is why databases use them).

By @winwang - 3 months

Would be really interesting to see a similar study but with skiplists! I'd imagine it would be slower than hashmaps for many of the exact same reasons outlined in the article, but numbers would be cool.

By @tekknolagi - 3 months

I thought a lot of b(+)tree advantage was in bigger-than-RAM something or other for large databases and these benchmarks seem relatively small in comparison

By @nialv7 - 3 months

I feel I missed point of this article. I thought the author is trying to prove that b-tren isn't that bad compared to hashmaps. But taking 2~3x longer looks pretty bad.

If I need predictable ordering (but not actually sorting the keys) I will use something like indexmap, not b-tree.

By @lsb - 3 months

Clojure, for example, uses Hash Array Mapped Tries as its associative data structure, and those work well

By @georgehill - 3 months

By @BeeOnRope - 3 months

Nice article!

Very cool to see both the "independent" and "serially dependent" cases addressed. Microbenchmarks still have lots of ways of giving the wrong answer, but looking at both these cases exposes one of the big variables which cause that.

In my experience looking at container performance you often pass through two distinct regimes (in a microbenchmark!):

Small regime: for small containers, instruction count, instruction dependencies and IPC (including the effect of branch missed) dominate.

In this regime fastest container in a "throughput" sense will often be the one with fewest micro-operations (including those executed on the wrong-path). Fewer operations helps both in raw speed and also in overlapping more multiple independent lookups within the instruction window. Any type of per-lookup misprediction is hugely destructive to performance. For random keys, this often favors hash tables because they can be written to have << 1 mispredict per lookup.

In this small regime the fastest container in a latency sense is the one with the shortest critical path from input to output, again considering mispredicts. The non-memory latency instruction will be very important in this critical path and again mispredicts are very destructive since usually mispredicts add directly to the critical path (not always!). There are lots of tricks to keeping the critical path including hashes with somewhat higher operation counts but smaller critical paths (e.g., a short merge), linear searches which have > 1 independent stream, etc. If the keys are predictable, hashes containers can look bad because they tend to have a long data-dependency from the hash through the lookup to the output. Tree-like containers tend to replace those with control, so the data-dependent critical path can be very short! With random keys, hashes win again because mispredicts are so destructive.

Then in the large regime, a lot of the same themes repeat but instead of applying to "all instructions", it's mostly about memory access. I.e., the winning throughput containers are the ones that can get the highest useful MLP, and the winning latency containers are the ones with the shortest critical path of data-dependent memory accesses, mostly ignoring everything else. Instructions still matter because MLP is often dependent on how many accesses you can stuff into the processors OoOE execution window, and the size of that structure is (roughly speaking) counted in instructions. Software prefetching can help a ton with stuffing the right memory accesses in the OoOE window.

For random keys and "usually hit", hashes again tend to win in this regime, because they can usually get down to 1 miss per lookup, and that's math the other structures just can't overcome. For non-random keys, the field is wide open, it depends heavily on the key distribution. For lookups which often miss there are plenty of ways to break the 1 miss-per-lookup barrier too.

Smolderingly Fast B-Trees

Related

Ask HN: Fast data structures for disjoint intervals?

An extensive benchmark of C and C++ hash tables

Related

Ask HN: Fast data structures for disjoint intervals?

An extensive benchmark of C and C++ hash tables