June 23rd, 2024

B-Trees Require Fewer Comparisons Than Balanced Binary Search Trees

B-trees require fewer comparisons than balanced binary search trees thanks to better access locality. Larger values of k yield tighter bounds, and B-trees outperform AVL trees for k >= 8, a practical advantage.

B-trees are found to require fewer comparisons than balanced binary search trees, thanks to their better access locality. A perfectly balanced binary search tree achieves the lower bound of log2 n comparisons in the worst case, but maintaining perfect balance under updates takes O(n) time. Practical balanced binary search trees therefore settle for slightly worse bounds: AVL trees guarantee at most 1.44 log2 n comparisons and Red-Black trees at most 2 log2 n. For a B-tree of degree k, the comparison bound depends on k: as k grows, the bound approaches log2 n, and B-trees outperform AVL trees for k >= 8. Practical B-trees often use large k values, such as 100, which gives tight bounds in addition to better cache behavior than binary search trees. In short, larger k values yield tighter bounds, making B-trees behave increasingly like sorted arrays and improving their performance in practice.
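The bounds above can be sanity-checked numerically. This is a hedged back-of-envelope sketch, not the article's exact derivation: it assumes every node of a degree-k B-tree is full, so the height is about log_k(n) and each node costs about ceil(log2 k) comparisons when its keys are binary-searched.

```python
import math

# Rough worst-case comparison bounds (a simplified model, not the
# article's exact analysis).
def avl_bound(n: int) -> float:
    # AVL trees guarantee at most ~1.44 * log2(n) comparisons.
    return 1.44 * math.log2(n)

def btree_bound(n: int, k: int) -> float:
    # Assumed model: height ~ log_k(n), with ~ceil(log2(k)) comparisons
    # per node when keys inside a node are binary-searched.
    return math.ceil(math.log2(k)) * math.log(n, k)

n = 1_000_000
for k in (2, 8, 100):
    print(f"k={k:3d}: ~{btree_bound(n, k):.1f} comparisons")
print(f"AVL:   ~{avl_bound(n):.1f} comparisons")
```

Under this model, k = 8 already matches log2 n comparisons, consistent with the claim that B-trees beat AVL trees from k >= 8 on.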

Related

Flambda2 Ep. 2: Loopifying Tail-Recursive Functions

Flambda2's Episode 2 explores Loopify, an optimization algorithm for tail-recursive functions in OCaml. It transforms recursion into loops, enhancing memory efficiency without compromising functional programming principles.

How does a computer/calculator compute logarithms?

Computers and calculators compute logarithms using geometric series and polynomial equations from calculus. The natural logarithm's relation to $\frac{1}{1+x}$ is explained through areas under the curve. Polynomial series derived by integrating the geometric series compute natural logarithms, but converge quickly only for $|x| < \frac{1}{2}$.

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

Identifying Leap Years (2020)

David Turner explores optimizing leap year calculations for performance gains by using bitwise operations and integer bounds. He presents efficient methods, mathematical proofs, and considerations for signed integers, highlighting limitations pre-Gregorian calendar.

Optimizing the Roc parser/compiler with data-oriented design

The blog post explores optimizing a parser/compiler with data-oriented design (DoD), comparing array-of-structs and struct-of-arrays layouts for better memory efficiency and cache utilization. Restructuring data in the Roc compiler yields measurable performance gains.

9 comments
By @shagie - 4 months
Long ago, I was taking the sophomore level data structures and algorithms class. It was being transitioned from C to C++ and being taught by Professor DeWitt. I have vague memories of linked lists and then binary trees. ... And then we got to the spot where the syllabus from previous years had AVL trees and red black trees... and he scoffed at teaching it.

And that I do remember. I suspect he had been doing lecture planning over the weekend (this was in the "we're redoing the class, teach it as it needs to be taught" type mandate) and he had us skip that section of the text. The first lecture he was going off of overhead transparencies that he was writing as he lectured. The second lecture he gave us photocopies of hand-written notes and diagrams.

Instead of AVL and red black trees, we instead learned about 2-3 trees and B trees.

For what it's worth, I still know how to do 2-3 trees from scratch.

Later I looked at AVL trees... and they still give me a headache.

By @kazinator - 4 months
Where B-trees can win is by taking advantage of each node's keys being sorted in an array. This allows a binary search to be used within the node.

A B-tree of any branching factor (not only 2) is equivalent to a binary tree.

We can take any B-tree node n:

           n
           |
  ---------+-----------
  |   |    |   |   |  |
  a   b    c   d   e  f
and regard it as an unbalanced tree

    n
    |
   / \
  a  /\
     b/\
      c/\
       d/\
        e/\
         f  nil
When searching for an item, if we linearly search every node of the B-tree, like from a to f, it is as if we are traversing the unbalanced tree (except for constant-time instruction and cache-level efficiencies).

What the unbalanced tree cannot do is binary search through a to f to find which one to descend into; it doesn't have those nodes in a neat array for that.

That binary search makes a B-tree more or less equivalent to a very well-balanced binary tree.
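As a concrete sketch of that point (a hypothetical node layout, not any particular library's): descending a B-tree binary-searches each node's sorted key array, so each node costs O(log k) comparisons instead of the O(k) a linear scan would.

```python
from bisect import bisect_left

# A hypothetical B-tree node: a sorted key array plus child pointers.
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # sorted list of keys
        self.children = children  # None for a leaf, else len(keys)+1 subtrees

def search(node, key):
    while node is not None:
        i = bisect_left(node.keys, key)  # binary search within the node
        if i < len(node.keys) and node.keys[i] == key:
            return True
        node = node.children[i] if node.children else None
    return False

# Small example: root [10, 20] with three leaf children.
tree = Node([10, 20], [Node([1, 5]), Node([12, 15]), Node([25, 30])])
print(search(tree, 15), search(tree, 7))  # True False
```

Replacing `bisect_left` with a linear scan over `node.keys` would reproduce the "unbalanced tree" traversal described above.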

By @cperciva - 4 months
In a sense, red-black and 2-3 trees are just a special case of B-trees with small nodes. Smaller B-tree nodes have (by a constant factor) more expensive searching than larger nodes; but smaller nodes are more efficient for mutating operations. And indeed, if you have an insert/delete heavy workload you can't beat something like a red-black tree.
By @masklinn - 4 months
> Practical B-trees often use fairly large values of k (e.g., 100) and therefore offer tight bounds -- in addition to being more cache-friendly than binary search trees.

That is true on-disk, where you would not use BSTs anyway. In-memory, mid to high single digit values are practical. For instance iirc Rust’s btree (map and set) uses something like k=6.

By @shin_lao - 4 months
B-trees are a good fit if your values are small, but as the value size grows, they lose their advantage.

A binary tree is a good default choice for a sorted structure.

Underappreciated structure: sorted arrays (but they come with big caveats).

By @rtheunissen - 4 months
That does not mean that B-trees are unequivocally better than binary search trees. There are some applications, like concurrency or persistence, where comparison count is not as important.

For instance, fewer values per node means less information to copy when copying the node. Locking is more granular with fewer values per node because fewer values are locked at a time.

The binary structure also makes it possible to support randomized balancing and self-adjusting strategies like splay trees.

I am disappointed to still see frequent mention of red-black trees – I see no reason for red-black trees to ever be the best choice in practice, or in theory, or in school. LBST's (logarithmic binary search trees) [1] are generally simpler, more intuitive, and more efficient [2] than red-black trees. They also support positional access inherently, so the balancing information is useful (subtree size).

[1] https://www.semanticscholar.org/paper/A-New-Method-for-Balan... [2] https://rtheunissen.github.io/bst

By @orlp - 4 months
One thing to keep in mind is that keeping the keys sorted on an insert requires O(k) element moves, so an insert is O(log2(n) + k) operations. So if you make k very large, your inserts get slower.
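A quick sketch of that cost (hypothetical node contents; Python's `bisect.insort` does exactly this find-then-shift):

```python
from bisect import insort

# Inserting into a node's sorted key array: a binary search finds the
# slot, but up to k existing keys must shift right to make room, giving
# the O(log2(n) + k) insert cost described above.
node_keys = [10, 20, 30, 40]  # hypothetical node holding sorted keys
insort(node_keys, 25)         # shifts 30 and 40 one slot to the right
print(node_keys)              # [10, 20, 25, 30, 40]
```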
By @exabrial - 4 months
I think either is a good choice when you know nothing about the data set.

After n grows large enough, I think you’d want to benchmark your -exact- dataset on your -exact- prod hardware. Some datasets may cause certain trees to perform badly. Some hardware is much faster at memory lookups and cache misses are not as painful.

For instance, developing locally on Intel and deploying to Arm could yield pretty wild performance differences!

By @nayuki - 4 months
Blogspot supports HTTPS. Please always use it when linking to the site.