July 18th, 2024

Hash-Based Bisect Debugging in Compilers and Runtimes

Hash-Based Bisect Debugging applies binary search to debugging itself: by bisecting over data or version history, it efficiently pinpoints bugs introduced by code changes or optimizations.

This article introduces Hash-Based Bisect Debugging in compilers and runtimes: using binary search not only to find items in a sorted list, but to pinpoint problems in code changes and optimizations by bisecting over data or version history. It gives examples of binary search applied to debugging, such as identifying a bad card in a deck of punched cards or tracking down a bug across a program's version history. It also traces the evolution of tools like BitKeeper and Git, which made binary search over version history routine, and concludes with a detailed example of using Git's bisect feature to identify the specific commit that introduced a bug in a codebase.
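
To make the titular technique concrete, here is a minimal sketch in Go of the core idea (an illustration only, not the actual golang.org/x/tools bisect implementation; all identifiers below are made up): each optional change site is keyed by a stable identifier such as file:line, the identifier is hashed, and the change is enabled only when the hash ends in a given bit suffix, so an external driver can binary-search over suffixes to isolate the culprit site.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashID hashes a stable identifier for a change site, e.g. "sort.go:10".
func hashID(id string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(id))
	return h.Sum64()
}

// enabled reports whether the change at id should be turned on for this run:
// it is on only when the low `bits` bits of the hash equal `suffix`. A driver
// narrows suffix/bits run by run, halving the set of enabled sites each time,
// until a single culprit site remains.
func enabled(id string, suffix uint64, bits uint) bool {
	mask := uint64(1)<<bits - 1
	return hashID(id)&mask == suffix&mask
}

func main() {
	sites := []string{"sort.go:10", "map.go:77", "timer.go:5", "gc.go:301"}
	for bits := uint(0); bits <= 2; bits++ {
		for suffix := uint64(0); suffix < 1<<bits; suffix++ {
			var on []string
			for _, s := range sites {
				if enabled(s, suffix, bits) {
					on = append(on, s)
				}
			}
			fmt.Printf("bits=%d suffix=%b -> enabled: %v\n", bits, suffix, on)
		}
	}
}
```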

Related

Mix-testing: revealing a new class of compiler bugs

A new "mix testing" approach uncovers compiler bugs by compiling test fragments with different compilers. Examples show issues in x86 and Arm architectures, emphasizing the importance of maintaining instruction ordering. Luke Geeson developed a tool to explore compiler combinations, identifying bugs and highlighting the need for clearer guidelines.

How to implement a hash table in C (2021)

This article explains how to implement a hash table in C, covering linear and binary search, hash table design, a simple hash function, collision handling, resizing, and API design, with code snippets and a link to a GitHub repository.

Using Git bisect to find bugs in MySQL code base

Troubleshooting MySQL crashes or regressions can be complex. Using git bisect to pinpoint the commit that introduced a bug helps developers identify and address regressions efficiently.

Boosting Compiler Testing by Injecting Real-World Code

The research introduces a method to enhance compiler testing by using real-world code snippets to create diverse test programs. The approach, implemented in the Creal tool, identified and reported 132 bugs in GCC and LLVM.

Binary Search Tree with SIMD

Clément Jean presents a cache-friendly algorithm for binary search trees in Go, optimizing memory access with SIMD operations. The approach enhances search efficiency, demonstrated through benchmark tests, and suggests broader applications in data structures.

AI: What people are saying
The comments on the article about Hash-Based Bisect Debugging discuss various aspects and applications of the technique:
  • Several commenters highlight the use of bisecting techniques in different contexts, such as debugging optimization passes, maintaining demo compatibility, and troubleshooting large binaries.
  • Some comments mention advanced methods like Bayesian inference to handle noisy benchmarks and low-probability flakes, enhancing the robustness of the debugging process.
  • There are references to specific tools and scripts, such as LLVM's OptBisect and scripts for bisecting functions in assembly, which aid in pinpointing issues within code transformations.
  • Commenters also draw parallels to other debugging techniques, such as using coverage reports to understand test paths and the concept of "optimization fuel" to isolate problematic rewrites.
  • One commenter expresses curiosity about potential non-compiler use cases for the bisecting technique, indicating interest in broader applications.
13 comments
By @Scaevolus - 3 months
Aside: bisecting flakes doesn't have to involve repeated runs. You can reformulate bisection as an information probing operation, expanding the scope to support noisy benchmarks or low-probability flakes. Bayesian inference narrows down the probable range of the failure for each new observation, and you can choose new probes to maximize information gain-- or even run them in parallel to minimize the total time.

You do have to provide a flake-rate probability to do the probability estimates, but even roughly correct rates work fine. Running bisects assuming a 5% chance of a false positive or negative barely adds steps and greatly improves robustness.

The math is pretty simple too-- my old prototype might still be in Google's monorepo; I should reimplement it for the open source world.
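
As a rough illustration of this idea (my own sketch, not Scaevolus's prototype; the revision count, flake rates, and observation budget are made-up numbers): keep a posterior over which revision first went bad, update it after each noisy test run using the assumed false-positive/false-negative rates, and probe the posterior median, i.e. the revision whose "is it bad?" probability is closest to 0.5.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Noisy bisection over revisions 0..n-1, looking for the first bad revision.
// pFailIfBad is the chance a truly bad revision shows the failure on one run;
// pFailIfGood is the chance of a false positive on a good revision.
const (
	n           = 1024
	pFailIfBad  = 0.6
	pFailIfGood = 0.05
)

// observe simulates one test run at revision i given the true first-bad index.
func observe(rng *rand.Rand, i, trueFirstBad int) bool {
	p := pFailIfGood
	if i >= trueFirstBad {
		p = pFailIfBad
	}
	return rng.Float64() < p // true = observed a failure
}

func main() {
	rng := rand.New(rand.NewSource(1))
	trueFirstBad := 700

	// Uniform prior over which revision is the first bad one.
	post := make([]float64, n)
	for j := range post {
		post[j] = 1.0 / n
	}

	for step := 0; step < 60; step++ {
		// Probe the posterior median: the revision i where P(i is bad) ~ 0.5.
		// (Choosing the probe that maximizes expected information gain is the
		// fuller version; the median is a common, simpler approximation.)
		probe, cum := 0, 0.0
		for j := 0; j < n; j++ {
			cum += post[j]
			if cum >= 0.5 {
				probe = j
				break
			}
		}

		failed := observe(rng, probe, trueFirstBad)

		// Bayes update: multiply by P(observation | first bad revision is j).
		total := 0.0
		for j := range post {
			pFail := pFailIfGood
			if probe >= j {
				pFail = pFailIfBad
			}
			like := pFail
			if !failed {
				like = 1 - pFail
			}
			post[j] *= like
			total += post[j]
		}
		for j := range post {
			post[j] /= total
		}
	}

	// Report the maximum a posteriori estimate.
	best := 0
	for j := range post {
		if post[j] > post[best] {
			best = j
		}
	}
	fmt.Printf("estimated first bad revision: %d (p=%.2f), true: %d\n", best, post[best], trueFirstBad)
}
```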

By @tabbott - 3 months
This is so cool!

It reminds me a bit of one of my favorite debugging techniques: Running an individual test with coverage reporting enabled, and then clicking around the HTML coverage report to see exactly what code path the test followed, without needing to use either print statements or a specialized debugger.

Very helpful for answering questions of the form "Why does this test not exercise the code I thought it did?".

By @drivebycomment - 3 months
In my former life, I used to maintain a script that could be given two sets of object files, one compiled with optimization and one without; the script would effectively do a binary search, choosing which object files to link and running the executable to determine success or failure. Each iteration was quick, since the linking step is usually fast. This was useful when troubleshooting a big binary, since an optimized build back then was often quite slow for a large executable.
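
A sketch of that search in Go (illustrative only: it assumes exactly one optimized object file triggers the failure, and linkAndRun is a made-up stand-in for the real link-and-test step):

```go
package main

import "fmt"

// findBadObject binary-searches for the single object file that, when taken
// from the optimized build, makes the linked executable fail. linkAndRun is
// assumed to link the given mix of optimized/unoptimized objects, run the
// result, and report whether the run failed.
func findBadObject(objs []string, linkAndRun func(optimized map[string]bool) bool) string {
	lo, hi := 0, len(objs) // invariant: the culprit is in objs[lo:hi]
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		// Use optimized objects for the first half of the suspect range,
		// unoptimized objects for everything else.
		optimized := map[string]bool{}
		for _, o := range objs[lo:mid] {
			optimized[o] = true
		}
		if linkAndRun(optimized) {
			hi = mid // failure reproduced: culprit is in the optimized half
		} else {
			lo = mid // culprit must be in the other half
		}
	}
	return objs[lo]
}

func main() {
	objs := []string{"a.o", "b.o", "c.o", "d.o", "e.o", "f.o"}
	// Fake build+run step for illustration: pretend "d.o" miscompiles.
	fails := func(optimized map[string]bool) bool { return optimized["d.o"] }
	fmt.Println("culprit:", findBadObject(objs, fails))
}
```
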
By @tekknolagi - 3 months
This is very cool and seems similar to what we do/did in Cinder: https://bernsteinbear.com/blog/cinder-jit-bisect/

EDIT: Oops, this tool is mentioned in the post already (but this write-up about it is not, so here it is if you want to read about it). Neat!

By @carry_bit - 3 months
A closely related technique for debugging optimization passes is that of "optimization fuel". Each rewrite decreases the fuel by one, and when the fuel is gone no more rewrites happen. You can then perform binary search on the optimization fuel to find a specific rewrite instance that breaks things.
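
A minimal sketch of the fuel idea (illustrative only, not any particular compiler's mechanism; pass, findBadRewrite, and broken are invented names): the pass stops rewriting once the budget is spent, and a driver binary-searches for the smallest budget that still reproduces the bug, which names the offending rewrite.

```go
package main

import "fmt"

// pass applies rewrites in order but stops once fuel runs out, returning how
// many rewrites it actually performed.
func pass(rewrites []func(), fuel int) int {
	done := 0
	for _, rw := range rewrites {
		if fuel == 0 {
			break
		}
		rw()
		fuel--
		done++
	}
	return done
}

// findBadRewrite finds the smallest fuel budget whose output is broken,
// i.e. the (1-based) index of the rewrite that introduces the bug.
// broken(fuel) is assumed to rebuild with that budget and test the result.
func findBadRewrite(numRewrites int, broken func(fuel int) bool) int {
	lo, hi := 0, numRewrites // broken(lo) == false, broken(hi) == true
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		if broken(mid) {
			hi = mid
		} else {
			lo = mid
		}
	}
	return hi
}

func main() {
	// Fake example: rewrite #42 out of 100 is the one that breaks things.
	broken := func(fuel int) bool { return fuel >= 42 }
	fmt.Println("first bad rewrite:", findBadRewrite(100, broken))
}
```
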
By @chc4 - 3 months
When I read https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/ and saw the hash bisection trick for the first time I was super impressed, it really does sound incredibly slick :) I imagine that's how first coming across normal git bisect must feel to new engineers.
By @saagarjha - 3 months
I'm running one of these now, interestingly enough. Apparently something's broken in OpenJDK if you build with the new Xcode. So I'm bisecting on all the files (choosing either the old or new compiler) trying to see which one is breaking things.
By @mirrorlake - 3 months
In the event that this is added to the standard library, I'm going to be really curious to see what a "hello world" project/example would look like.

I went so far as to find the commit where David Chase added for loopvar on Mar 6, 2023 (github: golang/go/commit/c20d959) to try to design my own hello world with x/tools/cmd/bisect, but I'm out of my depth.

The hash tree is a great visualization. I wouldn't have grasped the importance of the hash suffix until I saw the tree. Awesome stuff.

By @MatzeBraun - 3 months
LLVM can bisect down to individual transformation steps within a pass (for passes that implement the interface): https://llvm.org/docs/OptBisect.html

And there is a script that bisects functions, given assembly produced by a working baseline compiler and a “bad” compiler to compare: https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...

By @camgunz - 3 months
This is pretty close to what I built for maintaining demo compatibility in Doom engines. Basically it runs a demo and dumps a save game to a file every frame. As soon as there's a divergence it says what the difference is (monster 17 is at (4, 22); should be (4, 21)) and bails. Not a ton of difference between that and diffing the stack.

https://github.com/camgunz/democomp
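
A toy sketch of that frame-lockstep comparison (not democomp itself; the state struct and the injected drift bug are invented for illustration): step a reference and a port in lockstep, snapshot a little state each frame, and stop at the first divergence.

```go
package main

import "fmt"

// state is a stand-in for a per-frame savegame snapshot.
type state struct {
	monsterX, monsterY int
}

// step advances a fake simulation by one frame; drift models a port whose
// physics diverge after frame 100.
func step(s state, frame int, drift bool) state {
	s.monsterX++
	s.monsterY += 2
	if drift && frame > 100 {
		s.monsterY++ // the bug: one extra unit of movement per frame
	}
	return s
}

func main() {
	ref, port := state{}, state{}
	for frame := 0; frame < 1000; frame++ {
		ref = step(ref, frame, false)
		port = step(port, frame, true)
		if ref != port {
			fmt.Printf("divergence at frame %d: got %+v, want %+v\n", frame, port, ref)
			return
		}
	}
	fmt.Println("demo stayed in sync")
}
```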

By @millipede - 3 months
Are there any non-compiler use cases for this technique?
By @kragen - 3 months
this is a wonderful post!

the algorithms for using binary search to efficiently reduce a set satisfying some predicate to a locally minimal satisfying subset* are new to me (though cox says zeller published a slightly buggy version in 01999! and meta's cinder a correct one in 02021), and seem brilliant; their applications are not limited to debugging. i wonder how it relates to hypothesis's test-case reduction algorithm; can one of them be considered an application of the other?
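
For concreteness, here is a simplified sketch of that kind of set reduction in Go (my own paraphrase, not the exact algorithm from the post, and it glosses over the local-minimality corner cases the post is careful about): bisect the set into halves, and when neither half suffices on its own, reduce each half while forcing the other, already-reduced half to stay included.

```go
package main

import "fmt"

// concat returns a fresh slice containing a followed by b.
func concat(a, b []int) []int {
	out := make([]int, 0, len(a)+len(b))
	out = append(out, a...)
	return append(out, b...)
}

// reduce returns a small subset of items such that fails(concat(forced, subset))
// is still true. It assumes fails(concat(forced, items)) is true on entry.
func reduce(fails func([]int) bool, forced, items []int) []int {
	if len(items) <= 1 {
		return items
	}
	mid := len(items) / 2
	left, right := items[:mid], items[mid:]
	if fails(concat(forced, left)) {
		return reduce(fails, forced, left)
	}
	if fails(concat(forced, right)) {
		return reduce(fails, forced, right)
	}
	// Both halves are needed: reduce the right half with the left forced in,
	// then reduce the left half with the reduced right forced in.
	minRight := reduce(fails, concat(forced, left), right)
	minLeft := reduce(fails, concat(forced, minRight), left)
	return concat(minLeft, minRight)
}

func main() {
	// Toy predicate: the failure needs both 3 and 7 to be present.
	fails := func(s []int) bool {
		has := map[int]bool{}
		for _, x := range s {
			has[x] = true
		}
		return has[3] && has[7]
	}
	items := []int{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(reduce(fails, nil, items)) // prints [3 7]
}
```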

also, this idea of binary-search debugging over the program's call tree rather than your revision history (or program input, or a set of data instances) is a brilliant one. and although they published it a decade ago, i hadn't heard about it until now

the examples of asynctimerchan=1, changing optimization settings, and changing sort algorithms have in common that in some sense they are behavior-preserving, so you can toggle them on and off at will during execution without breaking anything. i wonder how to apply this call-tree debugging if the change you're trying to narrow down is a change that has to be consistent throughout the program's execution. for example, suppose some code using your hash tables breaks when you switch to a new hash function, maybe because it inadvertently depended on enumeration order. if you change the hash function partway through the program, you won't be able to find things in your hash tables after that. you could change the algorithm per table, of course, and narrow it down to a particular table, but that won't give you the particular line of code

i need to think a bit more about this issue of 'hashing a list of program counters'. you could of course number the sequence of all subroutine invocations during a (deterministic! single-threaded!) execution, as gas does for macro invocations, and binary-search that dense numbering. (this is a variant of the technique carry_bit is calling 'optimization fuel', but one that requires support from a compiler or debugger.) but, since you're toggling options on and off that will change the number of subroutine calls, the numbering won't be stable; so this will tend to only reliably find single-culprit failures

you could possibly get a stable-enough numbering using pathnames like /3/5/1, meaning the 1st subroutine called from the 5th subroutine called from the 3rd subroutine called from main(). that seems like it might in some sense be stabler than hashing the entire list of return addresses, and it would certainly permit a lower-overhead implementation using a debugger and breakpoints rather than a check in every leaf call. plausibly i'm overlooking a flaw in this form of 'sequential numbering'? does the hashed list get truncated at some point for stability?
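
To make the "/3/5/1" pathname idea in the preceding paragraph concrete, here is one way such names could be generated with explicit, single-threaded enter/exit instrumentation (purely illustrative; this is not what the Go tool does, and pathNamer is an invented type):

```go
package main

import (
	"fmt"
	"strings"
)

// pathNamer assigns each dynamic call a name like "/3/5/1": the 1st call made
// by the 5th call made by the 3rd call made from the top level. Counters for a
// level are dropped whenever its parent returns, so sibling subtrees restart
// their numbering at 1.
type pathNamer struct {
	counts []int // counts[d] = calls seen so far at depth d under the current parent
	path   []int
}

func (p *pathNamer) enter() string {
	d := len(p.path)
	if d == len(p.counts) {
		p.counts = append(p.counts, 0)
	}
	p.counts[d]++
	p.path = append(p.path, p.counts[d])
	var b strings.Builder
	for _, n := range p.path {
		fmt.Fprintf(&b, "/%d", n)
	}
	return b.String()
}

func (p *pathNamer) exit() {
	p.path = p.path[:len(p.path)-1]
	p.counts = p.counts[:len(p.path)+1] // drop counters below the current depth
}

func main() {
	var p pathNamer
	fmt.Println(p.enter()) // /1      first top-level call
	fmt.Println(p.enter()) // /1/1    its first child
	p.exit()
	fmt.Println(p.enter()) // /1/2    its second child
	p.exit()
	p.exit()
	fmt.Println(p.enter()) // /2      second top-level call
	fmt.Println(p.enter()) // /2/1    numbering restarts under the new parent
}
```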

often when you have a change that is in some sense behavior-preserving, which is to say, you have two ways to do the same thing, you can use generative testing systems like hypothesis to detect bugs in either of them: process the same input through both paths and verify that the results are equivalent in the appropriate sense. this doesn't require the instrumentation infrastructure russ is using here, but it does depend on you being able to identify the relevant 'input', which can be very hard
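
as a tiny illustration of that "two ways to do the same thing" check (a hand-rolled sketch, not hypothesis and not Go's native fuzzing; the insertion sort and the random-input generator are invented here): generate random inputs, run both paths, and compare the results in the appropriate sense, here plain equality against the standard library's sort.

```go
package main

import (
	"fmt"
	"math/rand"
	"reflect"
	"sort"
)

// insertionSort is the "new" implementation under test.
func insertionSort(a []int) []int {
	out := append([]int(nil), a...)
	for i := 1; i < len(out); i++ {
		for j := i; j > 0 && out[j-1] > out[j]; j-- {
			out[j-1], out[j] = out[j], out[j-1]
		}
	}
	return out
}

func main() {
	rng := rand.New(rand.NewSource(42))
	for trial := 0; trial < 1000; trial++ {
		// Generate a random input.
		in := make([]int, rng.Intn(20))
		for i := range in {
			in[i] = rng.Intn(10)
		}

		// Run both paths: the trusted reference and the new implementation.
		want := append([]int(nil), in...)
		sort.Ints(want)
		got := insertionSort(in)

		// Check equivalence in the appropriate sense -- exact equality here.
		if !reflect.DeepEqual(got, want) {
			fmt.Printf("mismatch on input %v: got %v, want %v\n", in, got, want)
			return
		}
	}
	fmt.Println("no mismatches found")
}
```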

in itself that doesn't help with the kinds of bugs he's talking about here, though: bugs where both the old and new code is 'equivalent' by your lights, but some other client code that calls it doesn't find it equivalent. this suggests a different generative-testing approach: generatively inject behavioral perturbations which don't violate equivalence, attempting to provoke failures in client code. aslr and hash-table seed randomization are doing this for us for some purposes, but unlike generative-testing frameworks, they provoke outages in production, don't do test-case minimization, and don't record failing cases to make bisection easy and prevent later regressions. and they don't do things like shuffling the input to a non-stable sort subroutine

binary-search debugging does indeed feel magical. scaevolus seems to be saying there's a bayesian generalization of it for nondeterministic bugs that are effectively random? you can of course run the test 5 (or 1000) times on each revision you're binary-searching over, but it feels like, if the number of revisions you're searching over is several thousand, you ought to be able to get some additional advantage out of running the test once on each of 5 (or 1000) revisions. can you solve this just by maximizing the expected shannon information of each test?

on a side note, it's pretty appalling that 30 years ago the plan9 group had `yesterday -d -n 7 anyfilename` to see what changed in the last week, thanks to their optical jukebox, while in the mainstream we still struggle with accidental file deletion and overwrites despite routinely carrying around terabytes in our pockets

on an even more marginally relevant note, earlier this week i was perusing the 7th edition unix kernel, in which the subroutine that switches stacks (the one with the well-known comment in 6th edition) is called swtch(). and tonight i just realized why russ cox uses that domain name

______

* conventionally this is just called a 'minimal satisfying subset', because it's 'minimal' in the partial-order sense, but i think cox's term 'locally minimal' is clearer