Thoughts on Debugging
The article emphasizes reproducing issues as the first step in debugging, along with effective communication, logging for real-time analysis, and clear documentation of who did what; it also warns junior engineers that favors do not exist in business.
Read original article
The article discusses the author's experiences and insights on debugging in software development. The author enjoys tackling complex debugging situations, emphasizing the importance of reproducing issues as the primary step in the debugging process. They highlight that debugging requires a blend of technical skills, communication, and a methodical approach to problem-solving. The author notes that many issues can often be traced back to simple configuration errors or misunderstandings of functionality. They advocate for asynchronous coordination among team members, while also recognizing the value of collaborative problem-solving when necessary. Logging is identified as a crucial tool in debugging, allowing teams to monitor and analyze issues in real-time. The author stresses the need for clear documentation of the debugging process to ensure proper recognition of contributions and to prevent misrepresentation of roles. They conclude with a cautionary note for junior engineers about the nature of business relationships, stating that favors do not exist in a professional context.
- Debugging requires reproducing issues as the first step.
- Effective communication and collaboration are essential in debugging.
- Logging is a critical tool for monitoring and analyzing issues.
- Clear documentation of the debugging process is important for recognition.
- Junior engineers should be aware that favors do not exist in business.
Related
How to build highly-debuggable C++ binaries
David Hashe's article offers guidance on configuring C++ toolchains for better debuggability, emphasizing practices like enabling sanitizers, using debug modes, and balancing performance with debuggability in large projects.
You've only added two lines – why did that take two days
The article highlights that in software development, the number of lines of code does not reflect effort. Effective bug fixing requires thorough investigation, understanding context, and proper testing to prevent recurring issues.
Systemic Software Debugging (2012)
Systemic Software Debugging introduces complex debugging issues beyond traditional methods, created for Sony Mobile engineers. It spans 150 pages, is freely available under CC-BY-3.0, and welcomes feedback.
Software is about people, not code (2020)
Software development prioritizes understanding human needs over coding skills. Successful projects depend on user engagement, collaboration, and communication to ensure solutions effectively address real-world problems.
What 10k Hours of Coding Taught Me: Don't Ship Fast
The article emphasizes the importance of developer experience and software architecture, advocating for simplicity in coding, prioritizing refactoring, and maintaining code quality through structured practices and passion for the craft.
Here’s my one simple rule for debugging:
Reproduce the issue.
Unless it's one of those cursed things installed at the customer thousands of miles away that never happens back in the lab. Some things can be incredibly hard to debug and can depend on the craziest factors you'd never even consider possible, like a thunderstorm causing voltage spikes that subtly damage the equipment, leading to failures months later. Sometimes that "software bug" turns out to be hardware in weird ways. Or issues like https://web.mit.edu/jemorris/humor/500-miles – everyone who debugs weird issues should read that.
Once you can actually reproduce the issue, you've often done 80-99+% of the work already.
> Reproduce the issue.
> . . .
> I’m not sure that I’ve ever worked on a hard problem . . .
I agree, the author has probably not worked on hard problems.
There are many situations where a) reproducing the problem is extremely elusive, b) reproduction is easy but the cause is especially elusive, or c) organizational issues prevent access to the system, to source code, or to people willing to investigate and/or fix the issue.
Some examples:
For A, the elusive reproduction, I saw an issue where we had an executive escalation that their laptop would always blue screen shortly after boot up. Nobody could reproduce this issue. Telemetry showed nobody else had this issue. Changing hardware didn't fix it. Only this executive had the anti-Midas touch to cause the issue. Turned out the executive lived on a large parcel, and had exactly one WiFi network visible. Some code parsing that list of WiFi APs had an off-by-one error which caused a BSOD. A large portion of wireless technology (Bluetooth/Thread/WiFi/cellular) bugs fall into this category.
For B, the easy to repro but still difficult, I've seen numerous bugs that cause stack corruption, random memory corruption, or trigger a hardware flaw that freezes or resets the system. These types of issues are terrible to debug, because either the logs aren't available (system comes down before the final moments), or because the culprit doesn't log anything and never harms themselves, only an innocent victim. Time-travel tracing is often the best solution, but is also often unavailable. Bisecting the code changes is sometimes little help in a busy codebase, since the culprit is often far away from their victims.
Category C is also pretty common if you are integrating systems. Vendors will have closed source and be unable or unwilling to admit even the possibility of fault, help with an investigation, or commit to a fix. Partners will have ship blocking bugs in hardware that they just can't show you or share with you, but it must nonetheless get fixed. You will often end up shipping workarounds for errors in code you don't control, or carefully instrumenting code to uncover the partner's issues.
Once you have done this, you are already over the hump. It's like being the first rider over the last mountain on a Tour de France stage: you've more or less won at that point.
I'm not sure I even consider it a challenge if the issue is easily reproduced. You will simply grind out the solution once you have the reproduction done.
The real bugs are the ones that you cannot replicate. The kind of thing that breaks once a week on 10 continuously running machines. You can't scale that system to 1000 or more with the bug around, you'll be swamped with reports. But you also can't fix it because the conditions to reproduce it are so elusive that your logs aren't useful. You don't know if all the errors have the same root cause.
Typically the kind of thing that creates a lot of paths to check is "manual multithreading": a bunch of threads, each locking and unlocking (or forgetting to do either) access to shared data in its own particular way. The number of interleavings where "this reads and then that writes" explodes quite fast with such code, and it explodes in a way that isn't obvious from reading the code. Sprinkling log output over such code can even change the frequency of the errors.
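A minimal sketch of the kind of race described above (not anyone's actual code): several threads doing an unsynchronized read-modify-write on shared state. Whether updates get lost depends on timing, which is exactly why adding log statements can change how often the bug fires.

```python
import threading

counter = 0  # shared state with no lock protecting it

def worker(iterations):
    global counter
    for _ in range(iterations):
        tmp = counter  # read
        tmp += 1       # modify
        counter = tmp  # write -- another thread may have written in between

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Four threads, 100k increments each: the "expected" total is 400000,
# but lost updates from interleaved read-modify-write usually leave it lower.
print(counter)
```

Guarding the read-modify-write with a `threading.Lock` makes the result deterministic again, but the broken version is the point: nothing in any single thread's code looks wrong, which is what makes these bugs so hard to spot by reading.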
If you catch yourself thinking "it's probably X", then you should try to prove yourself wrong, because if you are wrong, you are looking in the wrong place. And if you are struggling to understand why something is happening, you can safely assume that something you believe to be true is in fact not true. Invalidating that assumption is how you figure out why.
Assumptions can range from "there's a bug in a library we are using" to "the problem must have been introduced recently" or "the problem only happens when we do X". Most of these are fairly simple to test.
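The "introduced recently" assumption in particular can be tested mechanically rather than by staring at diffs. A hypothetical git bisect session might look like this, where `run_repro.sh` is a placeholder for whatever script reproduces the bug (exit 0 = good, non-zero = bad):

```shell
# Test the assumption that the bug is a recent regression by
# binary-searching the commit history for where it first appears.
git bisect start
git bisect bad HEAD            # the current commit exhibits the bug
git bisect good v1.4.0         # a release known (or assumed!) to be clean
git bisect run ./run_repro.sh  # git checks out commits and runs the script
git bisect reset               # return to the original checkout when done
```

Note that `git bisect good v1.4.0` itself encodes an assumption; if the "good" commit was never actually verified, the bisect will confidently point at the wrong commit.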
The other day I was debugging someone else's code that I inherited. I started by looking at the obvious place in the code and adding some logging, and I was getting nowhere. Then I decided to try to reproduce the problem in a place where that code was definitely not used, to challenge my assumption that the problem even was in that part of the code. I instantly reproduced the issue. I had wasted two hours staring at that code trying to understand it.
In the end, the issue was a weird bug that only showed up when using our software in the US (or, as it turns out, anywhere in the western hemisphere). The problem wasn't the functionality I was testing but everything that used negative coordinates.
Once I narrowed it down to a simple math problem with negative longitudes, I realized the cause was a missing call to abs where we were subtracting values (subtracting a negative value means you are adding it). That function was used in four different places; each of them was broken. Easy fix, and the problem went away. Being in Europe (only positive longitudes), we had just never properly tested that part of our software in the US. The bug had lurked there for over a year. Kind of embarrassing, really.
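A hypothetical reconstruction of that class of bug (the commenter's real code isn't shown, so the function names and formula are invented): a longitude difference computed as a plain subtraction happens to work while both values are positive, east of Greenwich, but subtracting a negative longitude means adding it, so western-hemisphere inputs come out with the wrong sign.

```python
def lon_delta_broken(a, b):
    # Hides the bug while a and b are both positive (European longitudes).
    return a - b

def lon_delta_fixed(a, b):
    # Magnitude of the difference, hemisphere-independent.
    return abs(a - b)

# European longitudes (both positive): the broken version looks plausible.
print(lon_delta_broken(13.4, 2.3))      # positive, seems fine
# US longitudes (both negative): the broken version returns a negative "distance".
print(lon_delta_broken(-122.4, -73.9))  # negative -- the lurking bug
print(lon_delta_fixed(-122.4, -73.9))   # correct magnitude
```

With only positive test coordinates, the two versions are indistinguishable in some call sites, which is how a bug like this can sit unnoticed for a year.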
Which is why randomizing your inputs in unit tests is important. We were testing with just one hard-coded coordinate. The fix included adding proper unit tests for the algorithm.
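A sketch of what such a randomized test could look like (hypothetical names, since the original tests aren't shown): instead of one hard-coded coordinate, sample longitudes across the full [-180, 180] range so the western hemisphere is actually exercised, and check properties that must hold for any input.

```python
import random

def lon_delta(a, b):
    # The fixed, hemisphere-independent version from the story above.
    return abs(a - b)

def test_lon_delta_randomized(trials=1000, seed=42):
    # Seed the generator so a failing input can be reproduced exactly.
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(-180.0, 180.0)
        b = rng.uniform(-180.0, 180.0)
        d = lon_delta(a, b)
        assert d >= 0.0, f"distance must be non-negative: {a}, {b}"
        assert d == lon_delta(b, a), f"distance must be symmetric: {a}, {b}"

test_lon_delta_randomized()
```

A single positive hard-coded coordinate would never have caught the missing abs; the non-negativity property fails almost immediately once negative longitudes are in the sample space. (Property-based testing libraries take this idea further, shrinking a failing random input to a minimal example.)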
One thing that makes me sad about the pervasive use of async/await-style programming is that it usually breaks the stack in a way that makes this technique a bit useless.
Really, you have a ”one-system” where you can see _ALL_ the logs? I don’t believe that. This whole software thing is abstractions everywhere, and we are probably using some abstraction somewhere that isn’t compatible with this fabled ”one-system”.
Often the most debugging takes place on the least observable systems.