Spending 3 months investigating a 7-year old bug and fixing it in 1 line of code
A developer fixed a seven-year-old bug in an iPad accessory causing missed MIDI messages by optimizing a modulo operation. The bug's resolution improved the audio processor's efficiency significantly.
The article recounts a developer's experience fixing a seven-year-old bug in a hardware accessory for the original iPad. The bug caused occasional missed MIDI messages, particularly affecting sustained instrument voices like a pipe organ. After three months of investigation, the developer identified a timing issue related to an inefficient 16-bit modulo operation on the 8-bit processor. By optimizing the modulo operation to three 8-bit modulos, the bug was resolved with a single-line code change, improving the audio processor's efficiency significantly. The bug had likely gone unnoticed for years due to most users using the device solely for audio or MIDI recording, where the issue did not manifest. The fix highlighted the importance of deep technical understanding in debugging and resolving complex software issues. The article concludes with a link to a video detailing the bug fix process.
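The summary does not show the actual one-line change, so the following is only a sketch of one way a 16-bit modulo can be split into three 8-bit ones. It assumes a small constant modulus m and relies on the identity (256*hi + lo) mod m = ((hi mod m) * (256 mod m) + (lo mod m)) mod m. The function name and the modulus 6 in the usage line are hypothetical, and Python is used here only to model the arithmetic, not the 8-bit firmware.

```python
def mod16_via_8bit(x: int, m: int) -> int:
    """Return x % m for a 16-bit x, using only modulos whose operands fit in 8 bits.

    Hypothetical helper for illustration; not the code from the article.
    """
    if not 0 <= x <= 0xFFFF:
        raise ValueError("x must fit in 16 bits")
    if not 0 < m <= 0xFF:
        raise ValueError("m must fit in 8 bits")
    r = 256 % m  # a constant for a fixed m, so it can be precomputed
    # The intermediate sum must itself fit in 8 bits, which limits m.
    if (m - 1) * r + (m - 1) > 0xFF:
        raise ValueError("m is too large for 8-bit intermediates")
    hi, lo = x >> 8, x & 0xFF
    acc = (hi % m) * r + (lo % m)  # 8-bit modulos one and two
    return acc % m                 # 8-bit modulo three


print(mod16_via_8bit(50000, 6))  # 2, same as 50000 % 6
```

On an 8-bit core without a hardware divider, a 16-bit `%` typically compiles to a slow library routine, which is why keeping every operand within one byte can matter for timing.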
Related
Agilent 2000a / 3000a Oscilloscope NAND Recovery
Anthony Kouttron salvaged a damaged Agilent oscilloscope, addressing physical and boot issues. He repaired an encoder, fixed cosmetic damages, and explored internal components, demonstrating technical prowess and troubleshooting skills.
Hacking eInk Price Tags (2021)
Hackers repurpose eInk electronic shelf labels (ESLs) into photo frames or status displays by customizing firmware. Detailed exploration of hacking challenges, including Marvell chip analysis, bootloader functions, memory storage, communication protocols, and debugging methods.
The First Spatial Computing Hack
Ryan Pickren found a Safari bug letting websites flood a user's space with 3D objects. Apple fixed it (CVE-2024-27812) in June after Ryan's report. The bug exploited Apple AR Kit Quick Look, launching objects without consent.
Vulnerability in Popular PC and Server Firmware
Eclypsium found a critical vulnerability (CVE-2024-0762) in Intel Core processors' Phoenix SecureCore UEFI firmware, potentially enabling privilege escalation and persistent attacks. Lenovo issued BIOS updates, emphasizing the significance of supply chain security.
I found an 8 years old bug in Xorg
An 8-year-old Xorg bug related to epoll misuse was found by a picom developer. The bug caused windows to disappear during server lock, traced to CloseDownClient events. Despite limited impact, the developer seeks alternative window tree updates, emphasizing testing and debugging tools.
> Knowing very little about USB audio processing, but having cut my teeth in college on 8-bit 8051 processors, I knew what kind of functions tended to be slow. I did a Ctrl+F for “%” and found a 16-bit modulo right in the audio processing code.
That feeling of saving days of work because you remember a clue from previous experience is so good.
Did I win? Of course not; it’s hard for non-technical people to fully appreciate these things and any sort of larger infrastructure work, especially for developer productivity, because it comes back to, well, how are you going to measure that ROI?
Anyways, this was fun to read and brought back good engineering memories. I’d also like to say, as it brought back a bug I chased forever, fuck you ChannelFactory in C#.
I reran the test locally -- no failure. I changed the seed to the one used in the failing run -- nothing.
I add a loop to the code to repeat the test a hundred times -- still nothing. I run the test in a bash loop a hundred times -- 3 failures. So this already hints at some internal problem. I fixed every possible source of randomness and verified that all the inputs are identical between the runs -- it still fails only once in a while. I started building an MWE, but the function involved in the reproduction is fairly complicated. I'm left with a hundred lines of JAX code that fails in a couple of percent of cases.
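The contrast between the in-process loop (no failures) and the bash loop (3 failures in 100) suggests the nondeterminism is tied to per-process state. A minimal sketch of that kind of fresh-process harness, written in Python instead of bash; `count_failures` and the stand-in command are hypothetical, not the commenter's script.

```python
import subprocess
import sys


def count_failures(cmd: list[str], runs: int) -> int:
    """Run cmd in a fresh process `runs` times and count non-zero exits."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Stand-in for the real test command, e.g. a pytest invocation.
    always_passes = [sys.executable, "-c", "raise SystemExit(0)"]
    print(count_failures(always_passes, runs=5))  # 0
```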
I look at the output of the compiler, and it is identical between failing and successful runs. So the problem is in the compiled code. The compiled code is ~1000 lines of HLO (not much better than assembly). Unfortunately, HLO tooling is both unfamiliar to me and not well suited to this case (or at least I couldn't figure it out). So I start manually bisecting the code. I'm finally left with ~30 lines of HLO. It fails even less often (1% maybe), but at least it runs fast. It also seems to fail in exactly the same way (i.e., there is a single incorrect output that I've seen across 3 failures). Now that's something maintainers can be expected to look at.
It turned out that matrices with the same content but different layouts were deduplicated, leading, in my case, to a transposed matrix being replaced by a non-transposed one. The hash used for storage did take layout into account, so the bug appeared only if the two entries ended up in the same bucket (~3% of the time). The fix was an obvious one-liner [1].
[1] https://github.com/openxla/xla/commit/76e7353599d914546f9b30...
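To make the failure mode concrete, here is a toy model of a bucketed deduplication table where the hash is layout-aware but the equality check is not. It is a sketch under those assumptions, not XLA's code: `Matrix`, `DedupTable`, both equality functions, and the explicit hash values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Matrix:
    values: tuple  # flattened contents
    layout: str    # e.g. "row_major" or "col_major"


def same_content(a: Matrix, b: Matrix) -> bool:
    """Buggy equality: ignores layout."""
    return a.values == b.values


def same_content_and_layout(a: Matrix, b: Matrix) -> bool:
    """Fixed equality: layout is part of identity."""
    return a.values == b.values and a.layout == b.layout


class DedupTable:
    """Tiny bucketed table; callers pass the (layout-aware) hash explicitly."""

    def __init__(self, num_buckets, equal):
        self.buckets = [[] for _ in range(num_buckets)]
        self.equal = equal

    def intern(self, item, item_hash):
        bucket = self.buckets[item_hash % len(self.buckets)]
        for existing in bucket:
            if self.equal(existing, item):
                return existing  # treated as a duplicate
        bucket.append(item)
        return item


a = Matrix((1, 2, 3, 4), "row_major")
b = Matrix((1, 2, 3, 4), "col_major")  # same contents, different layout

hidden = DedupTable(4, same_content)
hidden.intern(a, item_hash=5)
print(hidden.intern(b, item_hash=6) is b)   # True: different buckets, bug stays hidden

exposed = DedupTable(4, same_content)
exposed.intern(a, item_hash=5)
print(exposed.intern(b, item_hash=9) is a)  # True: 9 % 4 == 5 % 4, so b is replaced by a

fixed = DedupTable(4, same_content_and_layout)
fixed.intern(a, item_hash=5)
print(fixed.intern(b, item_hash=9) is b)    # True: layout-aware equality keeps b
```

Because the two entries usually hash to different buckets, the faulty comparison is only reached on a collision, which matches the low, seemingly random failure rate described above.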
I even wrote an 8051 assembler in C, but found a good tiny-C compiler for it before it went into production.
You are not a programmer unless you’ve written key-debounce code :)
(OTOH, some of the worst programmers I’ve ever had the displeasure of working with were amazing low-level code hackers. In olden times, it seems like you were either good at that level of abstraction, or you were good at a much different [“higher”] level, seldom both.)
"This product was so old in fact that nobody knew how to compile the source code."
I think you mean "Management was so bad, nobody knew how to compile the source code".
There are plenty of systems out there that can and plenty that cannot be reproduced from source. The biggest difference is the care taken to do so, not the age.
Find dusty Perl script forgotten for years. Still works
Not the first time I've heard that
(IIRC UI scrolled twice for every mouse movement + you couldn't select items in server browser with mouse wheel as it would skip every other one)
If yamux's keepalive fails/times out, and you're calling Read on a demuxed stream, it blocks forever.
We spent over three months on it before finding a root cause. It was over two months before we could even understand how to measure it - we were seeing parts of the automated overnight test suite run taking longer, but every night it would be different tests that were slow. A key finding was that almost everything was slow on some boots of the device and fast on other boots of the device, and there was a reboot before each test was run. Doing some manual testing showed it being close to a 50% chance of a boot leading to slowness. Now what?
I eventually got frustrated and took the brute force / mindless approach... binary search over commits. Unfortunately, that wasn't easy because our build was 45-60 minutes, and then there was a heavily manual installation process that took 10-20 minutes, followed by several reboots to see if anything was slow. And there were several thousand commits since the last known good build (the previously shipped version of the device). The build/install/testing process was not easily automated, and we were not on git, otherwise using git-bisect would have been nice. Instead, I spent weeks doing the binary search manually.
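The manual search described here is an ordinary binary search over an ordered commit list, with the twist that a bad build only shows the slowness on roughly half of its boots. A minimal sketch, assuming the history flips from good to bad exactly once; `first_bad`, `looks_bad`, and the `boot_is_slow` callback are hypothetical names standing in for the build, install, and reboot steps.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad is true.

    Assumes the history flips from good to bad exactly once and that the
    last commit is known to be bad.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo]


def looks_bad(commit, boot_is_slow, boots=8):
    """Treat a build as bad if any of `boots` reboots is slow.

    With a ~50% chance of a slow boot on a bad build, a bad build passes
    all `boots` checks by luck with probability about 0.5 ** boots.
    """
    return any(boot_is_slow(commit) for _ in range(boots))


# 3000 commits, regression introduced at commit 1234 (made-up numbers).
print(first_bad(list(range(3000)), lambda c: c >= 1234))  # 1234
```

For a few thousand commits this needs only about a dozen builds, but with a build of roughly an hour, a manual install, and several reboots per probe, each step can still consume most of a working day.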
That yielded the offending commit. The problem was that it was a massive commit (tens of thousands of lines of code) from a group in another part of the company. It was a snapshot of all of their development over the course of a couple of years. The commit message, and the authors, stated that the commit was a no-op with everything behind a disabled feature flag.
So now it was on to a code-level binary search. Keep deleting about half of the code in the commit, in this case by chunks that are intended to be inactive. After eventually deleting all the inactive code, there were still a few dozen lines of changes in a Linux subsystem that did window compositing. Those lines of code were all quite interdependent, so it was hard to delete much and keep things functional, so now it was on to walking through the code. At least I could use my brain again!
Using the clue that the problem was happening about half the time and given that this code was in C, I started looking for uninitialized booleans. Sure enough, there was one called something like `enable_transparency`. Disabled code was setting it to `true`, but nothing was setting it to `false` when their system was disabled. Before their commit, there was no variable - `false` was being passed into the initializer call directly. Adding `= false` to the declaration was the fix.
So, well over a year of engineering hours spent to figure out the issue. The upside is that some people on the team didn't know how to proceed, so they spent their time speeding up random things that were slow. So the device ended up being noticeably faster when we were done. But it was pretty stressful as we were closing in on our launch date with little visibility into whether we'd figure it out or not.
Love to see it. That place needs more organic growth.
"Almost every bug turns out to be a 1 that should be 0, or a 0 that should be 1"
Keeping this in mind often keeps one focused on the detail of the underlying binary values and how they are being manipulated.
Anybody know what's the exact transformation here? I searched around and found this answer, but it doesn't work:
> I can still recall the cacophony of what amounted to an elephant on cocaine slamming on a keyboard for hours on end.