July 19th, 2024

Debugging an evil Go runtime bug: From heat guns to kernel compiler flags

The author encountered crashes in node_exporter on a laptop and, while investigating, found a single bad RAM bit. The finding underscores the importance of ECC RAM for server reliability. The author marked the bad RAM block as unusable with a GRUB 2 feature and heated the RAM to test its behavior under stress.

The author describes a series of crashes in node_exporter, a Go-based monitoring tool, on their laptop. Suspecting a hardware issue, they ran extensive memory tests, which revealed a single bad bit in the RAM that caused occasional errors and worsened with temperature. Although this fault was unlikely to be the root cause of the crashes, it highlighted the importance of ECC RAM for long-term reliability in servers. The author chose to mark the bad RAM block so that it would not be used, relying on a GRUB 2 feature. They also experimented with heating the RAM to observe its behavior under stress. The article underscores how much system stability depends on hardware reliability, especially in critical environments like servers.
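
As a rough illustration of how a bad memory block can be described to a bootloader: GRUB 2 documents a `badram` command that takes address,mask pairs, where a page is filtered out when every address bit selected by the mask matches the base address. The sketch below only models that matching rule in Python. The faulty address, the 4 KiB page size, and the helper names are assumptions made for illustration, not values or code from the article.

```python
# Minimal model of address/mask matching for bad-RAM filtering.
# Assumptions: 4 KiB pages, 64-bit addresses, and a made-up faulty address.

PAGE_SIZE = 4096


def pattern_for_page(bad_addr, page_size=PAGE_SIZE, addr_bits=64):
    """Return (base, mask) selecting exactly the page that contains bad_addr."""
    # Keep every address bit except the offset-within-page bits.
    mask = ((1 << addr_bits) - 1) & ~(page_size - 1)
    base = bad_addr & mask
    return base, mask


def page_is_filtered(page_addr, base, mask):
    """True if all bits enabled by mask agree between page_addr and base."""
    return (page_addr & mask) == (base & mask)


bad_addr = 0x12345678  # hypothetical faulty byte address
base, mask = pattern_for_page(bad_addr)

# Print the pair in the address,mask form the command expects.
print(f"badram {base:#x},{mask:#x}")

# The page containing the bad address matches; the next page does not.
print(page_is_filtered(0x12345000, base, mask))
print(page_is_filtered(0x12346000, base, mask))
```

A mask with fewer bits set would match a larger, possibly non-contiguous set of pages, which is how a single pair can cover a fault that recurs at a regular stride.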

Related

The weirdest QNX bug I've ever encountered

The author encountered a CPU usage problem in a QNX system's 'ps' utility caused by a 15-year-old bug. Debugging revealed a race condition, leading to code modifications and a shift towards open-source solutions.

Game dev accuses Intel of selling 'defective' Raptor Lake CPUs

Alderon Games criticizes Intel's 13th and 14th-gen Core CPUs for stability issues, crashes, and memory corruption, particularly affecting Raptor Lake models like the Core i9-13900K and Core i9-14900K. Despite Intel's attempts to fix the problems with updates, Alderon is switching to AMD, citing fewer crashes. Intel is investigating the issues.

Complaints about crashing 13th,14th Gen Intel CPUs now have data to back them up

Complaints arise over crashing issues on 13th and 14th Gen Intel CPUs, prompting MMO developer Alderon Games to switch servers to AMD due to persistent instability. Reports vary on the extent of affected processors.

Intel's CPUs Are Failing, Ft. Wendell of Level1 Techs [video]

Issues with Linux causing crashes, CPU problems with GPU vram limits, and Intel processors experiencing crashes are highlighted in a video sponsored by NZXT. Concerns arise over Intel CPUs' stability for game servers.

Dev reports Intel's laptop CPUs are also suffering from crashing issues

Dev reports Intel laptop CPUs facing crashing issues, extending to 13th and 14th-Gen processors. Instability persists despite attempted fixes, impacting flagship Core i9 HX series. Reports suggest widespread degradation, raising concerns for users.

AI: What people are saying
The article on node_exporter crashes due to bad RAM and the importance of ECC RAM sparked various discussions.
  • Several commenters shared personal experiences with debugging complex system issues, highlighting the frustration and dedication required.
  • There was admiration for the author's technical skills and dedication, with some feeling inadequate in comparison.
  • Technical insights were shared, such as using GRUB to ignore bad memory blocks and the challenges of small stack sizes in certain environments.
  • Links to related resources and previous discussions were provided for further reading.
  • Some comments expressed curiosity about specific technical details and sought clarifications.
12 comments
By @Terr_ - 4 months
> Over the course of 22 kernel builds, I managed to simplify the config so much that the kernel had no networking support, no filesystems, no block device core, and didn’t even support PCI (still works fine on a VM though!).

Flashbacks to a job where I was asked to figure out why a newer kernel was crashing. This was a very frustrating time, because I had (have) basically zero real C/C++ experience but I'd helped out with Bitbake recipes and everyone else was busy or moved to other projects.

To cut a multiweek tale of dozens of recompilations short: The kernel was fine. The headless custom hardware was fine. The problem was a hypervisor misconfiguration, overwriting part of the kernel address space. All of our kernels had been corrupt, but this was the first one where the layout meant it mattered.

A month of frustration, two characters to fix, the highest ratio I've encountered so far.

My reward for struggling through a complex problem I was unqualified for? "Great, now we need to backport security patches from the main Linux kernel to the SoC vendor's custom fork..."

By @mseepgood - 4 months
By @BobbyJo - 4 months
This is honestly wild. 99% of devs would have found a workaround and moved on. Going so far as to create a multi-kernel test bench to narrow down the source of the instability is a level of dedication I have not personally seen, and I respect it.
By @ncruces - 4 months
Related (for the hash based bisecting): https://research.swtch.com/bisect
By @wolf550e - 4 months
You can follow Hector Martin @marcan at https://social.treehouse.systems/@marcan/

He works on Asahi Linux, a Linux port to arm64 Apple hardware.

By @umvi - 4 months
I'm pretty confident in my computer abilities, but when I read stuff like this I feel like I have no skills at all compared to this guy. Like I'm still a high school athlete and he's an Olympian (also he was only 26 when he wrote the article).
By @Agingcoder - 4 months
This is very elegant. I’ve had my share of nasty system bugs (compilers and kernels), but the dedication and the speed with which he went through it is quite remarkable.

The explanations are also very clear. Thanks for posting.

By @Thaxll - 4 months
One thing I learned from that post back then is that you can instruct Grub to ignore some part of your physical memory. Really nice trick, not sure this is doable on Windows / Mac?
By @im3w1l - 4 months
I feel like I still didn't fully understand what's going on here. Is the following correct? "Threads have a 'canonical' stack that the OS auto-grows for you as you use more of it. But you can also create your own stack by putting any value you want in RSP. This is what the Go program did, and the vDSO, assuming it ran on an auto-growing stack, tried to probe it, which led to corruption."
By @fulafel - 4 months
I guess Gentoo Hardened meant the difficulty level in this case.
By @Hajcnga - 4 months
Could someone explain why Google is obsessed with small stack sizes? The musl library also has an extremely small thread stack size, which makes many applications crash that are used to 8192K on Linux.

On Linux the 8192K aren't reserved unless they are actually used, so what is the point?

Ok, Golang will allocate green threads from its own allocator, but 104 bytes?!