Debugging an evil Go runtime bug: From heat guns to kernel compiler flags
Crashes in node_exporter on the author's laptop prompted memory testing, which found a single bad RAM bit. The finding underlines the importance of ECC RAM for server reliability. The author marked the bad RAM block as unusable with a GRUB 2 feature and heated the RAM to test how it behaved under stress.
The author describes encountering a series of crashes in node_exporter, a Go-based monitoring tool, on their laptop. Suspecting a hardware issue, they ran extensive tests, which revealed a single bad bit in the RAM that caused occasional errors and got worse as the temperature rose. Although this fault was unlikely to be the root cause of the crashes, it highlighted the importance of ECC RAM for long-term reliability in servers. The author opted to mark the bad RAM block and avoid using it, leveraging a GRUB 2 feature. They also experimented with heating the RAM to observe its behavior under stress. The article underscores the significance of hardware reliability in maintaining system stability, especially in critical environments like servers.
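The summary does not name the GRUB 2 feature; it is presumably the `badram` command (exposed as `GRUB_BADRAM` in `/etc/default/grub`), which filters out memory described by address/mask pairs. As a rough sketch under that assumption, the Go snippet below derives such a pair for the 4 KiB page containing a faulty physical address. The address `0x7a3b1c48`, the page size, and the helper name `badramPattern` are all made up for illustration and do not come from the article.

```go
package main

import "fmt"

// pageSize is an assumption: the bad region is masked out at 4 KiB granularity.
const pageSize = 4096

// badramPattern returns an "addr,mask" pair covering the page that contains
// the faulty physical address. Bits set in the mask must match the base
// address for a page to be filtered out.
func badramPattern(faulty uint64) string {
	mask := ^uint64(pageSize - 1) // clear the low 12 bits (offset within page)
	base := faulty & mask         // page-aligned start of the bad page
	return fmt.Sprintf("0x%016x,0x%016x", base, mask)
}

func main() {
	// Hypothetical faulty address, e.g. as reported by a memory tester.
	fmt.Printf("GRUB_BADRAM=%q\n", badramPattern(0x7a3b1c48))
}
```

The printed line would go into `/etc/default/grub`, followed by regenerating the GRUB configuration. A wider mask (fewer set bits) excludes a larger region around the bad bit at the cost of more lost memory.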
Related
The weirdest QNX bug I've ever encountered
The author encountered a CPU usage bug in a QNX system's 'ps' utility due to a 15-year-old bug. Debugging revealed a race condition, leading to code modifications and a shift towards open-source solutions.
Game dev accuses Intel of selling 'defective' Raptor Lake CPUs
Alderon Games criticizes Intel's 13th and 14th-gen Core CPUs for stability issues, crashes, and memory corruption, particularly affecting Raptor Lake models like Core i9-13900K and Core i9-14900K. Despite Intel's attempts to fix with updates, Alderon switches to AMD due to fewer crashes. Intel investigates the issues.
Complaints about crashing 13th,14th Gen Intel CPUs now have data to back them up
Complaints arise over crashing issues on 13th and 14th Gen Intel CPUs, prompting MMO developer Alderon Games to switch servers to AMD due to persistent instability. Reports vary on the extent of affected processors.
Intel's CPUs Are Failing, Ft. Wendell of Level1 Techs [video]
Issues with Linux causing crashes, CPU problems with GPU vram limits, and Intel processors experiencing crashes are highlighted in a video sponsored by NZXT. Concerns arise over Intel CPUs' stability for game servers.
Dev reports Intel's laptop CPUs are also suffering from crashing issues
Dev reports Intel laptop CPUs facing crashing issues, extending to 13th and 14th-Gen processors. Instability persists despite attempted fixes, impacting flagship Core i9 HX series. Reports suggest widespread degradation, raising concerns for users.
- Several commenters shared personal experiences with debugging complex system issues, highlighting the frustration and dedication required.
- There was admiration for the author's technical skills and dedication, with some feeling inadequate in comparison.
- Technical insights were shared, such as using GRUB to ignore bad memory blocks and the challenges of small stack sizes in certain environments.
- Links to related resources and previous discussions were provided for further reading.
- Some comments expressed curiosity about specific technical details and sought clarifications.
Flashbacks to a job where I was asked to figure out why a newer kernel was crashing. This was a very frustrating time, because I had (have) basically zero real C/C++ experience but I'd helped out with Bitbake recipes and everyone else was busy or moved to other projects.
To cut a multiweek tale of dozens of recompilations short: The kernel was fine. The headless custom hardware was fine. The problem was a hypervisor misconfiguration that overwrote part of the kernel address space. All of our kernels had been corrupt, but this was the first one where the layout meant it mattered.
A month of frustration, two characters to fix, the highest ratio I've encountered so far.
My reward for struggling through a complex problem I was unqualified for? "Great, now we need to backport security patches from the main Linux kernel to the SoC vendor's custom fork..."
He works on Asahi Linux, a Linux port to arm64 Apple hardware.
The explanations are also very clear. Thanks for posting.
On Linux the 8192K aren't reserved unless they are actually used, so what is the point?
Ok, Golang will allocate green threads from its own allocator, but 104 bytes?!