Debugging Hetzner: Uncovering failures with powerstat, sensors, and dmidecode
Ubicloud encountered significant reliability issues with Hetzner's AX162 servers, which crashed more frequently than the AX161. After troubleshooting and motherboard replacements, newer versions improved stability, highlighting the value of cautious technology adoption.
Ubicloud's experience with Hetzner's AX162 server line revealed significant reliability issues, with these servers being 16 times more likely to crash than their predecessors, the AX161. After purchasing the AX162 servers, Ubicloud faced multiple crashes, prompting extensive debugging efforts. Initial investigations ruled out system load and temperature as causes, leading to the hypothesis of power consumption limits affecting hardware reliability. Using tools like powerstat, they measured power consumption and found discrepancies between advertised and actual usage. The Annualized Failure Rate (AFR) analysis confirmed the AX162's high failure rate. After reporting their findings to Hetzner, the company identified defects in a batch of motherboards and recommended replacements. Although new motherboards initially showed improved stability, crashes resumed after two weeks. Eventually, a newer motherboard version resolved the issues, resulting in a lower AFR than the AX161 servers. Ubicloud's experience underscores the importance of cautious adoption of new technology, advocating for thorough vetting and gradual implementation of new hardware to mitigate risks.
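As a rough illustration of how a power figure can be derived, the sketch below turns two readings of a cumulative energy counter into average watts. It is not Ubicloud's method and not how powerstat is implemented. It assumes a Linux machine that exposes a counter through the powercap sysfs tree; the `intel-rapl:0` path is an assumption and may be absent or root-only on a given server. The function names are hypothetical.

```python
import time
from pathlib import Path

# Assumed location of a cumulative energy counter in microjoules.
# This directory may not exist, or may need root, on a given machine.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(e_start_uj, e_end_uj, seconds, max_range_uj=None):
    """Average power in watts between two cumulative energy readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two readings.
        if max_range_uj is None:
            raise ValueError("counter wrapped but its range is unknown")
        delta_uj += max_range_uj
    # Microjoules to joules, then joules per second.
    return delta_uj / 1_000_000 / seconds


def sample_watts(interval=1.0, rapl_dir=RAPL_DIR):
    """Read the counter twice, `interval` seconds apart, and return watts."""
    energy_file = rapl_dir / "energy_uj"
    max_range_uj = int((rapl_dir / "max_energy_range_uj").read_text())
    start_uj = int(energy_file.read_text())
    t0 = time.monotonic()
    time.sleep(interval)
    end_uj = int(energy_file.read_text())
    elapsed = time.monotonic() - t0
    return average_watts(start_uj, end_uj, elapsed, max_range_uj)
```

Calling `sample_watts()` repeatedly under load and comparing the result with the advertised figure is one way to look for the kind of discrepancy described above. A counter like this covers only part of the system, so it would understate what the whole server draws at the wall.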
- Hetzner's AX162 servers experienced a significantly higher crash rate compared to the AX161 model.
- Initial investigations ruled out load and temperature as causes of server crashes.
- Power consumption limits were suspected to contribute to hardware reliability issues.
- Replacement of motherboards improved stability, but further issues arose until newer versions were implemented.
- Ubicloud plans to adopt new hardware more cautiously in the future to avoid similar problems.
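The AFR comparison mentioned in the summary can be sketched with one common definition: failures per server-year of operation. The summary does not give the exact method or the underlying counts, so the function name, the choice to count every crash as a failure, and any numbers passed in are assumptions for illustration only.

```python
def annualized_failure_rate(failures, server_days):
    """Failures per server-year, given total failures and total server-days observed."""
    if server_days <= 0:
        raise ValueError("server_days must be positive")
    return failures * 365.0 / server_days


def relative_rate(afr_new, afr_old):
    """How many times higher the first AFR is than the second."""
    if afr_old <= 0:
        raise ValueError("afr_old must be positive")
    return afr_new / afr_old
```

With such a helper, two server models observed over different fleet sizes and time spans can be compared on the same scale, which is what a statement like "16 times more likely to crash" relies on.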
Related
Debugging an evil Go runtime bug: From heat guns to kernel compiler flags
Encountered crashes in node_exporter on laptop traced to single bad RAM bit. Importance of ECC RAM for server reliability emphasized. Bad RAM block marked, GRUB 2 feature used. Heating RAM tested for stress behavior.
Hetzner Pricing
Hetzner Online GmbH controls its data centers and builds its own servers, optimizing costs and performance. Its automated processes and transparent pricing benefit long-term customers, offering flexibility in service cancellation.
Who are AMD, Intel's new manycore monster CPUs for?
Intel and AMD are launching high-core-count CPUs for server consolidation, but organizations should evaluate risks, costs, and disaster recovery capabilities before adoption, as hyperscale providers are better equipped for manycore systems.
AWS and Azure Are at Least 4x–10x More Expensive Than Hetzner
Hetzner offers virtual machine instances 4 to 10 times cheaper than AWS and Azure, promoting self-hosting solutions to reduce costs and improve performance, as companies reassess their cloud provider choices.
Hetzner raises prices while significantly lowering bandwidth (US)
Hetzner will raise US cloud service prices by up to 27.52% and reduce bandwidth by 88.19% starting December 1, 2024, with changes reflected in invoices from February 1, 2025.
- Multiple users report reliability issues with various Hetzner AX models, particularly related to faulty motherboards.
- Some commenters share personal experiences with hardware failures and the need for replacements, emphasizing the importance of monitoring and troubleshooting.
- There is a discussion about the implications of power capping on hardware longevity, with differing opinions on its effects.
- Several users suggest that waiting for newer versions of technology can help avoid early adoption issues.
- Concerns are raised about the overall reliability of Hetzner's services, questioning whether they still deserve the label of "reliable."
[0] https://docs.hetzner.com/robot/dedicated-server/general-info...
This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.
What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?
Hetzner's lack of response here (and Ubicloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?
We recently retired them because we had worn down everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever because electromigration eroded a trace inside the RAID processor is a sobering experience.
Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.
The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.
Motherboard issues around power/signaling are a pain to diagnose. They emerge as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the MB...
> In the days that followed, the crash frequency increased.
I don't find the article conclusive on whether they would still call them reliable.
This was something I hadn't heard before, & a surprise to me.
I think it would be amusing if it turned out that, for the servers not showing the problem, they had simply raised the power limits up to the base that was originally advertised.