February 19th, 2025

Debugging Hetzner: Uncovering failures with powerstat, sensors, and dmidecode

Ubicloud encountered significant reliability issues with Hetzner's AX162 servers, which crashed more frequently than the AX161. After troubleshooting and motherboard replacements, a newer motherboard version improved stability. The episode highlights the value of cautious technology adoption.

Ubicloud's experience with Hetzner's AX162 server line revealed significant reliability issues: these servers were 16 times more likely to crash than their predecessors, the AX161. After purchasing the AX162 servers, Ubicloud saw multiple crashes, which prompted an extensive debugging effort. Initial investigation ruled out system load and temperature as causes, which led to the hypothesis that power consumption limits were affecting hardware reliability. Measurements with tools such as powerstat showed a discrepancy between advertised and actual power usage, and an Annualized Failure Rate (AFR) analysis confirmed the AX162's high failure rate. After Ubicloud reported its findings, Hetzner identified defects in a batch of motherboards and recommended replacements. The new motherboards initially appeared more stable, but crashes resumed after two weeks. Eventually, a newer motherboard version resolved the issues and brought the AFR below that of the AX161 servers. Ubicloud's experience underscores the value of adopting new technology cautiously, with thorough vetting and gradual rollout of new hardware to mitigate risk.
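
The summary does not say how the AFR was computed. As a rough illustration, the sketch below uses the common definition of AFR as failures per server-year of operation. The function name and all numbers are hypothetical and are not taken from the article.

```python
def annualized_failure_rate(failures: int, server_days: float) -> float:
    """Return AFR as a fraction: failures per server-year of operation.

    Assumes the common definition (failures divided by server-years);
    the article may have used a different formula.
    """
    if server_days <= 0:
        raise ValueError("server_days must be positive")
    server_years = server_days / 365.0
    return failures / server_years


if __name__ == "__main__":
    # Hypothetical fleets, for illustration only (not data from the article).
    older_fleet = annualized_failure_rate(failures=3, server_days=30 * 365)
    newer_fleet = annualized_failure_rate(failures=4, server_days=10 * 365)
    print(f"older fleet AFR: {older_fleet:.1%}")
    print(f"newer fleet AFR: {newer_fleet:.1%}")
    print(f"ratio: {newer_fleet / older_fleet:.1f}x")
```

Normalizing by server-days rather than server count matters when the fleets being compared have been in service for different lengths of time.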

- Hetzner's AX162 servers experienced a significantly higher crash rate compared to the AX161 model.

- Initial investigations ruled out load and temperature as causes of server crashes.

- Power consumption limits were suspected to contribute to hardware reliability issues.

- Replacement of motherboards improved stability, but further issues arose until newer versions were implemented.

- Ubicloud plans to adopt new hardware more cautiously in the future to avoid similar problems.
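
For readers who want to check power draw themselves, here is a minimal sketch that estimates average CPU package power from the Linux powercap energy counters. It is not how powerstat works internally and not the article's method. The zone path is an assumption that may be absent on a given machine, reading `energy_uj` usually requires root, and the result covers the CPU package only, not whole-server power.

```python
import time
from pathlib import Path


def average_watts(start_uj: int, end_uj: int, seconds: float, max_range_uj: int) -> float:
    """Average power from two cumulative energy readings in microjoules."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped around between the two readings.
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / seconds


def sample_package_power(zone: Path, interval: float = 1.0) -> float:
    """Read a powercap zone twice and return the average watts in between."""
    max_range_uj = int((zone / "max_energy_range_uj").read_text())
    start_uj = int((zone / "energy_uj").read_text())
    t0 = time.monotonic()
    time.sleep(interval)
    end_uj = int((zone / "energy_uj").read_text())
    elapsed = time.monotonic() - t0
    return average_watts(start_uj, end_uj, elapsed, max_range_uj)


if __name__ == "__main__":
    # Assumed zone path; it differs between systems and may not exist.
    zone = Path("/sys/class/powercap/intel-rapl:0")
    try:
        print(f"package power: {sample_package_power(zone):.1f} W")
    except OSError as exc:
        print(f"could not read powercap counters: {exc}")
```

Comparing a reading taken under sustained full load against the advertised figure is one way to look for the kind of discrepancy the summary describes, though a low reading alone does not prove that a cap is in place.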

AI: What people are saying
The comments reflect a range of experiences and concerns regarding Hetzner's server reliability and related issues.
  • Multiple users report reliability issues with various Hetzner AX models, particularly related to faulty motherboards.
  • Some commenters share personal experiences with hardware failures and the need for replacements, emphasizing the importance of monitoring and troubleshooting.
  • There is a discussion about the implications of power capping on hardware longevity, with differing opinions on its effects.
  • Several users suggest that waiting for newer versions of technology can help avoid early adoption issues.
  • Concerns are raised about the overall reliability of Hetzner's services, questioning whether they still deserve the label of "reliable."
19 comments
By @nik736 - 2 days
Most other AX models (AX42, AX52 and AX102) also have serious reliability issues, where they will fail after some months. They are based on a faulty motherboard. Hetzner has to replace most, if not all, motherboards for servers built before a certain date over the next 12 months [0]

[0] https://docs.hetzner.com/robot/dedicated-server/general-info...

By @jonatron - 2 days
At a previous company, devops would regularly find CPU fan failures on Hetzner. That's in addition to the usual expected HD/SSD failures. You've got to do your own monitoring; it's one of the reasons why unmanaged servers are cheaper than cloud instances.
By @V__ - 2 days
> Looking back, waiting six months could have helped us avoid many issues. Early adopters usually find problems that get fixed later.

This is really good advice and what I'm following for all systems which need to be stable. If there aren't any security issues, I either wait a few months or keep one or two versions behind.

By @andai - 2 days
> Hetzner didn’t confirm or deny the possibility of power limiting

What are the consequences of power limiting? The article says it can cause hardware to degrade more quickly, why?

Hetzner's lack of response here (and Ubicloud's measurements) seems to suggest they are indeed limiting power, since if they weren't doing it, they'd say so, right?

By @bayindirh - 2 days
Dell has this problem sometimes. I remember getting the first batch of one of their older servers when they were new. We had to replace the motherboards' I/O (rear) section because the servers lost some devices on that part (e.g., Ethernet controllers, iDRAC, sometimes BIOS) for some time. After shaking out these problems, they ran for almost a decade.

We recently retired them because we had worn down everything on these servers, from RAID cards to power regulators. Rebooting a perfectly running server due to a configuration change and losing the RAID card forever because electromigration eroded a trace inside the RAID processor is a sobering experience.

By @vitus - 2 days
> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

Can anyone elaborate on this point? This is counter to my intuition (and in fact, what I saw upon a cursory search), which is that power capping should prolong the useful lifetime of various components.

The only search results I found that claimed otherwise were indicating that if you're running into thermal throttling, then higher operating temperatures can cause components (e.g. capacitors) to degrade faster. But that's expressly not the case in the article, which looked at various temperature sensors.

By @chronid - 2 days
We will never know, but I wonder if it could be a power/signaling or VRM issue - the CPU not getting hot doesn't mean nothing else on the board has gone out of spec and into catastrophic failure.

Motherboard issues around power/signaling are a pain to diagnose. They emerge as all sorts of problems apparently related to other components (RAM failing to initialize and random restarts are very common in my experience), and you end up swapping everything before actually replacing the MB...

By @rikafurude21 - 2 days
A similar thing happened to an AX102 I currently use: something related to the network card caused crashes. Thankfully Hetzner support was helpful with replacement hardware. It caused quite some grief, but at least it was a good lesson in hardware troubleshooting. Worth it to me personally.
By @urbandw311er - 2 days
Would anybody with data center experience be able to hazard a guess on what type of commercial resolution Hetzner would have reached with the motherboard supplier here? Would we assume all mobos replaced free of charge plus compensation?
By @scottcha - 2 days
I’d like to see what CPU governor is running on those systems before assuming a power cap is in place. Lots of default installs of Linux ship with the powersave governor running, which is going to limit your max frequencies and, through that, the max power you can hit.
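
As a rough illustration of the check suggested in this comment, the sketch below tallies the active cpufreq governor across cores by reading the standard Linux sysfs files. The function name is hypothetical, and the files are absent on systems without a cpufreq driver, in which case the tally is empty.

```python
from collections import Counter
from pathlib import Path


def governor_counts(cpu_root: Path = Path("/sys/devices/system/cpu")) -> Counter:
    """Count how many CPU cores are using each cpufreq scaling governor."""
    counts: Counter = Counter()
    for gov_file in cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[gov_file.read_text().strip()] += 1
    return counts


if __name__ == "__main__":
    counts = governor_counts()
    if not counts:
        print("no cpufreq governor information found")
    for governor, cores in counts.most_common():
        print(f"{governor}: {cores} core(s)")
```

If most cores report `powersave`, that is worth ruling out before attributing lower-than-advertised power draw to a provider-side cap.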
By @vednig - 2 days
As a CI/CD provider, wouldn't Ubicloud benefit from having their own servers?
By @wink - 2 days
> One of the providers we like is Hetzner because of their affordable and reliable servers.

> In the days that followed, the crash frequency increased.

I don't find the article conclusive on whether they would still call them reliable.

By @dangoodmanUT - 2 days
Is there a provider that's like bare metal, but would detect these kinds of things mostly automatically? E.g. faulty or constantly crashing hardware.
By @jauntywundrkind - 2 days
> To increase the number of machines under power constraints, data center operators usually cap power use per machine. However, this can cause motherboards to degrade more quickly.

This was something I hadn't heard before, & a surprise to me.

By @gtirloni - 2 days
Anyone got experience with Ubicloud's OpenStack stack?
By @trod1234 - 1 day
It would have been nice if they linked to the power metrics for the new servers.

I think it would be amusing if it turns out they just raised the power limits for the servers not showing the problem, up to the base that was originally advertised.

By @indulona - 2 days
i am so glad my sign up process with hetzner failed when i was so dumb that i wanted to give them a chance even with the internet full of horrific stories of bad experiences from their customers. lucky me.