July 21st, 2024

Prometheus metrics save us from painful kernel debugging

The Prometheus host metrics system detected increasing slab memory usage after an Ubuntu 22.04 kernel upgrade. The cause turned out to be a kernel command line change that disabled AppArmor; reverting it averted out-of-memory crashes. The monitoring setup was crucial for resolving the issue quickly.

The author describes how their Prometheus host metrics system helped them identify a critical issue with increasing slab memory usage on Ubuntu 22.04 servers after a kernel upgrade. Despite no obvious memory-hogging processes, the metrics revealed a systemic problem affecting all servers. They traced the issue to a kernel command line change disabling AppArmor, causing a memory leak. By reverting the change and scheduling reboots, they averted potential widespread out-of-memory crashes. The author emphasizes the importance of their monitoring setup, which quickly pinpointed the problem and prevented misattributing it to the kernel upgrade. They highlight the significance of having both real-time and historical metrics in resolving the issue efficiently. The incident underscores the value of robust metrics systems in proactively addressing critical infrastructure issues.
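
To make the kind of check described above concrete, here is a minimal sketch (not the author's actual setup) that asks a Prometheus server for node_exporter's slab metric over the last day and reports how much it grew per host; the server address is a placeholder.

    # Minimal sketch: query Prometheus (address is an assumption) for node_exporter's
    # slab metric over the last 24 hours and report per-host growth.
    import time
    import requests

    PROM = "http://localhost:9090"        # hypothetical Prometheus address
    QUERY = "node_memory_Slab_bytes"      # exported by node_exporter from /proc/meminfo

    end = time.time()
    start = end - 24 * 3600
    resp = requests.get(f"{PROM}/api/v1/query_range",
                        params={"query": QUERY, "start": start, "end": end, "step": "15m"})
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        host = series["metric"].get("instance", "?")
        values = [float(v) for _, v in series["values"]]
        growth_mib = (values[-1] - values[0]) / 2**20
        print(f"{host}: slab memory changed by {growth_mib:+.1f} MiB over the last day")

Steady growth across every host, with no matching growth in any process, is the same systemic signature the author describes.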

5 comments
By @tass - 6 months
This awakened a memory from last year, when a colleague and I were trying to understand where an increase in Linux memory usage was coming from on machines that hadn't been rebooted in a while. We had been alerted to it by Prometheus metrics.

Even after all apps had been restarted, it persisted. It turned out to be a leak of slab memory allocations by a kernel module. That kernel module had since been updated, but all previous versions were still loaded by the kernel, so the leak persisted until the next reboot (see the sketch after this comment for one way to spot which caches are growing).

The leaky kernel module was CrowdStrike's Falcon sensor. It started a discussion about how engineering had no option but to run these things for the sake of security: there were no instances where it actually caught anything, but it had the potential to cause incidents and outages.
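
For anyone chasing a similar slab leak, one rough way to see which caches are involved is to read /proc/slabinfo directly; a minimal sketch, using the approximation that in-use bytes are roughly object count times object size:

    # Minimal sketch: list the largest slab caches from /proc/slabinfo
    # (reading it generally requires root; field layout per `man 5 slabinfo`).
    def top_slab_caches(n=10):
        with open("/proc/slabinfo") as f:
            lines = f.readlines()[2:]          # skip the version line and column header
        caches = []
        for line in lines:
            fields = line.split()
            name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
            caches.append((num_objs * objsize, name))  # rough bytes held by this cache
        for size, name in sorted(caches, reverse=True)[:n]:
            print(f"{name:30s} {size / 2**20:8.1f} MiB")

    top_slab_caches()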

By @spiffytech - 6 months
My team spent weeks using log-aggregated metrics to gradually figure out why servers' clocks would go out of whack.

It turned out Docker Swarm made undocumented† use of a UDP port that some VMware product also used, and once in a while they'd cross the streams.

We only figured it out because we put every system event we could find onto a Grafana graph and narrowed down which ones kept happening at the same time.

† I think? It's been a while, might have just been hard to find.
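
One way to get "every system event we could find" onto a Grafana graph, as described above, is to push each event as a Grafana annotation over its HTTP API; a minimal sketch, with the Grafana URL and API token as placeholders:

    # Minimal sketch: record a system event as a Grafana annotation so it can be
    # overlaid on dashboard graphs. URL and token are placeholders.
    import time
    import requests

    GRAFANA = "http://grafana.example.internal:3000"   # hypothetical Grafana address
    TOKEN = "REPLACE_ME"                               # a Grafana API token

    def annotate(text, tags):
        resp = requests.post(
            f"{GRAFANA}/api/annotations",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"time": int(time.time() * 1000),     # epoch milliseconds
                  "tags": tags,
                  "text": text},
        )
        resp.raise_for_status()

    annotate("time sync daemon stepped the system clock", ["clock", "timesync"])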

By @louwrentius - 6 months
As a side note, I'm into storage performance, and the node-exporter data is absolutely spot on. I performed storage benchmarks with FIO, and the metrics matched both the applied load and the OS-reported metrics (iostat) perfectly.

I actually made a Grafana dashboard[0] for it, but haven’t used this in a while myself.

[0]: https://grafana.com/grafana/dashboards/11801-i-o-statistics/
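
For anyone wanting to spot-check node_exporter's disk numbers against iostat during an fio run, a minimal sketch (Prometheus address assumed) that prints per-device throughput:

    # Minimal sketch: ask Prometheus (address is an assumption) for per-device
    # throughput as seen by node_exporter, for comparison with iostat output.
    import requests

    PROM = "http://localhost:9090"   # hypothetical Prometheus address
    for metric in ("node_disk_read_bytes_total", "node_disk_written_bytes_total"):
        query = f"rate({metric}[1m])"
        result = requests.get(f"{PROM}/api/v1/query",
                              params={"query": query}).json()["data"]["result"]
        for series in result:
            device = series["metric"].get("device", "?")
            mb_per_s = float(series["value"][1]) / 1e6
            print(f"{metric} {device}: {mb_per_s:.1f} MB/s")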

By @lordnacho - 6 months
Sounds like they have node_exporter on their machines. Not the JS runtime, but the Prometheus/Grafana piece that gives you all the meters for a generic system monitoring dashboard. Disk usage, CPU, memory: it's all set up already, just plug and play.

In fact, I found a memory leak this way not long ago.

Super useful having this on your infra, saves a lot of time.
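
As a rough illustration of the kind of leak check node_exporter makes possible (not necessarily how this commenter found theirs), one can extrapolate available memory forward with predict_linear; the Prometheus address is a placeholder:

    # Minimal sketch: flag hosts whose available memory is trending toward zero,
    # using node_exporter metrics via the Prometheus HTTP API (address assumed).
    import requests

    PROM = "http://localhost:9090"   # hypothetical Prometheus address
    # Fit the last 6h of MemAvailable and project 24h ahead; a negative projection
    # suggests something is steadily eating memory.
    QUERY = "predict_linear(node_memory_MemAvailable_bytes[6h], 86400)"

    result = requests.get(f"{PROM}/api/v1/query",
                          params={"query": QUERY}).json()["data"]["result"]
    for series in result:
        if float(series["value"][1]) < 0:
            print(f"{series['metric'].get('instance', '?')}: projected to run out of memory")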