July 21st, 2024

Prometheus metrics save us from painful kernel debugging

The Prometheus host metrics system detected increasing slab memory usage after an Ubuntu 22.04 kernel upgrade. The cause turned out to be a kernel command line change that disabled AppArmor; reverting it averted out-of-memory crashes. The monitoring setup was crucial for resolving the issue quickly.

The author describes how their Prometheus host metrics system helped them identify a critical issue with increasing slab memory usage on Ubuntu 22.04 servers after a kernel upgrade. Despite no obvious memory-hogging processes, the metrics revealed a systemic problem affecting all servers. They traced the issue to a kernel command line change disabling AppArmor, causing a memory leak. By reverting the change and scheduling reboots, they averted potential widespread out-of-memory crashes. The author emphasizes the importance of their monitoring setup, which quickly pinpointed the problem and prevented misattributing it to the kernel upgrade. They highlight the significance of having both real-time and historical metrics in resolving the issue efficiently. The incident underscores the value of robust metrics systems in proactively addressing critical infrastructure issues.
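
To make the kind of check described above concrete, here is a minimal sketch (not the author's actual setup) that asks a Prometheus server for node_exporter's slab metric over the last day and reports how much it grew per host; the server address is a placeholder.

    # Minimal sketch: query Prometheus (address is an assumption) for node_exporter's
    # slab metric over the last 24 hours and report per-host growth.
    import time
    import requests

    PROM = "http://localhost:9090"        # hypothetical Prometheus address
    QUERY = "node_memory_Slab_bytes"      # exported by node_exporter from /proc/meminfo

    end = time.time()
    start = end - 24 * 3600
    resp = requests.get(f"{PROM}/api/v1/query_range",
                        params={"query": QUERY, "start": start, "end": end, "step": "15m"})
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        host = series["metric"].get("instance", "?")
        values = [float(v) for _, v in series["values"]]
        growth_mib = (values[-1] - values[0]) / 2**20
        print(f"{host}: slab memory changed by {growth_mib:+.1f} MiB over the last day")

Steady growth across every host, with no matching growth in any process, is the same systemic signature the author describes.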

5 comments
By @tass - 6 months
This awakened a memory from last year, when a colleague and I were trying to understand where an increase in Linux memory usage was coming from on machines that hadn't been rebooted in a while. We had been alerted to it by Prometheus metrics.

Even after all apps had been restarted, it persisted. It turned out to be a leak of slab memory allocations by a kernel module. That kernel module had since been updated, but all previous versions were still loaded by the kernel, so the leak persisted until the next reboot (see the sketch after this comment for one way to spot which caches are growing).

The leaky kernel module was CrowdStrike's Falcon sensor. It started a discussion about how engineering had no option but to run these things for the sake of security: there were no instances where it actually caught anything, but it had the potential to cause incidents and outages.
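
For anyone chasing a similar slab leak, one rough way to see which caches are involved is to read /proc/slabinfo directly; a minimal sketch, using the approximation that in-use bytes are roughly object count times object size:

    # Minimal sketch: list the largest slab caches from /proc/slabinfo
    # (reading it generally requires root; field layout per `man 5 slabinfo`).
    def top_slab_caches(n=10):
        with open("/proc/slabinfo") as f:
            lines = f.readlines()[2:]          # skip the version line and column header
        caches = []
        for line in lines:
            fields = line.split()
            name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
            caches.append((num_objs * objsize, name))  # rough bytes held by this cache
        for size, name in sorted(caches, reverse=True)[:n]:
            print(f"{name:30s} {size / 2**20:8.1f} MiB")

    top_slab_caches()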

By @spiffytech - 6 months
My team spent weeks using log-aggregated metrics to gradually figure out why servers' clocks would go out of whack.

It turned out Docker Swarm made undocumented† use of a UDP port that some VMware product also used, and once in a while they'd cross the streams.

We only figured it out because we put every system event we could find onto a Grafana graph and narrowed down which ones kept happening at the same time.

† I think? It's been a while, might have just been hard to find.
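
One way to get "every system event we could find" onto a Grafana graph, as described above, is to push each event as a Grafana annotation over its HTTP API; a minimal sketch, with the Grafana URL and API token as placeholders:

    # Minimal sketch: record a system event as a Grafana annotation so it can be
    # overlaid on dashboard graphs. URL and token are placeholders.
    import time
    import requests

    GRAFANA = "http://grafana.example.internal:3000"   # hypothetical Grafana address
    TOKEN = "REPLACE_ME"                               # a Grafana API token

    def annotate(text, tags):
        resp = requests.post(
            f"{GRAFANA}/api/annotations",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"time": int(time.time() * 1000),     # epoch milliseconds
                  "tags": tags,
                  "text": text},
        )
        resp.raise_for_status()

    annotate("time sync daemon stepped the system clock", ["clock", "timesync"])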

By @louwrentius - 6 months
As a side note, I'm into storage performance, and the node-exporter data is absolutely spot on. I performed storage benchmarks with FIO, and the metrics matched both the applied load and the OS-reported metrics (iostat) perfectly.

I actually made a Grafana dashboard[0] for it, but haven’t used this in a while myself.

[0]: https://grafana.com/grafana/dashboards/11801-i-o-statistics/
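
For anyone wanting to spot-check node_exporter's disk numbers against iostat during an fio run, a minimal sketch (Prometheus address assumed) that prints per-device throughput:

    # Minimal sketch: ask Prometheus (address is an assumption) for per-device
    # throughput as seen by node_exporter, for comparison with iostat output.
    import requests

    PROM = "http://localhost:9090"   # hypothetical Prometheus address
    for metric in ("node_disk_read_bytes_total", "node_disk_written_bytes_total"):
        query = f"rate({metric}[1m])"
        result = requests.get(f"{PROM}/api/v1/query",
                              params={"query": query}).json()["data"]["result"]
        for series in result:
            device = series["metric"].get("device", "?")
            mb_per_s = float(series["value"][1]) / 1e6
            print(f"{metric} {device}: {mb_per_s:.1f} MB/s")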

By @lordnacho - 6 months
Sounds like they have node_exporter on their machines. Not the JS runtime, but the Prometheus/Grafana piece that gives you all the meters for a generic system monitoring dashboard. Disk usage, CPU, memory: it's all set up already, just plug and play.

In fact, I found a memory leak this way not long ago.

Super useful having this on your infra, saves a lot of time.
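
As a rough illustration of the kind of leak check node_exporter makes possible (not necessarily how this commenter found theirs), one can extrapolate available memory forward with predict_linear; the Prometheus address is a placeholder:

    # Minimal sketch: flag hosts whose available memory is trending toward zero,
    # using node_exporter metrics via the Prometheus HTTP API (address assumed).
    import requests

    PROM = "http://localhost:9090"   # hypothetical Prometheus address
    # Fit the last 6h of MemAvailable and project 24h ahead; a negative projection
    # suggests something is steadily eating memory.
    QUERY = "predict_linear(node_memory_MemAvailable_bytes[6h], 86400)"

    result = requests.get(f"{PROM}/api/v1/query",
                          params={"query": QUERY}).json()["data"]["result"]
    for series in result:
        if float(series["value"][1]) < 0:
            print(f"{series['metric'].get('instance', '?')}: projected to run out of memory")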