Prometheus metrics saved us from painful kernel debugging
The Prometheus host metrics system detected steadily increasing slab memory usage after an Ubuntu 22.04 kernel upgrade. The cause was traced to AppArmor having been disabled on the kernel command line, and reverting that change averted out-of-memory crashes. The monitoring setup proved crucial for swift issue resolution.
The author describes how their Prometheus host metrics system helped them identify a critical issue with increasing slab memory usage on Ubuntu 22.04 servers after a kernel upgrade. Despite no obvious memory-hogging processes, the metrics revealed a systemic problem affecting all servers. They traced the issue to a kernel command line change disabling AppArmor, which caused a memory leak. By reverting the change and scheduling reboots, they averted potentially widespread out-of-memory crashes. The author emphasizes the importance of their monitoring setup, which quickly pinpointed the problem and prevented misattributing it to the kernel upgrade itself. They highlight the significance of having both real-time and historical metrics in resolving the issue efficiently. The incident underscores the value of robust metrics systems in proactively addressing critical infrastructure issues.
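As an illustration of the kind of check involved (not the author's exact setup), slab usage exposed by node_exporter's node_memory_Slab_bytes metric can be queried over the Prometheus HTTP API; the server URL and the 1 GiB / 7 day threshold below are assumptions for the sketch:

    # Sketch: flag hosts whose kernel slab memory has grown significantly.
    # Assumes node_exporter's node_memory_Slab_bytes metric and a Prometheus
    # server at http://localhost:9090; adjust both for a real deployment.
    import requests

    PROM_URL = "http://localhost:9090/api/v1/query"
    # Hosts whose slab usage grew by more than 1 GiB over the past 7 days.
    QUERY = "node_memory_Slab_bytes - node_memory_Slab_bytes offset 7d > 1e9"

    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "unknown")
        growth = float(result["value"][1])
        print(f"{instance}: slab grew by {growth / 2**30:.2f} GiB in 7 days")

Run ad hoc or wired into an alert rule, a query along these lines is what turns a slow systemic leak into something visible well before the OOM killer gets involved.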
Related
CVE-2021-4440: A Linux CNA Case Study
The Linux CNA mishandled CVE-2021-4440 in the 5.10 LTS kernel, causing information leakage and KASLR defeats. The issue affected Debian Bullseye and SUSE's 5.3.18 kernel, resolved in version 5.10.218.
The weirdest QNX bug I've ever encountered
The author encountered a CPU usage bug in a QNX system's 'ps' utility due to a 15-year-old bug. Debugging revealed a race condition, leading to code modifications and a shift towards open-source solutions.
From Cloud Chaos to FreeBSD Efficiency
A client shifted from expensive Kubernetes setups on AWS and GCP to cost-effective FreeBSD jails and VMs, improving control, cost savings, and performance. Real-world tests favored FreeBSD over cloud solutions, emphasizing efficient resource management.
How we tamed Node.js event loop lag: a deepdive
Trigger.dev team resolved Node.js app performance issues caused by event loop lag. Identified Prisma timeouts, network congestion from excessive traffic, and nested loop inefficiencies. Fixes reduced event loop lag instances, aiming to optimize payload handling for enhanced reliability.
Debugging an evil Go runtime bug: From heat guns to kernel compiler flags
Crashes in node_exporter on a laptop were traced to a single bad RAM bit, underscoring the importance of ECC RAM for server reliability. The bad RAM block was excluded using a GRUB 2 feature, and the RAM was heated to test its behavior under stress.
Even after all apps had been restarted, the elevated memory usage persisted. It turned out to be a leak of slab memory allocations by a kernel module. That kernel module had since been updated, but all previous versions were still loaded by the kernel, so the leak persisted until the next reboot.
The leaky kernel module was CrowdStrike’s Falcon sensor. That kicked off a discussion: engineering had no option but to run these agents for the sake of security, yet there were no instances where the sensor actually caught anything, while it had every potential to cause incidents and outages.
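As a minimal sketch of confirming such a leak on a single host, the slab counters can be read straight from /proc/meminfo, the same source node_exporter scrapes for its slab metrics (per-cache detail would need slabtop or /proc/slabinfo, which require root):

    # Sketch: report kernel slab memory on a Linux host by parsing /proc/meminfo,
    # the same source node_exporter reads for its node_memory_Slab_bytes metric.
    def slab_fields() -> dict:
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                if key in ("Slab", "SReclaimable", "SUnreclaim"):
                    fields[key] = int(rest.strip().split()[0])  # value is in kB
        return fields

    if __name__ == "__main__":
        for name, kib in slab_fields().items():
            print(f"{name}: {kib / 1024:.1f} MiB")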
It turned out Docker Swarm made undocumented† use of a UDP port that some VMware product also used, and once in a while they'd cross the streams.
We only figured it out because we put every system event we could find onto a Grafana graph and narrowed down which ones kept happening at the same time.
† I think? It's been a while, might have just been hard to find.
I actually made a Grafana dashboard[0] for it, but haven’t used this in a while myself.
[0]: https://grafana.com/grafana/dashboards/11801-i-o-statistics/
In fact, I found a memory leak this way not long ago.
Super useful having this on your infra, saves a lot of time.