August 5th, 2024

Phantom Menace: the memory leak that wasn't there

The author's investigation into a perceived memory leak in a Rust application revealed that the problem was a misreading of Grafana metrics, underscoring the importance of understanding how memory metrics are calculated when debugging.

The blog post discusses the author's experience with a perceived memory leak in a legacy Rust application during its migration to Kubernetes. The application, which processes images using ImageMagick, initially appeared to show steady memory growth, prompting concerns about a leak. Despite Rust's reputation for memory safety, the author suspected the Foreign Function Interface (FFI) boundary with ImageMagick. Various tools, including eBPF-based tracing and heaptrack, were used to trace memory usage, but the results indicated no actual leak. The author then turned to jemalloc's profiling features, which confirmed that live memory usage was stable over time. Ultimately, the investigation revealed that the Grafana dashboard was misleading: the metric it charted, container_memory_working_set_bytes, includes page cache and so did not reflect the application's own memory usage. The author concluded that the supposed memory leak was a "phantom menace," emphasizing the importance of understanding how metrics are calculated and the need for thorough investigation before jumping to conclusions.
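
As an aside, and not the author's exact setup: jemalloc profiling of this kind is typically wired into a Rust service by installing jemalloc as the global allocator. The `tikv-jemallocator` crate and the `MALLOC_CONF` options below are assumptions about a common configuration, not details taken from the post.

```rust
// Minimal sketch: route all allocations through jemalloc so its heap
// profiler can show whether live memory actually grows over time.
// Assumes the crate is built with its "profiling" feature enabled.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // With e.g. MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:30"
    // set in the container, jemalloc periodically dumps heap profiles that
    // can be inspected with `jeprof` and compared across time.
    run_image_pipeline();
}

// Placeholder for the application's ImageMagick/FFI work.
fn run_image_pipeline() {}
```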

- The perceived memory leak in a Rust application was ultimately a misunderstanding of metrics.

- Tools like heaptrack and jemalloc profiling were essential in diagnosing the issue.

- Grafana dashboard metrics were misleading, leading to incorrect assumptions about memory usage.

- Understanding how memory metrics are calculated is crucial in debugging applications; a short sketch of how the working-set metric is derived follows this list.

- Collaboration and documentation are vital in troubleshooting complex issues.
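
For context on why the dashboard was misleading: cAdvisor's container_memory_working_set_bytes is roughly total cgroup memory usage minus inactive file cache, so active page cache still counts toward it. The sketch below (cgroup v2 paths assumed; not code from the post) contrasts that value with the process's own resident set size.

```rust
use std::fs;

// Return the first numeric field following `key` in a key/value text file
// such as /sys/fs/cgroup/memory.stat or /proc/self/status.
fn field(text: &str, key: &str) -> u64 {
    text.lines()
        .find(|l| l.starts_with(key))
        .and_then(|l| l[key.len()..].split_whitespace().next())
        .and_then(|v| v.parse().ok())
        .unwrap_or(0)
}

fn main() {
    // Total memory charged to the cgroup, including page cache.
    let usage: u64 = fs::read_to_string("/sys/fs/cgroup/memory.current")
        .unwrap_or_default()
        .trim()
        .parse()
        .unwrap_or(0);
    let stat = fs::read_to_string("/sys/fs/cgroup/memory.stat").unwrap_or_default();

    // Approximation of the "working set": usage minus inactive file cache.
    // Active page cache remains included, which is why this line can climb
    // while the application's own allocations stay flat.
    let working_set = usage.saturating_sub(field(&stat, "inactive_file"));

    let status = fs::read_to_string("/proc/self/status").unwrap_or_default();
    let rss_kib = field(&status, "VmRSS:");

    println!("working set ~ {} bytes, VmRSS ~ {} KiB", working_set, rss_kib);
}
```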

3 comments
By @JohnMakin - 7 months
Great writeup, have definitely been bitten by container_memory_working_set_bytes before - in fact, when I first read the description of the dashboard I was wondering if it was something like that. This is one of the most difficult parts of working in SRE or DevOps: often each side of the conversation lacks the necessary context to understand what the problem is. As the author mentioned, this is still a worthwhile metric to monitor, but its meaning is not necessarily clear in this context.

Kudos to this author for digging in - as a DevOps/SRE guy, I'd imagine that in companies I have worked for this conversation would often go like "Something is wrong with your dashboard," with my team replying "something is wrong with your application," and nothing getting done for months while managers point fingers and figure out whose problem it is.

By @jeffbee - 7 months
> Monitoring container_memory_working_set_bytes still makes sense since it is the metric that the kubelet uses to kill the pod when it exceeds the limits.

This is not a great way to describe it. When a container runs out of memory, the kernel's OOM killer ends it, and the kubelet is not involved. This is the main way that most users will experience OOM conditions. Note that it is not possible for RSS to exceed the limit, because the task is killed the instant it tries to realize memory that would have put it over the limit.

When the node is under total memory pressure then the kubelet uses working set to rank pods for eviction. Working set is used for that in an attempt to attribute kernel resources like page caches to each control group. But eviction due to node memory pressure should be rare.
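
As a hedged illustration of the distinction above (not from the comment or the article): whether the kernel's OOM killer has acted on a container can be read from the cgroup's memory.events counters, assuming cgroup v2 mounted at the usual location.

```rust
use std::fs;

// Sketch (cgroup v2 assumed): the kernel increments `oom_kill` in
// memory.events when it kills a task for exceeding the cgroup's memory
// limit -- the path most users hit, with no kubelet involvement.
fn main() {
    let events = fs::read_to_string("/sys/fs/cgroup/memory.events").unwrap_or_default();
    for line in events.lines() {
        if let Some(count) = line.strip_prefix("oom_kill ") {
            println!("kernel OOM kills in this cgroup: {}", count.trim());
        }
    }
}
```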

By @jmugan - 7 months
Love the title!