August 5th, 2024

Phantom Menace: the memory leak that wasn't there

The author's investigation into a perceived memory leak in a Rust application revealed that the problem was a misreading of Grafana metrics, underscoring the importance of understanding how memory metrics are calculated when debugging.

The blog post discusses the author's experience with a perceived memory leak in a legacy Rust application during its migration to Kubernetes. The application, which processes images using ImageMagick, initially appeared to show steady memory growth, prompting concerns about a leak. Despite Rust's reputation for memory safety, the author suspected the Foreign Function Interface (FFI) boundary with ImageMagick. Various tools, including eBPF-based tracing and heaptrack, were used to trace memory usage, but the results indicated no actual leak. The author then turned to jemalloc's profiling features, which confirmed that live memory usage was stable over time. Ultimately, the investigation revealed that the Grafana dashboard was misleading: the metric it charted, container_memory_working_set_bytes, includes page cache and so did not reflect the application's own memory usage. The author concluded that the supposed memory leak was a "phantom menace," emphasizing the importance of understanding how metrics are calculated and the need for thorough investigation before jumping to conclusions.
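
As an aside, and not the author's exact setup: jemalloc profiling of this kind is typically wired into a Rust service by installing jemalloc as the global allocator. The `tikv-jemallocator` crate and the `MALLOC_CONF` options below are assumptions about a common configuration, not details taken from the post.

```rust
// Minimal sketch: route all allocations through jemalloc so its heap
// profiler can show whether live memory actually grows over time.
// Assumes the crate is built with its "profiling" feature enabled.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // With e.g. MALLOC_CONF="prof:true,prof_active:true,lg_prof_interval:30"
    // set in the container, jemalloc periodically dumps heap profiles that
    // can be inspected with `jeprof` and compared across time.
    run_image_pipeline();
}

// Placeholder for the application's ImageMagick/FFI work.
fn run_image_pipeline() {}
```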

- The perceived memory leak in a Rust application was ultimately a misunderstanding of metrics.

- Tools like heaptrack and jemalloc profiling were essential in diagnosing the issue.

- Grafana dashboard metrics were misleading, leading to incorrect assumptions about memory usage.

- Understanding how memory metrics are calculated is crucial in debugging applications; a short sketch of how the working-set metric is derived follows this list.

- Collaboration and documentation are vital in troubleshooting complex issues.
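
For context on why the dashboard was misleading: cAdvisor's container_memory_working_set_bytes is roughly total cgroup memory usage minus inactive file cache, so active page cache still counts toward it. The sketch below (cgroup v2 paths assumed; not code from the post) contrasts that value with the process's own resident set size.

```rust
use std::fs;

// Return the first numeric field following `key` in a key/value text file
// such as /sys/fs/cgroup/memory.stat or /proc/self/status.
fn field(text: &str, key: &str) -> u64 {
    text.lines()
        .find(|l| l.starts_with(key))
        .and_then(|l| l[key.len()..].split_whitespace().next())
        .and_then(|v| v.parse().ok())
        .unwrap_or(0)
}

fn main() {
    // Total memory charged to the cgroup, including page cache.
    let usage: u64 = fs::read_to_string("/sys/fs/cgroup/memory.current")
        .unwrap_or_default()
        .trim()
        .parse()
        .unwrap_or(0);
    let stat = fs::read_to_string("/sys/fs/cgroup/memory.stat").unwrap_or_default();

    // Approximation of the "working set": usage minus inactive file cache.
    // Active page cache remains included, which is why this line can climb
    // while the application's own allocations stay flat.
    let working_set = usage.saturating_sub(field(&stat, "inactive_file"));

    let status = fs::read_to_string("/proc/self/status").unwrap_or_default();
    let rss_kib = field(&status, "VmRSS:");

    println!("working set ~ {} bytes, VmRSS ~ {} KiB", working_set, rss_kib);
}
```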

3 comments
By @JohnMakin - 7 months
Great writeup, have definitely been bitten by container_memory_working_set_bytes before - in fact, when I first read the description of the dashboard I was wondering if it was something like that. This is one of the most difficult parts of working in SRE or DevOps: often each side of the conversation lacks the necessary context to understand what the problem is. As the author mentioned, this is still a worthwhile metric to monitor, but its meaning is not necessarily clear in this context.

Kudos to this author for digging in - as a DevOps/SRE guy, I'd imagine that in companies I have worked for this conversation would often go like "Something is wrong with your dashboard," with my team replying "something is wrong with your application," and nothing getting done for months while managers point fingers and figure out whose problem it is.

By @jeffbee - 7 months
> Monitoring container_memory_working_set_bytes still makes sense since it is the metric that the kubelet uses to kill the pod when it exceeds the limits.

This is not a great way to describe it. When a container runs out of memory, the kernel's OOM killer ends it, and the kubelet is not involved. This is the main way that most users will experience OOM conditions. Note that it is not possible for RSS to exceed the limit, because the task is killed the instant it tries to realize memory that would have put it over the limit.

When the node is under total memory pressure then the kubelet uses working set to rank pods for eviction. Working set is used for that in an attempt to attribute kernel resources like page caches to each control group. But eviction due to node memory pressure should be rare.
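
As a hedged illustration of the distinction above (not from the comment or the article): whether the kernel's OOM killer has acted on a container can be read from the cgroup's memory.events counters, assuming cgroup v2 mounted at the usual location.

```rust
use std::fs;

// Sketch (cgroup v2 assumed): the kernel increments `oom_kill` in
// memory.events when it kills a task for exceeding the cgroup's memory
// limit -- the path most users hit, with no kubelet involvement.
fn main() {
    let events = fs::read_to_string("/sys/fs/cgroup/memory.events").unwrap_or_default();
    for line in events.lines() {
        if let Some(count) = line.strip_prefix("oom_kill ") {
            println!("kernel OOM kills in this cgroup: {}", count.trim());
        }
    }
}
```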

By @jmugan - 7 months
Love the title!