August 30th, 2024

Profiling with Ctrl-C

Ctrl-C profiling is an effective method for identifying performance issues in programs, offering a simpler alternative to traditional profilers, especially in challenging environments and for less experienced users.


The article discusses the concept of "Ctrl-C profiling," a method of using a debugger to analyze a program's call stack by interrupting its execution. The author reflects on their initial skepticism towards this approach, believing it inadequate for complex problems. However, they have come to appreciate its effectiveness for simpler issues, particularly in challenging environments. The author shares personal experiences where Ctrl-C profiling helped identify performance bottlenecks, such as slow startup times due to a JSON parser and inefficiencies in the LLD linker. They argue that while traditional profilers have their place, Ctrl-C profiling is often easier to implement and can yield quick insights without the need for extensive setup or interpretation of complex outputs. The article concludes that while Ctrl-C profiling may not replace all profiling methods, it serves as a valuable tool for quickly diagnosing problems, especially for those less familiar with more sophisticated profiling techniques.

- Ctrl-C profiling can effectively identify performance issues in programs.

- It is a simpler alternative to traditional profilers, requiring less setup and interpretation.

- The method is particularly useful in unfriendly environments or for users with limited experience.

- While not a replacement for all profiling methods, it offers quick insights into program behavior.

- The author emphasizes the practicality of using Ctrl-C profiling for everyday programming challenges.

AI: What people are saying
The comments reflect a variety of perspectives on profiling and debugging techniques in programming.
  • Users share creative and unconventional methods for profiling, such as using timers or simple hacks in embedded systems.
  • Some commenters discuss more systematic approaches, like using tools such as rr for recording and analyzing program behavior.
  • There are mentions of challenges with existing tools, including issues with gdb and DWARF data handling.
  • Several users express a preference for simpler debugging techniques over complex profilers, emphasizing practicality.
  • There is a general sentiment of frustration with the limitations of current profiling tools and the need for better solutions.
11 comments

By @exmadscientist - 8 months
My favorite hack along these lines was to put a timer/ISR on an embedded system that did nothing more than crawl up the stack frame the two or three addresses that the ISR used (yep, it was really just as dumb as [sp + 8] or whatever), and then dump that address to the serial terminal every second or so.

You can fix a lot of stupid problems that way. (And most problems are stupid.) Yes, yes, a real profiler would be better, but if you don't have the fancy tools because your employer doesn't buy you such things, and it's a primitive and cruddy embedded system so there's no obvious better way to do it, and you built this horrible hack right now and... hey, the hack solved the problem, and what do you know? it keeps on solving things....

By @dzaima - 8 months
For something more systematic/reproducible, it's possible to use rr[1] to record the program, and in a replay run to the end (or whatever boundaries you care about), run "when-ticks", and do various "seek-ticks 123456789" below that number to seek to various points in the recording.

I've made a thing[2] that can display that within a visual timeline (interpolated between ticks of the nearest syscalls/events, which do have known real time), essentially giving a sampling flamegraph that can be arbitrarily zoomed-in, with the ability to interact with it at any point in gdb.

Though this is not without its issues - rr's ticks count retired conditional branches, so a loop with a 200-instruction body takes up the same amount of ticks, and thus visual space, as one with a 5-instruction body; and of course more low-level things like mispredicts/stalls/IPC are entirely lost.

[1]: https://rr-project.org/

[2]: https://github.com/dzaima/grr

By @dooglius - 8 months
I wonder how hard it would be to have a profiler dump a big chunk of stack on each sample interrupt, convert these into core dump format, and then use gdb or whatever to decode the traces for analysis? This ought to have the touted benefits without the downside of it being slow to capture a bunch of samples.
By @ivoras - 8 months
Speaking of keyboard shortcuts, I miss BSD's Ctrl-T and SIGINFO. It often helped to see if a process was hung.
By @Cybergenik - 8 months
>Apparently gcc generates some DWARF data that gdb is slow to handle. The GNU linker fixes this data, so that gdb doesn’t end up handling it slowly. LLD refuses to emulate this behavior of the GNU linker, because it’s gcc’s fault to have produced that DWARF data in the first place. And gdb refuses to handle LLD’s output efficiently, because it’s LLD’s fault to not have handled gcc’s output the way the GNU linker does. So I just remove -ggdb3 - it gives you a bit richer debug info, but it’s not worth the slower linking with gold instead of LLD, nor the slowdown in gdb that you get with LLD. And everyone links happily ever after.

lol, it's a story as old as time. The infinite loop of ego-entrenched developers not wanting to change something over some trivial, inconsequential disagreement. The bike shed will be built my way!

By @omgtehlion - 8 months
I mostly use GUI-based debuggers (and profilers), but even in this case I found it often useful to pause the program at random times when it appears "stuck".

Most of the time I don't even need to reach for a profiler proper.

By @lelanthran - 8 months
> what do you know, there’s one billion stack frames from the nlohmann JSON parser, I guess it all gets inlined in the release build;

My guess would be that it's because tail-call optimisation only happens in -O2 and above.

Parsing recursively is frequently the cleanest way to implement a parser of tree-structured input, after all.

If you're doing anything recursively, it makes sense to slightly restructure the recursive call to be the last call in the scope, so that TCO can be applied.

By @kreyenborgi - 8 months
By @Angostura - 8 months
I kept waiting for the guy to actually paste the code somewhere
By @binary132 - 8 months
Signal handler integrations are underrated and great.