August 5th, 2024

Debugging a rustc segfault on illumos

The author debugged a segmentation fault in the Rust compiler on illumos while compiling `cranelift-codegen`, using various tools and collaborative sessions to analyze the issue within the parser.

The article discusses the author's experience debugging a segmentation fault in the Rust compiler while working on the illumos operating system, specifically within the context of the Helios distribution used at Oxide. The author encountered a consistent crash (SIGSEGV) when attempting to compile the `cranelift-codegen` library. To address the issue, the author utilized various debugging tools available in illumos, including the Modular Debugger (mdb) to analyze core dumps generated during the crash. The debugging session was collaborative, involving colleagues during a virtual meetup. The author explained the bootstrapping process of the Rust compiler, which is self-hosting and requires careful management of compiler versions. The investigation revealed that the crash occurred within the Rust compiler's parser, specifically during a recursive descent parsing operation. The author highlighted the importance of examining CPU registers and the call stack to understand the state of the program at the time of the crash. The article serves as a guide for technologists interested in systems programming and debugging, providing insights into the tools and methodologies used in the process.

- The author debugged a segmentation fault in the Rust compiler on illumos.

- The crash was consistent and occurred while compiling `cranelift-codegen`.

- Collaborative debugging sessions were held with colleagues to address the issue.

- The Rust compiler's bootstrapping process was explained, emphasizing its self-hosting nature.

- The crash was traced to the Rust parser, highlighting the challenges of recursive descent parsing.
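A minimal sketch can show why recursive descent parsing puts pressure on the stack: each level of nesting in the input costs one stack frame. The parser below is hypothetical (`parse_nested` and `depth_on_thread` are illustrative names, not rustc code). It runs the recursion on a thread whose stack size is set up front with `std::thread::Builder::stack_size`, which is a simpler stand-in for growing the stack on demand and is not how rustc itself handles deep recursion.

```rust
use std::thread;

/// Hypothetical recursive descent parser for nested parentheses.
/// Returns the maximum nesting depth, or None if the input is unbalanced.
fn parse_nested(input: &[u8], pos: &mut usize) -> Option<usize> {
    let mut max_depth = 0;
    while *pos < input.len() && input[*pos] == b'(' {
        *pos += 1; // consume '('
        // One stack frame per nesting level: deep input means a deep stack.
        let inner = parse_nested(input, pos)?;
        if *pos >= input.len() || input[*pos] != b')' {
            return None; // missing closing ')'
        }
        *pos += 1; // consume ')'
        max_depth = max_depth.max(inner + 1);
    }
    Some(max_depth)
}

/// Runs the parser on a thread with an explicitly sized stack.
/// The stack size is the caller's guess; it does not grow on demand.
fn depth_on_thread(input: String, stack_bytes: usize) -> Option<usize> {
    let handle = thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || {
            let mut pos = 0;
            let depth = parse_nested(input.as_bytes(), &mut pos)?;
            // Reject trailing input such as a stray ')'.
            if pos == input.len() { Some(depth) } else { None }
        })
        .expect("failed to spawn parser thread");
    handle.join().expect("parser thread panicked")
}
```

For example, `depth_on_thread("(())".to_string(), 1 << 20)` returns `Some(2)`. The weakness of a fixed size is that sufficiently deep input can still overrun it, which is why growing the stack when it runs low is attractive for a compiler.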

Related

My experience crafting an interpreter with Rust (2021)

Manuel Cerón details creating an interpreter with Rust, transitioning from Clojure. Leveraging Rust's safety features, he faced challenges with closures and classes, optimizing code for performance while balancing safety.

Mix-testing: revealing a new class of compiler bugs

A new "mix testing" approach uncovers compiler bugs by compiling test fragments with different compilers. Examples show issues in x86 and Arm architectures, emphasizing the importance of maintaining instruction ordering. Luke Geeson developed a tool to explore compiler combinations, identifying bugs and highlighting the need for clearer guidelines.

Rust for Filesystems

At the 2024 Linux Summit, Wedson Almeida Filho and Kent Overstreet explored Rust for Linux filesystems. Rust's safety features offer benefits for kernel development, despite concerns about compatibility and adoption challenges.

How to Compile Your Language – Guide to implement a modern compiler for language

This guide introduces programming language design and modern compiler implementation, emphasizing language purpose, syntax familiarity, and compiler components, while focusing on frontend development using LLVM, with source code available on GitHub.

Crafting Interpreters with Rust: On Garbage Collection

Tung Le Vo discusses implementing a garbage collector for the Lox programming language using Rust, addressing memory leaks, the mark-and-sweep algorithm, and challenges posed by Rust's ownership model.

AI: What people are saying
The comments on the article reflect a mix of appreciation for the author's insights and technical discussions about debugging and compiler behavior.
  • Many readers found the article informative and well-structured, praising the author's ability to explain complex topics.
  • There is a consensus on the importance of understanding discrepancies in debugging, with some comments highlighting the challenges of post-mortem analysis.
  • Concerns were raised about the default behavior of core dumps on Unix systems, suggesting a need for better security practices.
  • Several commenters reminisced about past experiences with debugging, noting it as a "lost art" in modern development.
  • Technical discussions included comparisons between different operating systems' handling of stack management and debugging tools.
14 comments
By @bcantrill - 6 months
I am (obviously?) biased, but this is a great read by Rain, as it takes the reader through not just some of the illumos tooling, but also how compilers need to bootstrap themselves -- and why heterogeneous platforms are important. (As Rain elaborates in the piece, this issue was seen on illumos, but is in fact lurking on other platforms.)
By @deathanatos - 6 months
(heavily paraphrasing)

> [the core dump is supposed to be in the CWD, and named core, but isn't; what gives?]

Followed by,

  $ find / -name core -type f
Is a sort of hilarious brute force solution. But it demonstrates a particular kind of problem, where

  /-- requires -- evidence
  |                   ^
  v                   |
  answer -- requires -/
These are pesky. The brute force search is a good idea, in that it breaks that cycle of almost needing to know the answer in order to discover it. (Unless you can surmise that the CWD is the crate dir, but let's assume that we don't want to depend on having such a moment of sheer "eureka!".)

> But there are also other bits of evidence that this theory doesn’t explain, or even cuts against. (This is what makes post-mortem debugging exciting! There are often contradictory-seeming pieces of information that need to be explained.)

I wish more people appreciated this; too many people are apt to sweep such discrepancies under the rug. This post does a good job of not just following through on them, but also showing how figuring some of them out ("why is our stack weird?") leads to the key insights: "oh we're using stacker and … $the_bug".

I do wonder how the author managed to notice that line in a 1.5k line stack trace, though. The "abrupt" address change would have probably gone unnoticed by me. (The only saving grace being a.) it's close to the bottom b.) most of the rest is repetitive, an artifact of a recursive descent parser recursing, and if we just consider that repetition "one chunk", it gets a lot smaller. I still dunno if I'd've seen it, though.)

By @dwattttt - 6 months
To rustc not calling stacker enough/at the right times, the behaviour on MSVC/Windows is for the compiler to rely on hitting the OS's guard page to extend the stack (rather than growing it yourself), but also for the compiler to emit a special routine in any function that uses more than a page of stack frame (to make sure the first thing the function does is poke every page in order, so the OS can grow the stack the right amount).
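The page-by-page probing described in this comment can be illustrated with a small conceptual sketch. The function below is hypothetical (`probe_offsets` is an illustrative name, not a compiler routine): it only computes which offsets a probe would touch for a given frame size, assuming the caller supplies the page size. It is not the code a real compiler emits.

```rust
/// Conceptual sketch: offsets (in bytes below the frame's starting stack
/// pointer) that a probe would touch, one per page, nearest page first.
/// Frames no larger than one page need no probe in this model.
fn probe_offsets(frame_size: usize, page_size: usize) -> Vec<usize> {
    assert!(page_size > 0, "page size must be non-zero");
    if frame_size <= page_size {
        return Vec::new();
    }
    // Touch each whole page in order, so every access lands on the page
    // directly below the previous one and never skips past the guard page.
    let mut offsets: Vec<usize> = (1..=frame_size / page_size)
        .map(|i| i * page_size)
        .collect();
    // Touch the far end of the frame too if it is not page-aligned.
    if frame_size % page_size != 0 {
        offsets.push(frame_size);
    }
    offsets
}
```

With a 4096-byte page, `probe_offsets(10_000, 4096)` yields `[4096, 8192, 10000]`. The ordering is the point: a single access far below the guard page would land in unmapped memory instead of triggering stack growth.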
By @fch42 - 6 months
Man this brings back so many memories :-)

Definitely a fun read. Debugging crashes has, in the last decade or so, become something a bit like a "lost art". No one looks at coredumps in the cloud ...

I don't want to outdo you on Solaris debugging (plenty of old-time Solaris folks at Oxide who are totally capable to show how to get things like open files and their contents from a coredump, or how to configure the system to include those should it not be there ... etc ... etc ... Solaris has the best coredumps for all that's worth ...).

A note on the fix side of things though, while adding pthread_get_attr_np() for stack location/size gives Solaris the Linux interface, it already has its own for those - pthread_attr_getstack{size,addr}(), see https://docs.oracle.com/cd/E19455-01/806-5257/6je9h032l/inde... - I happen to remember this because I used this decades ago somewhere in the Solaris name lookup code to choose at runtime between using alloca() and malloc() ... don't ask. Those were different times.

By @unwind - 6 months
Great read, thanks!

One minor meta point if the author is (still) around: there is something strange with the styling of the hexadecimal literals in the code. Instead of having the prefix "0x", they look like "0×" even though they seem to be normal x:es in the source.

Edit: Firefox 128.0.3 on Linux, btw.

By @CodesInChaos - 6 months
> Generally, on Unix systems the default is to generate a file named core in the current directory of the crashing process.

Sounds like a horrible default. That's a security risk (working directory might be readable by untrusted users), and pollutes a random directory with a file that could cause problems for other applications processing files in that directory.

A fixed location inside the user's home directory feels like a much better choice to me.

By @sbt567 - 6 months
Amazing read! This article beautifully guides you through each step and makes sure you get enough context for the next step. Bookmarking this!
By @rnd0 - 6 months
Pleasantly surprised to see Illumos ...anywhere.
By @aconz2 - 5 months
Very nice writeup and I appreciate the effort put into showing the process. I got nerd sniped yesterday playing around with how to find the isle_opt.rs filepath from the core file and didn't succeed but left some notes on scripting with lldb here https://gist.github.com/aconz2/aef366a7b198b8ac151df147fec32...
By @jrpelkonen - 6 months
This is a very interesting and thorough investigation. Highly recommended!
By @zifpanachr23 - 6 months
Love to see some dump reading content! It's an underappreciated skill and brings back some good (or bad depending on your perspective) memories!
By @shrubble - 6 months
Curious if truss, which was used, or DTrace would give you the syscalls in a nicer format for this application?
By @eqvinox - 6 months
$G, $r, $C ...

... mdb sure has a full-on "oldskool" CLI. I don't think that's a good thing, from a perspective of tool accessibility to developers...

By @tombert - 6 months
Extremely tangential, but what does something like Illumos/Solaris buy you in 2024 over something like FreeBSD or Linux?

This isn't some passive aggressive gotcha, I'm actually curious what people prefer about the Solaris distros nowadays. I know Zones and ZFS are cool, but FreeBSD supports Jails and ZFS out of the box, but maybe there are cool features I'm not aware of.