June 25th, 2024

How GCC and Clang handle statically known undefined behaviour

A discussion of how compilers handle statically known undefined behavior (UB) in C code. Compilers such as gcc and clang optimize on the assumption that UB never occurs, so code that triggers it may be compiled into a crash or removed altogether. Avoiding UB is crucial for program predictability and security. gcc and clang differ in where they insert crashes and in which warnings they emit. LLVM's 'poison' values let optimizations proceed even in the presence of UB, reflecting a different compiler philosophy. How a compiler should respond to UB is ultimately a judgment call, shaped by compiler developers and user requirements.

The discussion on how compilers handle statically known undefined behavior (UB) in C code reveals insights into compiler optimizations and behaviors. Compilers like gcc and clang make assumptions based on undefined language semantics to optimize programs. When faced with UB, compilers may take different approaches, such as crashing the program or ignoring problematic code. The presence of UB can lead to unpredictable program behavior and security vulnerabilities, emphasizing the importance of avoiding UB. Compilers may choose to optimize away code exhibiting UB if it is not used, highlighting the role of dead code elimination. The handling of UB varies between compilers like gcc and clang, with differences in crash behavior and warning generation. The use of 'poison' values in LLVM enables optimizations even with UB, showcasing different compiler philosophies. Ultimately, the choice between crashing or continuing compilation in the face of UB is subjective and depends on compiler developers' preferences and user needs.

Related

My experience crafting an interpreter with Rust (2021)

Manuel Cerón details creating an interpreter with Rust, transitioning from Clojure. Leveraging Rust's safety features, he faced challenges with closures and classes, optimizing code for performance while balancing safety.

Own Constant Folder in C/C++

Neil Henning discusses precision issues in clang when using the sqrtps intrinsic with -ffast-math, suggesting inline assembly for instruction selection. He introduces a workaround using __builtin_constant_p for constant folding optimization, enhancing code efficiency.

Memory Model: The Hard Bits

This chapter explores OCaml's memory model, emphasizing relaxed memory aspects, compiler optimizations, weakly consistent memory, and DRF-SC guarantee. It clarifies data races, memory classifications, and simplifies reasoning for programmers. Examples highlight data race scenarios and atomicity.

Optimizing the Roc parser/compiler with data-oriented design

The blog post explores optimizing a parser/compiler with data-oriented design (DoD), comparing Array of Structs and Struct of Arrays for improved performance through memory efficiency and cache utilization. Restructuring data in the Roc compiler showcases enhanced efficiency and performance gains.

Getting 100% code coverage doesn't eliminate bugs

Achieving 100% code coverage doesn't ensure bug-free software. A blog post illustrates this with a critical bug missed despite full coverage, leading to a rocket explosion. It suggests alternative approaches and a 20% coverage minimum.

11 comments
By @quelsolaar - 7 months
I'm a member of WG14 and the Undefined Behaviour Study Group. We are producing a technical report about UB, and one thing we specifically point out is that statically known UB should be treated as an error, not as "this can't be called", aka "unreachable". From what I have seen, LLVM sometimes issues poison when it should just issue an error. The implications can be quite severe. Instead of assuming that UB is a programmer error, it assumes it is the programmer's intention that the code will never run, and therefore it can use this information to make assumptions about branching behaviour.
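
For a divisor that is known at compile time, a programmer can already get the "error, not unreachable" treatment with a static assertion. The sketch below is only an illustration of that idea: `DIVISOR` and `scale` are hypothetical names, not taken from the article or from the WG14 report.

```c
#include <assert.h>  /* static_assert (C11) */

/* Hypothetical compile-time constant used as a divisor. */
#define DIVISOR 4

/* If DIVISOR is ever changed to 0, compilation fails here instead of
   leaving a division by zero for the optimizer to reason about. */
static_assert(DIVISOR != 0, "DIVISOR must be non-zero");

/* Integer division by a divisor statically known to be non-zero. */
int scale(int x) {
    return x / DIVISOR;
}
```

This covers only the case where the programmer anticipates the problem; it does not change what the compiler does with UB it discovers on its own.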
By @Ontonator - 7 months
> Somewhat expectedly, gcc remains faithful to its crash approach, though note that it only inserts the crash when it compiles the division-by-zero, not earlier, like at the beginning of the function. […] The mere existence of UB in the program means all bets are off and the compiler could chose to crash the function immediatley upon entering it.

GCC leaves the print there because it must. While undefined behaviour famously can time travel, that’s only if it would actually have occurred in the first place. If the print blocks indefinitely then that division will never execute, and GCC must compile a binary that behaves correctly in that case.
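
A reduced sketch of this argument, using a hypothetical function `report_then_divide` rather than the article's code: the output call is an observable side effect that may block or never return, so the compiler has to keep it ahead of the division even when the divisor might be zero.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical example: an observable side effect followed by a division
   that is undefined when den == 0. */
int report_then_divide(int num, int den) {
    /* If stdout is, say, a pipe that nobody reads, this call may block
       forever, in which case the division below is never reached. */
    puts("about to divide");
    return num / den;
}
```

With a non-zero divisor the behaviour is fully defined: `report_then_divide(6, 3)` prints the message and returns 2.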

By @phoe-krk - 7 months
Needs the "how" in the title restored; see https://news.ycombinator.com/item?id=40728590.
By @ainar-g - 7 months
> While the compiled programs stayed the same, we no longer get a warning (even with -Wall), even though both compilers can easily work out statically (e.g. via constant folding) that a division by zero occurs [4].

Are there any reasons why that is so? Do compilers not reuse the information they gather during compilation for diagnostics? Or is it a deliberate decision?

By @juliangmp - 7 months
The way the compilers abuse undefined behavior to "optimize" away sanity checks we explicitly put in is just downright insane to me
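
A common instance of this complaint, sketched here with hypothetical function names rather than code from the article, is an overflow check that itself relies on signed overflow. Because signed overflow is undefined, an optimizer may simplify the first check to `b < 0`; the second version tests the bounds before adding and stays within defined behaviour.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Sanity check that depends on wrap-around. Signed overflow is UB, so the
   compiler may assume a + b never wraps and rewrite this as `b < 0`. */
bool add_overflows_naive(int a, int b) {
    return a + b < a;
}

/* Check performed before the addition; no signed overflow can occur here. */
bool add_overflows(int a, int b) {
    if (b > 0) {
        return a > INT_MAX - b;   /* a + b would exceed INT_MAX */
    }
    return a < INT_MIN - b;       /* a + b would fall below INT_MIN */
}
```

For example, `add_overflows(INT_MAX, 1)` and `add_overflows(INT_MIN, -1)` are both true, while `add_overflows(1, 2)` is false.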
By @chrisjj - 7 months
> clang used the fact that division by zero is undefined and thus argc must not be zero to entirely remove the condition if (argc == 0), knowing this case can never happen [2].

Serious logic error, surely. "Can never happen" does not follow.
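
The pattern described in the quote can be sketched as follows. The function names are hypothetical and the code is a reduced illustration, not the article's program. When the division comes first, the compiler may reason that `count` cannot be zero on any execution with defined behaviour and drop the later test; testing first keeps the guard meaningful.

```c
#include <assert.h>

/* Division first: if count == 0 the behaviour is already undefined at the
   division, so the compiler is allowed to treat the check as dead code. */
int average_check_late(int total, int count) {
    int avg = total / count;
    if (count == 0) {
        return -1;  /* may be removed by the optimizer */
    }
    return avg;
}

/* Check first: the division only runs when count is non-zero.
   (INT_MIN / -1 would still overflow; it is ignored here for brevity.) */
int average_check_early(int total, int count) {
    if (count == 0) {
        return -1;
    }
    return total / count;
}
```

In other words, "can never happen" is shorthand for "cannot happen in any execution the standard assigns a meaning to", which is the assumption the optimizer is permitted to make.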

By @dathinab - 7 months
> language specification doesn't define what should happen during execution.

This is subtly misleading: unspecified and undefined behavior are not quite the same.

UB means (simplified) that it is specified that the compiler is allowed to do whatever it wants. More concretely, in most cases the compiler is allowed to assume that a specific thing is impossible when optimizing, to the point where, if it does happen, pretty much anything can happen, including things like an int seeming to have two different values at the same time (which might still be one of the more harmless WTFs that can happen).
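
The difference can be made concrete with a small sketch (all names are hypothetical). The order in which function arguments are evaluated is unspecified, so the call below has two permitted results and the program remains valid either way. Undefined behaviour, such as signed overflow, places no requirement on the result at all.

```c
#include <assert.h>

static int counter = 0;

/* Returns 1, 2, 3, ... on successive calls. */
static int next_value(void) {
    return ++counter;
}

static int subtract(int a, int b) {
    return a - b;
}

/* Unspecified behaviour: the two next_value() calls may run in either
   order, so the result is 1 - 2 or 2 - 1; both outcomes are conforming. */
int unspecified_order(void) {
    counter = 0;
    int result = subtract(next_value(), next_value());
    assert(result == -1 || result == 1);
    return result;
}
```

Which of the two values comes back depends on the compiler, but it is always one of them, and that bounded choice is what separates unspecified from undefined behaviour.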

By @uecker - 7 months
Note that this article is incorrect about "The mere existence of UB in the program means all bets are off and the compiler could chose to crash the function immediatley upon entering it.". In C this is not true.
By @shultays - 7 months

  int i = 0;
  // nullptr dereference
  int ub = *(int*)i;
It is implementation defined I think, not UB. 0 could be a valid address
By @pif - 7 months
This post has no point.

Developers must avoid UB in their code, at all costs. Wondering what a specific compiler will do with your UB code is useless: as soon as you realise you have UB, go and fix it!

By @rishav_sharan - 7 months
To me, the very fact that one of the most used programming languages ever has undefined behavior boggles the mind. IMO it is one of the worst parts of C (other than using pointers for arrays, its string implementation and 0-based indices).

I really don't understand why the C standards body can't just define the intended failure behavior for specific UB cases and then have compiler developers adopt that spec. There should be no impact on backwards compatibility because:

a. this will be a new C version
b. no program should ever be built around UB

But UB still exists in 2024 and will likely do so until I am too old to whine on the internet about it.