June 30th, 2024

The weirdest QNX bug I've ever encountered

The author traced high CPU usage on a QNX system to a 15-year-old bug in the 'ps' utility. Debugging revealed a race condition, leading to code modifications and a shift towards open-source solutions.

The blog post describes the author's encounter with a peculiar bug that appeared during firmware updates. The 'ps' utility on a QNX system entered an infinite loop and caused high CPU usage. Through debugging, the author traced the issue to a 15-year-old bug in the closed-source 'ps' binary, which was fixed by modifying the source code of the utility. The bug surfaced only because a change in boot timing exposed a race condition. To prevent the bug from recurring, the author decided to eliminate the use of 'ps' in non-interactive code, and the fix stopped that specific update problem from happening again. Lessons learned include the persistence of old bugs, the way subtle changes can make a dormant bug manifest, and the importance of loop termination criteria. The closed-source nature of the QNX ecosystem made debugging harder, which the author takes as an argument for open-source solutions and for robust coding practices.
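
The summary does not show the loop inside 'ps', so the following is only a generic sketch of the "loop termination criteria" lesson, not the author's fix. It assumes a caller that collects data from a source which may temporarily return nothing; the names `read_all`, `read_chunk`, `max_stalls`, and `NoProgressError` are hypothetical and chosen for illustration.

```python
from typing import Callable


class NoProgressError(RuntimeError):
    """Raised when the loop stops making progress (hypothetical name)."""


def read_all(read_chunk: Callable[[int], bytes], total: int, max_stalls: int = 3) -> bytes:
    """Collect `total` bytes by calling `read_chunk(remaining)` repeatedly.

    The loop has two exits: all bytes have arrived, or `max_stalls`
    consecutive calls returned nothing. Without the second exit, a source
    that never delivers would keep the loop spinning at full CPU.
    """
    buf = bytearray()
    stalls = 0
    while len(buf) < total:
        remaining = total - len(buf)
        chunk = read_chunk(remaining)
        if chunk:
            buf.extend(chunk[:remaining])  # never store more than requested
            stalls = 0                     # progress was made, reset the counter
        else:
            stalls += 1
            if stalls >= max_stalls:
                raise NoProgressError(
                    f"no data after {stalls} attempts ({len(buf)}/{total} bytes)"
                )
    return bytes(buf)
```

The design choice is that an empty result counts against a budget rather than being retried unconditionally. A real caller would probably also sleep or yield between attempts.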

Related

Spending 3 months investigating a 7-year old bug and fixing it in 1 line of code

A developer fixed a seven-year-old bug in an iPad accessory causing missed MIDI messages by optimizing a modulo operation. The bug's resolution improved the audio processor's efficiency significantly.

Vulnerability in Popular PC and Server Firmware

Eclypsium found a critical vulnerability (CVE-2024-0762) in Intel Core processors' Phoenix SecureCore UEFI firmware, potentially enabling privilege escalation and persistent attacks. Lenovo issued BIOS updates, emphasizing the significance of supply chain security.

I found an 8 years old bug in Xorg

An 8-year-old Xorg bug related to epoll misuse was found by a picom developer. The bug caused windows to disappear during server lock, traced to CloseDownClient events. Despite limited impact, the developer seeks alternative window tree updates, emphasizing testing and debugging tools.

The Dirty Pipe Vulnerability

The Dirty Pipe Vulnerability (CVE-2022-0847) in Linux kernel versions since 5.8 allowed unauthorized data overwriting in read-only files, fixed in versions 5.16.11, 5.15.25, and 5.10.102. Discovered through CRC errors in log files, it revealed systematic corruption linked to ZIP file headers due to a kernel bug in Linux 5.10. The bug's origin was pinpointed by replicating data transfer issues between processes using C programs, exposing the faulty commit. Changes in the pipe buffer code impacted data transfer efficiency, emphasizing the intricate nature of kernel development and software component interactions.

CVE-2021-4440: A Linux CNA Case Study

The Linux CNA mishandled CVE-2021-4440 in the 5.10 LTS kernel, causing information leakage and KASLR defeats. The issue affected Debian Bullseye and SUSE's 5.3.18 kernel, resolved in version 5.10.218.

12 comments
By @Animats - 5 months
"At this point, an intermezzo with some QNX history is in order. A bit more than a decade ago, the QNX source code was available to the public. Back then, QNX had a vibrant open source community. People would experiment with the kernel, write various useful utilities and help each other in forums. QNX even had a fully featured Desktop GUI, ran Firefox and was self-hosting, so you could develop for QNX right on QNX itself with full IDE and compiler support. It was beautiful."

"Then QNX was bought, source code access was revoked and the community largely withered away. Questions were increasingly asked via private support tickets directly to QNX, locked away from the public. QNX know-how becomes harder and harder to acquire, open source software for modern QNX releases is essentially non-existent and the driver situation is a catastrophe. The QNX kernel is the most beautiful and interesting kernel I have ever had the pleasure of working with, but it lies in the shackles of corporate ownership."

It's sad.

QNX was originally an independent company. During that period, anyone could get a free copy of QNX for personal use. It wasn't open source, but it was available. It's POSIX-compatible, so it was a supported target for GNU, Firefox, and Eclipse. We used QNX for our DARPA Grand Challenge vehicle in 2003-2005, and all that code was developed on desktop QNX.

Then QNX was acquired by Harman, the successor to Harman Kardon, which once made home audio components and pivoted to car audio. They were thinking car infotainment. Harman didn't really know what to do with an operating system, especially since the big market was systems for industrial control and point of sale. So eventually they opened the source.

Then QNX was acquired by BlackBerry, the early smartphone company. They closed the source, very suddenly. They even killed off the free version for personal and educational use. So all third-party open source development stopped. BlackBerry eventually shipped a phone that ran QNX, but they were not powerful enough as a company to keep a third phone standard going. So BlackBerry went to Android.

BlackBerry killed off the self-hosted desktop environment, and users now had to cross-compile from Windows.

And QNX became more of a niche product than ever.

By @arsome - 5 months
I actually ran across this issue myself, SIGQUIT'd the process, loaded it into a debugger and found the exact same problem. I can confirm the problem still exists on QNX 7.1. Fortunately we were moving off it, so I didn't think much more about it, but glad someone wrote it up.
By @nrclark - 5 months
QNX really needs to modernize if they want to survive. Their tooling ecosystem is stuck in 2008, and their kernel's performance is pretty low. IIRC, the kernel itself is also single-threaded, and can't take advantage of multiple CPUs (even if tasks can be SMP scheduled).

Their moat is supposedly their ASIL certification, but I see that value shrinking more and more over time for the following reasons:

1. If your product has a software-related failure, customers won't care about all of your certifications. Only the end product.

2. I'm not convinced that the QNX kernel is less buggy than the Linux kernel. Also, most failures don't tend to be kernel related.

By @bxparks - 5 months
I counted 417 comments on that page and scrolled through a few dozen. Every one of them was spam. That's pretty much the internet these days isn't it.

Other than that, the blog post was very interesting, I learned a bit of history of QNX, and concluded that I should avoid it.

By @the_panopticon - 5 months
I recall trying to debug a crash in QNX during the mid-90's. I was impressed by the svelte OS that could load from a 3.5" floppy. The failure scenario was coincident with one of my first tasks as a BIOS engineer and it entailed adding some custom error logging in System Management Mode (SMM). Luckily for me it turned out that I had forgotten to save/restore certain general purpose registers around my SMM logic. Fun times. SMM is pretty good at 'breaking' operating systems :)
By @ragnot - 5 months
Every developer I know (myself included) that has worked with QNX has a story about some insane bug that took significant effort to uncover. At this point, I would say the only reason one should look at QNX is for cost since it is pretty cheap. The low jitter on context-switching to the highest priority thread is a nice thing but the dev process is absolute garbage.
By @banish-m4 - 5 months
What I like about seL4 (although not a complete embedded dev platform) is its formal verification. QNX might have EAL4 in some configurations, but like most every other operating system on the planet, they haven't bothered to up their game by formally verifying it for correctness. This is a shame and entirely preventable with greater attention to testing and verification.
By @lfkdev - 5 months
What is going on with the comment section on this post?
By @tfrutuoso - 5 months
Great article, but the comments section of that blog is pure cancer. Jeez.
By @torginus - 5 months
Honestly this kinda shows me that no matter what degree of robustness we design into our systems (null safety, memory safety, thread safety etc.), some types of system-breaking bugs are unavoidable (such as DOSing the system by calling a system API function in an infinite loop), and are often impossible to distinguish from desired behavior.