June 27th, 2024

Debugging hardware is hard

Debugging hardware can be complex. A case study of communication problems between the STM32 MCU and the ESP32 WiFi chip in the Pickup device traced UART glitches to the auto-calibration feature of the STM32's MSI clock. Disabling the feature resolved the issue, which underlines the need to analyze hardware and software together.

Debugging hardware can be challenging, as illustrated by a detailed account of debugging a communication issue between two main chips in a device called Pickup. The problem involved the STM32 MCU and ESP32 WiFi radio chips. Despite meticulous checks and tests, the root cause was traced to an unexpected glitch related to the auto-calibration feature of the MSI clock on the STM32 chip. Enabling this feature caused glitches in the UART communication, impacting the receiver's performance. Disabling the auto-calibration resolved the issue, leading to stable and reliable communication between the chips. The debugging process involved extensive hardware and software analysis, including the use of tools like debug consoles, logic analyzers, and oscilloscopes. After implementing the necessary adjustments, the system achieved a high success rate in data transfers, demonstrating the importance of a comprehensive understanding of both software and hardware aspects in resolving complex technical issues.

11 comments
By @ChrisMarshallNY - 7 months
True, dat.

My first job, was as a bench tech for an RF/microwave manufacturer (defense).

I think that experience made me a much better debugger, when I switched to software.

A lot of my work was analog (Spectrum Analyzers, Oscilloscopes, Signal Generators, etc.), but we also did a lot of digital debugging.

One of the coolest tools we had, was what was called an "ICE" (In-Circuit Emulator).

You yanked out the processor, and plugged in the ICE, and it replaced the processor. You could view everything going on, inside the processor, registers, accumulators, the whole kitchen sink.

These days, it would be impossible to make an ICE for current processors (although AI-assisted design might give us some surprises).

A lot of issues came about, at the intersection of analog and digital. Ringing could wreak havoc on digital buses, and, in those days, we weren't as good at handling GHz frequencies, as we are now. Every solder burr would become a microwave broadcast antenna.

By @barbegal - 7 months
Someone else had a similar issue 6 years ago https://electronics.stackexchange.com/questions/334012/hsi-a...

It sounds like the sampling clock frequency is not what is expected (but that's quite easy to check based on the transmitted signal so I'm quite confused)

UARTs are nice if you are constrained on pins but SPI is always a safer bet where you don't have the necessary high accuracy clocks.
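
The clock-accuracy point can be made concrete with a rough timing budget. The sketch below is an idealized model, not a description of the STM32's receiver: it assumes the receiver resynchronizes on the start edge, samples every bit at its center, and sees perfectly sharp edges. The function names are made up for illustration. Real receivers have less margin than this because of oversampling quantization and finite rise times. SPI sidesteps the budget entirely because the clock travels with the data.

```python
import math


def max_clock_mismatch(data_bits=8, parity_bits=0):
    """Largest combined TX/RX clock mismatch (as a fraction) before an
    idealized receiver samples the stop bit outside its bit period."""
    # Start bit + data bits + optional parity + the first stop bit.
    frame_bits = 1 + data_bits + parity_bits + 1
    # The last sample is taken at the center of the stop bit.
    last_sample = frame_bits - 0.5
    # The accumulated drift at that point must stay under half a bit.
    return 0.5 / last_sample


def frame_survives(mismatch, data_bits=8, parity_bits=0):
    """Return True if every sample lands inside the bit it is meant to read.

    mismatch > 0 means the receiver clock is slow relative to the
    transmitter, mismatch < 0 means it is fast.
    """
    frame_bits = 1 + data_bits + parity_bits + 1
    for i in range(frame_bits):
        # Sample time measured in transmitter bit periods.
        sample_time = (i + 0.5) * (1 + mismatch)
        if math.floor(sample_time) != i:
            return False
    return True


if __name__ == "__main__":
    print(f"8N1 budget: {max_clock_mismatch() * 100:.2f}% total mismatch")
    for m in (0.02, 0.05, 0.06):
        print(f"{m * 100:.0f}% mismatch survives: {frame_survives(m)}")
```

For an 8N1 frame the model gives 0.5 / 9.5, a little over 5% of total mismatch shared between both ends, so a clock that briefly shifts by several percent can be enough to corrupt a frame.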

By @mattegan - 7 months
Great writeup! You show things are difficult to debug even when you have a board where all of your signals of interest are easily accessible.

It's a bad week when you have a bug that only is reproducible on form-factor hardware. Imagine something like a tiny earbud where the only pin accessible while the device is assembled is a single UART (bidirectional) pin? Ouch. Then, if you can manage to disassemble the earbud - the PCBs are usually so small most signals never appear on the outer layers - so you can't probe them even if you want to! Oh, and the issue is only appearing on one out of every few thousand earbuds? Better not break your failing unit while taking it apart! Good luck!

Just watched the Kickstarter video too -- looks like a great product y'all! Best of luck :)

By @userbinator - 7 months
"I couldn’t find anything in the docs or on the Internet about why this might happen with the autocal, and there’s nothing that details exactly how it works either."

A quick search found this document on the internal RC oscillator calibration, which explains all you need to know: http://nic.vajn.icu/PDF/STMicro/ARM/STM32F0/STM32F0xx_intern...

"It is recommended to stop all application activities before the calibration process, and to restart them after calling the calibration functions. Therefore, the application has to stop the communications, the ADC measurements and any other processes (except when using the ADC for the calibration, refer to Step 5. below). These processes normally use clock configurations that are different from those used in the calibration process. Otherwise, errors might be introduced in the application: errors while reading/sending frames, ADC reading errors since the sampling time has changed, and so on."

By @upwardbound - 7 months
Is it possible that the autocal is in the process of shifting the phase of the clock it controls to match the reference clock, and that the user is supposed to wait until that's done before running phase-sensitive operations? I'm unfamiliar with the chip in question, just making guesses or shots in the dark.
By @utensil4778 - 7 months
I've had enough bad experiences with ST to just avoid their MCUs altogether when possible. Last year I wasted several days trying to figure out why a certain ST chip wouldn't respond to the programmer I was using. The chip claimed to support flashing from UART, but the bootloader just never answered. It responded to other interfaces like ST's two wire interface, whatever it was called. Searching around online, it seems like this chip has a silicon bug or a bad ROM and ST is just happy to keep selling it as is for years. They haven't even published errata acknowledging that it's broken and doesn't answer on all interfaces.
By @lemonlime0x3C33 - 7 months
I enjoyed reading through your debugging process and as someone who has been trying to debug a custom board for a few weeks now I feel your pain. I still cannot say that my issue is hardware, firmware, or software -_-

I do have some UART devices that really seem to like when I just disconnect and reconnect the GND wire when they start to act up.

By @russdill - 7 months
It would be incredibly useful to output the user clock and monitor it with a logic analyzer.
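
One way to act on this suggestion, sketched under assumptions: the clock is routed to a pin, a logic analyzer exports the timestamps of its rising edges in seconds, and a short script flags periods that stray from the typical one. The function name, the 2% threshold, and the input format are all hypothetical, and the analyzer's own sample rate limits how small a deviation can be seen.

```python
from statistics import median


def find_period_glitches(rising_edges, tolerance=0.02):
    """Flag clock periods that deviate from the median period.

    rising_edges: timestamps in seconds, in ascending order.
    tolerance: allowed deviation as a fraction of the median period.
    Returns (nominal_period, [(index, period), ...]) for the outliers.
    """
    if len(rising_edges) < 3:
        raise ValueError("need at least three edges to compare periods")
    # Period i is the time between edge i and edge i + 1.
    periods = [b - a for a, b in zip(rising_edges, rising_edges[1:])]
    # The median is robust against a handful of glitched periods.
    nominal = median(periods)
    glitches = [
        (i, p) for i, p in enumerate(periods)
        if abs(p - nominal) > tolerance * nominal
    ]
    return nominal, glitches


if __name__ == "__main__":
    # Synthetic 1 MHz clock with one period stretched by 10%.
    edges = [0.0]
    for i in range(10):
        edges.append(edges[-1] + (1.1e-6 if i == 5 else 1.0e-6))
    nominal, glitches = find_period_glitches(edges)
    print(f"nominal frequency: {1 / nominal:.0f} Hz")
    for index, period in glitches:
        print(f"period {index} is off: {period * 1e6:.3f} us")
```

Lining up the flagged periods against the UART capture would show whether corrupted frames coincide with clock adjustments.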
By @skadamat - 7 months
Reminds me of Bret Victor's Seeing Spaces talk: https://www.youtube.com/watch?v=klTjiXjqHrQ
By @tonetegeatinst - 7 months
As someone who is interested in hardware engineering and reverse engineering....this truly is an understatement.

I have an old router I wanted to dump the firmware from as a learning experience to see if I could go from firmware dumping to finding a bug.

You got to constantly question if it's your hardware or the device that's faulty....you got to double check and make sure everything is connected.

Want to do low level silicon reverse engineering? Yeah that's not cheap as the tools, chemicals, and PPE are very expensive.

But I'd argue I shouldn't treat the hardware world as magic. And with how expensive these devices are, some of it being justified, other times a company just sets insane markup on items like nvidia, it's reasonable to want to learn how this stuff works and how one could do it themselves.

This hardware reverse engineering is also how we find/look for potential security issues or backdoors.

By @bsder - 7 months
What are those PCB holders he is using?