We Fixed the Glitch
Around the first of the year our contract manufacturer contacted us about an urgent problem with HackRF One production. They'd had to stop production because units coming off the line were failing at a high rate. This was quite a surprise because HackRF One is a mature product that has been manufactured regularly for a few years. I continued to find surprises as I went through the process of troubleshooting the problem, and I thought it made a fascinating tale that would be worth sharing.
The reported failure was an inability to write firmware to the flash memory on the board. Our attention quickly turned to the flash chip itself because it was the one thing that had changed since the previous production. The original flash chip in the design had been discontinued, so we had selected a replacement from the same manufacturer. Although we had been careful to test the new chip prior to production, it seemed that somehow the change had resulted in a high failure rate.
Had we overlooked a failure mode because we had tested too small a quantity of the new flash chips? Had the sample parts we tested been different than the parts used in the production? We quickly ordered parts from multiple sources and had our contract manufacturer send us some of their parts and new boards for testing. We began testing parts as soon as they arrived at our lab, but even after days of testing samples from various sources we were unable to reproduce the failures reported by the contract manufacturer.
At one point I thought I managed to reproduce the failure on one of the new boards, but it only happened about 3% of the time. This failure happened regardless of which flash chip was used, and it was easy to work around by retrying. If it happened on the production line it probably wouldn't even be noticed because it was indistinguishable from a simple user error such as a poor cable connection or a missed button press. Eventually I determined that this low probability failure mode was something that affected older boards as well. It is something we might be able to fix, but it is a low priority. It certainly wasn't the same failure mode that had stopped production.
It seemed that the new flash chip caused no problems, but then what could be causing the failures at the factory? We had them ship us more sample boards, specifically requesting boards that had exhibited failures. They had intended to send us those in the first shipment but accidentally left them out of the package. Because the flash chip was so strongly suspected at the time, we'd all thought that we'd be able to reproduce the failure with one or more of the many chips in that package anyway. One thing that had made it difficult for them to know which boards to ship was that any board that passed testing once would never fail again. For this reason they had deemed it more important to send us fresh, untested boards than boards that had failed and later passed.
When the second batch of boards from the contract manufacturer arrived, we immediately started testing them. We weren't able to reproduce the failure on the first board in the shipment. We weren't able to reproduce the failure on the second board either! Fortunately the next three boards exhibited the failure, and we were finally able to observe the problem in our lab. I isolated the failure to something that happened before the actual programming of the flash, so I was able to develop a test procedure that left the flash empty, avoiding the scenario in which a board that passed once would never fail again. Even after being able to reliably reproduce the failure, it took several days of troubleshooting to fully understand the problem. It was a frustrating process at the time, but the root cause turned out to be quite an interesting bug.
Although the initial symptom was a failure to program flash, the means of programming flash on a new board is actually a multi-step process. First the HackRF One is booted in Device Firmware Upgrade (DFU) mode. This is done by holding down the DFU button while powering on or resetting the board. In DFU mode, the HackRF's microcontroller executes a DFU bootloader function stored in ROM. The host computer speaks to the bootloader over USB and loads HackRF firmware into RAM. Then the bootloader executes this firmware which appears as a new USB device to the host. Finally the host uses a function of the firmware running in RAM to load another version of the firmware over USB and onto the flash chip.
I found that the failure happened at the step in which the DFU bootloader launches our firmware from RAM. The load of firmware over USB into RAM appeared to work, but then the DFU bootloader dropped off the bus and the USB host was unable to re-enumerate the device. I probed the board with a voltmeter and oscilloscope, but nearly everything looked as expected. There was a fairly significant voltage glitch on the microcontroller's power supply (VCC), but a probe of a known good board from a previous production revealed a similar glitch. I made a note of it as something to investigate in the future, but it didn't seem to be anything new.
I connected a Black Magic Probe and investigated the state of the microcontroller before and after the failure. Before the failure, the program counter pointed to the ROM region that contains the DFU bootloader. After the failure, the program counter still pointed to the ROM region, suggesting that control may never have passed to the HackRF firmware. I inspected RAM after the failure and found that our firmware was in the correct place but that the first 16 bytes had been replaced by 0xff. It made sense that the bootloader would not attempt to execute our code because it is supposed to perform an integrity check over the first few bytes. Since those bytes were corrupted, the bootloader should have refused to jump to our code.
I monitored the USB communication to see if the firmware image was corrupted before being delivered to the bootloader, but the first 16 bytes were correct in transit. Nothing looked out of the ordinary on USB except that there was no indication that the HackRF firmware had started up. After the bootloader accepted the firmware image, it dropped off the bus, and then the bus was silent.
As my testing progressed, I began to notice a curious thing, and our contract manufacturer reported the very same observation: The RF LED on the board sometimes was dimly illuminated in DFU mode and sometimes was completely off. Whenever it was off, the failure would occur; whenever it was dimly on, the board would pass testing. This inconsistency in the state of the RF LED is something that we had observed for years. I had never given it much thought but assumed it may have been caused by some known bugs in reset functions of the microcontroller. Suddenly this behavior was very interesting because it was strongly correlated with the new failure! What causes the RF LED to sometimes be dimly on at boot time? What causes the new failure? Could they be caused by the same thing?
I took a look at the schematic which reminded me that the RF LED is not connected to a General-Purpose Input/Output (GPIO) pin of the microcontroller. Instead it directly indicates the state of the power supply (VAA) for the RF section of the board. When VAA is low (below about 1.5 Volts), the RF LED is off. When VAA is at or near 3.3 Volts (the same voltage as VCC), the RF LED should be fully on. If the RF LED is dimly on, VAA must be at approximately 2 Volts, the forward voltage of the LED. This isn't enough voltage to power the chips in the RF section, but it is enough to dimly illuminate the LED.
VAA is derived from VCC but is controlled by a MOSFET which switches VAA on and off. At boot time, the MOSFET should be switched off, but somehow some current can leak into VAA. I wasn't sure if this leakage was due to the state of the GPIO signal that controls the MOSFET (!VAA_ENABLE) or if it could be from one of several digital control signals that extend from the VCC power domain into the VAA power domain. I probed all of those signals on both a good board and a failing board but didn't find any significant differences. It wasn't clear why VAA was sometimes partially charged at start-up, and I couldn't find any indication of what might be different between a good board and a bad board.
One thing that was clear was that the RF LED was always dimly illuminated immediately after a failure. If I reset a board into DFU mode using the reset button after a failure, the RF LED would remain dimly lit, and the failure would be avoided on the second attempt. If I reset a board into DFU mode by removing and restoring power instead of using the reset button, the RF LED state became unpredictable. The procedural workaround of retrying with the reset button would have been sufficient to proceed with manufacturing except that we were nervous about shipping boards that would give end users trouble if they need to recover from a load of faulty firmware. It might be a support nightmare to have units in the field that do not provide a reliable means of restoring firmware. We certainly wanted to at least understand the root cause of the problem before agreeing to ship units that would require users to follow a procedural workaround.
Meanwhile I had removed a large number of components from one of the failing boards. I had started this process after determining that the flash chip was not causing the problem. In order to prove this without a doubt, I entirely removed the flash chip from a failing board and was still able to reproduce the failure. I had continued removing components that seemed unrelated to the failure just to prove to myself that they were not involved. When investigating the correlation with VAA, I tried removing the MOSFET (Q3) and found that the failure did not occur when Q3 was absent! I also found that removal of the ferrite filter (FB2) on VAA or the capacitor (C105) would prevent the failure. Whenever any of these three components was removed, the failure could be avoided. I tried cutting the trace (P36) that connects the VAA MOSFET and filter to the rest of VAA. Even without any connection to the load, I could prevent the failure by removing any of those three components and induce the failure by restoring all three. Perhaps the charging of VAA was not only correlated with the failure but was somehow the cause of the failure!
This prompted me to spend some time investigating VAA, VCC, and !VAA_ENABLE more thoroughly. I wanted to fully understand why VAA was sometimes partially charged and why the failure only happened when it was uncharged. I used an oscilloscope to probe all three signals simultaneously, and I tried triggering on changes to any of the three. Before long I found that triggering on !VAA_ENABLE was most fruitful. It turned out that !VAA_ENABLE was being pulled low very briefly at the approximate time of the failure. This signal was meant to remain high until the HackRF firmware pulls it low to switch on VAA. Why was the DFU bootloader toggling this pin before executing our firmware?
Had something changed in the DFU bootloader ROM? I used the Black Magic Probe to dump the ROM from one of the new microcontrollers, but it was the same as the ROM on older ones. I even swapped the microcontrollers of a good board and a bad board; the bad board continued to fail even with a known good microcontroller, and the good board never exhibited a problem with the new microcontroller installed. I investigated the behavior of !VAA_ENABLE on a good board and found that a similar glitch happened prior to the point in time at which the HackRF firmware pulls it low. I didn't understand what was different between a good board and a bad board, but it seemed that this behavior of !VAA_ENABLE was somehow responsible for the failure.
The transient change in !VAA_ENABLE caused a small rise in VAA and a brief, very small dip in VCC. It didn't look like this dip would be enough to cause a problem on the microcontroller, but, on the assumption that it might, I experimented with ways to avoid affecting VCC as much. I found that a reliable hardware workaround was to install a 1 kΩ resistor between VAA and VCC. This caused VAA to always be partially charged prior to !VAA_ENABLE being toggled, and it prevented the failure. It wasn't a very attractive workaround because there isn't a good place to install the resistor without changing the layout of the board, but we were able to confirm that it was effective on all boards that suffered from the failure.
Trying to determine why the DFU bootloader might toggle !VAA_ENABLE, I looked at the documented functions available on the microcontroller's pin that is used for that signal. Its default function is GPIO, but it has a secondary function as a part of an external memory interface. Was it possible that the DFU bootloader was activating the external memory interface when writing the firmware to internal RAM? Had I made a terrible error when I selected that pin years ago, unaware of this bootloader behavior?
Unfortunately the DFU bootloader is a ROM function provided by the microcontroller vendor, so we don't have source code for it. I did some cursory reverse engineering of the ROM but couldn't find any indication that it possesses the capability of activating the external memory interface. I tried using the Black Magic Probe to single step through instructions, but it wasn't fast enough to avoid USB timeouts while single stepping. I set a watchpoint on a register that should be set when powering up the external memory interface, but it never seemed to happen. Then I tried setting a watchpoint on the register that sets the pin function, and suddenly something very surprising was revealed to me. The first time the pin function was set was in my own code executing from RAM. The bootloader was actually executing my firmware even when the failure occurred!
After a brief moment of disbelief I realized what was going on. The reason I had thought that my firmware never ran was that the program counter pointed to ROM both before and after the failure, but that wasn't because my code never executed. A ROM function was running after the failure because the microcontroller was being reset during the failure. The failure was occurring during execution of my own code and was likely something I could fix in software! Part of the reason I had misinterpreted this behavior was that I had been thinking about the bootloader as "the DFU bootloader", but it is actually a unified bootloader that supports several different boot methods. Even when booting to flash memory, the default boot option for HackRF One, the first code executed by the microcontroller is the bootloader in ROM which later passes control to the firmware in flash. You don't hold down the DFU button to cause the bootloader to execute, you hold down the button to instruct the bootloader to load code from USB DFU instead of flash.
Suddenly I understood that the memory corruption was something that happened as an effect of the failure; it wasn't part of the cause. I also understood why the failure did not seem to occur after a board passed testing once. During the test, firmware is written to flash. If the failure occurs at any time thereafter, the microcontroller resets and boots from flash, behaving similarly to how it would behave if it had correctly executed code that had been loaded via USB into RAM. The reason the board was stuck in a ROM function after a failure on a board with empty flash was simply that the bootloader was unable to detect valid firmware in flash after reset.
It seemed clear that the microcontroller must be experiencing a reset due to a voltage glitch on VCC, but the glitch that I had observed on failing boards seemed too small to have caused a reset. When I realized this, I took some more measurements of VCC and zoomed out to a wider view on the oscilloscope. There was a second glitch! The second glitch in VCC was much bigger than the first. It was also caused by !VAA_ENABLE being pulled low, but this time it was held low long enough to have a much larger effect on VCC. In fact, this was the same glitch that I had previously observed on known good boards. I then determined that the first glitch was caused by a minor bug in the way our firmware configured the GPIO pin. The second glitch was caused by the deliberate activation of !VAA_ENABLE.
When a good board starts up, it pulls !VAA_ENABLE low to activate the MOSFET that switches on VAA. At this time, quite a bit of current gets dumped into the capacitor (C105) in a short amount of time. This is a perfect recipe for causing a brief drop in VCC. I knew about this potential problem when I designed the circuit, but I guess I didn't carefully measure it at the time. It never seemed to cause a problem on my prototypes.
When a bad board starts up, the exact same thing happens except the voltage drop of VCC is just a little bit deeper. This causes a microcontroller reset, resulting in !VAA_ENABLE being pulled high again. During this brief glitch VAA becomes partially charged, which is why the RF LED is dimly lit after a failure. If VAA is partially charged before !VAA_ENABLE is pulled low, less current is required to fully charge it, so the voltage glitch on VCC isn't deep enough to cause a reset.
At this point I figured out that the reason the state of the RF LED is unpredictable after power is applied is that it depends on how long power has been removed from the board. If you unplug a board with VAA at least partially charged but then plug it back in within two seconds, VAA will still be partially charged. If you leave it disconnected from power for at least five seconds, VAA will be thoroughly discharged and the RF LED will be off after plugging it back in.
This sort of voltage glitch is something hardware hackers introduce at times as a fault injection attack to cause microcontrollers to misbehave in useful ways. In this case, my microcontroller was glitching itself, which was not a good thing! Fortunately I was able to fix the problem by rapidly toggling !VAA_ENABLE many times, causing VAA to charge more slowly and avoiding the VCC glitch.
I'm still not entirely sure why boards from the new production seem to be more sensitive to this failure than older boards, but I have a guess. My guess is that a certain percentage of units have always suffered from this problem but that they have gone undetected. The people programming the boards in previous productions may have figured out on their own that they could save time by using the reset button instead of unplugging a board and plugging it back in to try again. If they did so, they would have had a very high success rate on second attempts even when programming failed the first time. If a new employee or two were doing the programming this time, they may have followed their instructions more carefully, removing failing boards from power before re-testing them.
Even if my guess is wrong, it seems that my design was always very close to having this problem. Known good boards suffered from less of a glitch, but they still experienced a glitch that was close to the threshold that would cause a reset. It is entirely possible that subtle changes in the characteristics of capacitors or other components on the board could cause this glitch to be greater or smaller from one batch to the next.
Once a HackRF One has had its flash programmed, the problem is very likely to go undetected forever. It turns out that this glitch can happen even when a board is booted from flash, not just when starting it up in DFU mode. When starting from flash, however, a glitch-induced reset results in another boot from flash, this time with VAA charged up a little bit more. After one or two resets that happen in the blink of an eye, it starts up normally without a glitch. Unless you know what to look for, it is quite unlikely that you would ever detect the fault.
Because of this and the fact that we didn't have a way to distinguish between firmware running from flash and RAM, the failure was difficult for us to reproduce and observe reliably before we understood it. Another thing that complicated troubleshooting was that I was very focused on looking for something that had changed since the previous production. It turned out that the voltage glitch was only subtly worse than it was on the older boards I tested, so I overlooked it as a possible cause. I don't know that it was necessarily wrong to have this focus, but I might have found the root cause faster had I concentrated more on understanding the problem and less on trying to find things that had changed.
In the end I found that it was my own hardware design that caused the problem. It was another example of something Jared Boone often says. I call it ShareBrained's Razor: "If your project is broken, it is probably your fault.". It isn't your compiler or your components or your tools; it is something you did yourself.
Thank you to everyone who helped with this troubleshooting process, especially the entire GSG team, Etonnet, and Kate Temkin. Also thank you to the pioneers of antibiotics without which I would have had a significantly more difficult recovery from the bronchitis that afflicted me during this effort!