Debug Forensics: Troubleshooting electronic design issues before production
For electronic and software design houses, the cost of non-conformance can be significant if a pre-production bug isn’t resolved.
The consequences of poor or incorrect bug diagnosis are substantial. When the wrong root cause is identified, the result is waste, inefficiency and dissatisfaction: lost engineering time, working components replaced unnecessarily, frustrated engineers, and customers unhappy about project delays and bugs that aren’t fixed.
All of these issues can be addressed with Debug Forensics. ByteSnap’s Director, Dunstan Power, explains why this CSI-style approach to troubleshooting electronic design issues is the best way to diagnose a bug, rather than jumping straight in and guessing what the problem is.
Debug Forensics means adopting a meticulous approach to investigating bugs: identify the root cause first, then act on that information to fix it. A rigorous, repeatable method of debugging achieves better results than chaotically jumping to the wrong conclusions. To make life easier, design teams should consider using free bug tracking tools such as Bugzilla or The Bug Genie, as they force you into a rigorous, forensic approach to debugging.
Reproduce the problem the correct way and use automation
Often the hardest step in debugging is reproducing the problem, especially for intermittent bugs. Bug reports are often poor, with scant detail, yet environmental factors are critical. We design and test devices in offices or labs, but a product could be used in the middle of the rainforest, out at sea or in a car park.
The attached equipment is important. When we work in offices, we attach the unit under test to a specific set of devices, but the end user may be using a completely different set of equipment. Sadly, there is also user error, so a checklist is key. Push the user to write down, one step at a time, exactly how to reproduce the problem they reported, so that you know what happened and can recreate it, checking off each stage and result as you go.
Something that we like to do is automate the test setup. We often use an Arduino to power boards on and off, because a common bug is a failure to boot, perhaps only one time in a hundred. The Arduino cycles the power repeatedly over a selected period, so we can use it to reproduce the problem with the attached equipment and to help identify the bug.
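As a rough illustration of that kind of automation, here is a minimal Python sketch of a power-cycle test loop. It is not ByteSnap's actual rig: the function name power_cycle_test, the two callbacks set_power and booted_ok, and the SimulatedBoard class are all hypothetical. In a real setup the callbacks would talk to whatever is switching the supply (for example an Arduino driving a relay) and to whatever tells you the board has booted.

```python
import time
from typing import Callable, List


def power_cycle_test(
    set_power: Callable[[bool], None],
    booted_ok: Callable[[], bool],
    cycles: int,
    on_seconds: float = 0.0,
    off_seconds: float = 0.0,
) -> List[int]:
    """Power the unit on and off `cycles` times.

    Returns the (1-based) cycle numbers on which the unit failed to boot.
    `set_power` and `booted_ok` are hypothetical hooks supplied by the caller.
    """
    failures = []
    for cycle in range(1, cycles + 1):
        set_power(True)
        time.sleep(on_seconds)   # give the board time to boot
        if not booted_ok():
            failures.append(cycle)
        set_power(False)
        time.sleep(off_seconds)  # let the supply rails discharge
    return failures


class SimulatedBoard:
    """Stand-in for real hardware: fails to boot on every 100th power-up."""

    def __init__(self) -> None:
        self.power_ups = 0

    def set_power(self, on: bool) -> None:
        if on:
            self.power_ups += 1

    def booted_ok(self) -> bool:
        return self.power_ups % 100 != 0


board = SimulatedBoard()
failed_cycles = power_cycle_test(board.set_power, board.booted_ok, cycles=300)
print(failed_cycles)  # [100, 200, 300]
```

Recording which cycles failed, rather than just a count, keeps the evidence: you can line the failing cycles up against logs, temperature or anything else you were capturing at the time.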
Gathering evidence
We now have a system that is reproducing the bug, hopefully reliably and not just intermittently. Resist the temptation to fix it on sight. For example, if the bug only happens once every five days, you may see it and immediately try to fix it, and so miss the chance to gather evidence of exactly what happened when it went wrong.
It’s best practice to look at the failure dispassionately, then gather and record the data so that you can replay the steps consistently. Go through the bugs with team members: even if they are not experts, their questions can sometimes lead you to the light bulb moment that’s needed.
A lot of bugs, in my opinion and experience, turn out in retrospect to be due to something simple. Often the cause of the bug you're working on is as simple as an electrostatic discharge, or some bizarre EMC or radio issue. Human error is possible too: perhaps a chip has been pinned out incorrectly in the schematic, or the soldering was bad.
In a complex embedded system, the unit under test has electronics with firmware running on them, and might plug into a display system and into another system running cloud software, for instance. So, if it is not clear where the bug is, it's worthwhile breaking the problem down into its constituent parts.
Applying the fix
The key to applying a fix is being able to replicate the failure. That may sound obvious, but if you had to turn the unit on a thousand times to recreate the failure, turning it on once or twice after the fix isn't enough to show the problem has gone. Suppose the system fails to boot five times out of a hundred. If you apply a fix, turn it on and it boots, there's a 95% chance it would have booted anyway, fix or no fix. It may just have been a normal boot.
If you turn it on twice and it boots twice, the chance that an unfixed system would have done the same is still about 90%; it could just be luck that it worked both times. After a third successful boot, it is about 86%, and so on. To be 99% confident that we have fixed it, we want the 95% success rate we had before (one minus the 5% failure rate), raised to the power of X, to be less than 1%.
That's where logarithms come in. The log to base 0.95 of 0.01 (that is, 1%) is 89.78. So if I turn this system on and off 90 times and it doesn't fail once, I'm 99% confident that I've fixed the issue. That applies to one specific kind of failure mechanism, but there’s other maths relevant to debugging, such as mean time between failures (MTBF), which is more commonly used in production, where we look at the weakest link in a product and how likely it is to fail in, say, 10 years.
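The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes every power-up is an independent trial with the same failure rate as before the fix; the function names cycles_needed and chance_of_clean_run are illustrative, not taken from any tool.

```python
import math


def chance_of_clean_run(failure_rate: float, cycles: int) -> float:
    """Probability that an UNFIXED unit survives `cycles` power-ups in a row."""
    return (1.0 - failure_rate) ** cycles


def cycles_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of clean power-ups X with (1 - failure_rate)**X < 1 - confidence."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Log to base (1 - failure_rate) of (1 - confidence), via change of base.
    exact = math.log(1.0 - confidence) / math.log(1.0 - failure_rate)
    return math.ceil(exact)


print(round(chance_of_clean_run(0.05, 1), 4))  # 0.95
print(round(chance_of_clean_run(0.05, 2), 4))  # 0.9025
print(round(chance_of_clean_run(0.05, 3), 4))  # 0.8574
print(cycles_needed(0.05, 0.99))               # 90
```

The rarer the bug, the more clean cycles you need: under the same assumptions, a failure that shows up one time in a hundred needs 459 consecutive clean power-ups for the same 99% confidence, which is a good argument for automating the power cycling.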
If there is a random bug that causes your system to fail at a random time, then the longer you leave it on without a failure, the more confident you can be that you have fixed it. But the relationship is exponential, so you can never be 100% sure you fixed it just by leaving it on.
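To put numbers on that, the sketch below assumes the unfixed bug strikes at a constant average rate, so that the time to the next failure follows an exponential distribution. That model is an assumption made here for illustration, and the function names are hypothetical. The 120-hour figure simply reuses the once-every-five-days example from earlier.

```python
import math


def chance_bug_stays_hidden(hours_run: float, mean_hours_between_failures: float) -> float:
    """Probability an UNFIXED unit runs for `hours_run` without failing (exponential model)."""
    return math.exp(-hours_run / mean_hours_between_failures)


def soak_hours_needed(mean_hours_between_failures: float, confidence: float) -> float:
    """Failure-free running time needed to reach the given confidence that the bug is gone."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_hours_between_failures * math.log(1.0 - confidence)


five_days = 120.0  # hours; the bug used to appear about once every five days
print(round(chance_bug_stays_hidden(five_days, five_days), 3))  # 0.368
print(round(soak_hours_needed(five_days, 0.99), 1))             # 552.6
```

So one clean five-day run still leaves roughly a 37% chance that the bug is merely hiding, and about 23 days of failure-free running are needed for 99% confidence. Because the required time grows without limit as the confidence approaches 100%, the function rejects a confidence of exactly 1.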
Bugs do not EVER just ‘go away’
Sometimes bugs appear and then they magically disappear. The problem, however, is that the same bug can then recur at any time.
Remember – if you see a bug, unless you've actually acted yourself to fix it, that bug is still there and could come back to bite you at a more inopportune time.
Using a bug tracking system, whether a free tool like The Bug Genie or Bugzilla or something else, is important because it lets you log every bug. Even if you only see a bug once, don't forget about it, because you'll quite likely need to come back to it at some time in the future.
We have seen many benefits from deploying Debug Forensics. In addition to improving troubleshooting efficiency and seeing reductions in waste, this method has delivered positive results in the areas of reduced time to market, improved customer satisfaction, better project profitability and enhanced competitiveness.
By using our Debug Forensics method, we take a structured and meticulous approach to troubleshooting both hardware and software designs. This has established a consistent protocol that everyone follows, from our most senior engineers through to the graduate members of the team.