How to Troubleshoot Intermittent Problems
Posted: Tuesday, September 26, 2006
by Robert Verstandig
Perth Copywriting Services
One of the biggest challenges facing the service and support engineer is locating and resolving the Intermittent System Problem. By its nature, this type of problem does not usually present itself in a consistent manner. This means that you can't easily track it down using the Split and Test or Top Down problem location techniques because these methods require the problem to be reproducible.
There are a few issues encountered when trying to identify and repair an intermittent fault:
- Where to start looking
- What to look for
- What procedures to follow
Fortunately, many intermittent problems are reproducible!
Once you know the conditions that trigger the problem, you can locate and resolve it the usual way.
Types of Intermittent Problems
There are two types of Intermittent Problems: Reproducible and Random.
Reproducible Intermittent Problems
The Reproducible Intermittent Problem is always triggered by the same set of conditions and can therefore be reproduced once the conditions are found. This type of problem only appears to be random because the triggering conditions don't occur regularly.
For example, a software application may fail seemingly at random if it has a buffer overflow issue. Perhaps 99% of the data that gets stored to the buffer falls within tolerance but a bug in the code allows the data to exceed the buffer's storage capacity so that 1 time in 100 the buffer overflows, crashing the application. You can see that although this appears to occur randomly it's indeed reproducible under the right conditions. It's also an intermittent problem in that it doesn't occur every time data is stored to the buffer.
You could safely say that all intermittent software problems are reproducible. You just need to find the conditions that trigger them…
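The buffer example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `BUFFER_CAPACITY`, `store`, `random_record` and `find_failing_sizes` are hypothetical, and because Python cannot overrun memory the way a C program can, the overflow is simulated by raising an exception. The point it tries to show is that the failure looks random in the field, yet every failing input has the same property.

```python
import random

BUFFER_CAPACITY = 100  # hypothetical fixed buffer size, in bytes


def store(record: bytes) -> bytearray:
    """Copy a record into a fixed-size buffer.

    Python cannot overrun memory, so an oversized write is simulated
    by raising OverflowError in place of the crash.
    """
    buffer = bytearray(BUFFER_CAPACITY)
    if len(record) > BUFFER_CAPACITY:
        raise OverflowError(f"record of {len(record)} bytes overflows buffer")
    buffer[: len(record)] = record
    return buffer


def random_record(rng: random.Random) -> bytes:
    """Simulated field data: roughly 1 record in 100 is oversized."""
    size = 150 if rng.random() < 0.01 else rng.randint(1, BUFFER_CAPACITY)
    return bytes(size)


def find_failing_sizes(trials: int, seed: int) -> list[int]:
    """Feed many records to store() and collect the sizes that fail."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        record = random_record(rng)
        try:
            store(record)
        except OverflowError:
            failures.append(len(record))
    return failures


if __name__ == "__main__":
    # Every size printed is larger than BUFFER_CAPACITY: the trigger condition.
    print(find_failing_sizes(trials=1000, seed=1))
```

Once the list of failing sizes shows the pattern, calling `store` with any record larger than the buffer reproduces the failure every time.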
Random Intermittent Problems
Random Intermittent Problems are by far the most difficult to locate as they do occur completely at random. They are generally hardware faults where a component within the hardware device is starting to fail. This type of problem makes up only a very small percentage of all intermittent problems because, once the component has completely failed, the related problem usually becomes reproducible and easy to find.
Solution Strategies
The main problem you face is how to confirm and locate the fault. From the problem symptoms you will have an idea of where the problem is, but unfortunately there are no easy shortcuts you can take when searching for it. However, there are things you can do to make the search less frustrating.
- Always have a plan
You should always have a plan of how you'll approach the problem. It should include things like "what conditions could trigger the problem and how would I test for them".
- Make a list of possible causes
This will usually take the form of a checklist. The most important thing here is to be thorough. Don't skip anything that may reveal the problem. Take your time when compiling your list. Try to think of everything that could possibly have anything to do with the problem you're trying to solve, no matter how unlikely it may be. You may come across some intermittent problems only once in your lifetime.
- Plan how you're going to test your possible causes
When you have compiled a list of potential causes you'll then need to work out how you'll test each of them. Again, be thorough. Use all the tools you have available such as hardware and software diagnostics, software utilities and test plans.
- Be systematic in how you approach the problem
Make sure you stick to your plan when starting to troubleshoot the problem. If you think of additional things to check during your troubleshooting, make sure you add them to your plan. Write down everything you do and see.
Especially make sure you fully document any errors you see during the testing procedures and how you produced them. Although these may not point directly at the faulty component, a few of them may give you a clue as to what's causing them. They will also be useful if you have to contact third party support for assistance.
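The checklist and note-taking described above can be kept on paper, but a small script is one possible alternative. The sketch below is an assumption about how such a plan might be recorded; the `Plan` and `Cause` classes and their fields are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Cause:
    description: str  # what might be triggering the fault
    test: str         # how you intend to test for it
    notes: list[str] = field(default_factory=list)  # everything you did and saw
    ruled_out: bool = False

    def log(self, observation: str) -> None:
        """Record an observation with the time it was made."""
        stamp = datetime.now().isoformat(timespec="seconds")
        self.notes.append(f"{stamp} {observation}")


@dataclass
class Plan:
    symptom: str
    causes: list[Cause] = field(default_factory=list)

    def add(self, description: str, test: str) -> Cause:
        """Add a possible cause, including ones thought of mid-investigation."""
        cause = Cause(description, test)
        self.causes.append(cause)
        return cause

    def remaining(self) -> list[Cause]:
        """Causes that have not yet been ruled out."""
        return [c for c in self.causes if not c.ruled_out]


if __name__ == "__main__":
    plan = Plan("Application crashes about once a day")
    ram = plan.add("Failing memory module", "Run memory diagnostics overnight")
    plan.add("Oversized input record", "Replay captured input under stress")
    ram.log("Diagnostics completed with no errors")
    ram.ruled_out = True
    print([c.description for c in plan.remaining()])
```

Keeping the notes next to each cause means the record of what was tried, and how each error was produced, is ready to hand over if third party support gets involved.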
There are two different types of tests you can carry out on system components.
The Functional Test
The first is a functional test. This is a quick test that checks the hardware or software functionality. It may or may not throw up a problem, as it is usually just a basic test. A functional test, for example, would be unlikely to detect the buffer overflow example above because the test data would probably all be within the application's tolerances. Use this test first, but be prepared to retest if you don't find any problems.
The Comprehensive (Stress) Test
This is a full-on, no-holds-barred test. It often involves pushing data through the system of a type and volume you would rarely encounter in the field. The product diagnostics software normally includes an option to run this type of test. These tests often take a long time to run and, with some products, you have the option to repeat them continuously. This way you can leave a component running the tests overnight or even for a period of several days.
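The difference between the two kinds of test can be illustrated with a small Python sketch. Everything here is hypothetical: `process` stands in for the component under test, `LIMIT` for its tolerance, and the input ranges are assumptions chosen only to make the contrast visible.

```python
import random

LIMIT = 100  # hypothetical tolerance of the component under test


def process(value: int) -> int:
    """Hypothetical component: fails on values above LIMIT."""
    if value > LIMIT:
        raise ValueError(f"value {value} exceeds tolerance")
    return value * 2


def functional_test() -> bool:
    """Quick check with a few typical inputs, all inside tolerance."""
    return all(process(v) == v * 2 for v in (1, 10, 50))


def stress_test(iterations: int, seed: int) -> list[int]:
    """Push many inputs, including out-of-range ones, and record failures."""
    rng = random.Random(seed)
    failing_inputs = []
    for _ in range(iterations):
        value = rng.randint(0, LIMIT * 2)  # deliberately wider than field data
        try:
            process(value)
        except ValueError:
            failing_inputs.append(value)
    return failing_inputs


if __name__ == "__main__":
    print("functional test passed:", functional_test())
    print("failing inputs found by stress test:", len(stress_test(10_000, seed=0)))
```

The functional test passes because its inputs are well behaved, while the stress test collects the inputs that fail. Raising `iterations` plays the same role as leaving the diagnostics running overnight.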
Other suggestions
- If you suspect a piece of hardware is faulty and can change it, change it.
If you've narrowed down your search to a piece of hardware you can easily change, then change it for a known serviceable part. This is easier to do in computer systems where you will have easy access to components such as keyboards, mice and adaptors. It's not so easy to do if you're troubleshooting a network.
- Question the User
Make sure you thoroughly question the user to find out exactly what they were doing and what was running on the system at the time of the problem.
- See if there's a timeframe where the problem occurs more frequently
See whether there is a particular time of day when the problem occurs more often. Plan to carry out your testing during this time.
- Check the Product Log Files
If log files are available, check for unusual entries around the time the problem occurred that may be related. This could give you a lead as to which component is failing.
- Check the FAQ or Troubleshooting sections in the Product Documentation for symptoms of your problem.
If the product has a User or Administrator's Guide, see if there's a FAQ or troubleshooting section. It's possible that someone else may have seen your problem before.
- Check the Internet
Use a search engine such as Google or Altavista to see whether anyone else has come across your problem before. Check the developer's or manufacturer's website support page and FAQs, if any. Check related newsgroups or third party product support pages.
- Contact the Developer / Manufacturer directly
Often you can either email the product support group or submit an online form detailing the symptoms of your problem directly to the manufacturer of the product. This is only helpful if you know for sure the particular product that's at fault.
- Never believe any product is 100% perfect
Occasionally you'll discover new bugs in hardware or software when you're trying to locate system faults. Never assume the products from your vendors are completely perfect – because they aren't.
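Two of the suggestions above, looking for a time frame and checking the log files, lend themselves to a short script. The sketch below is illustrative only: the function names are hypothetical, and it assumes each log line begins with an ISO-style timestamp such as `2006-09-26T14:03:10`, which real products may not use.

```python
from collections import Counter
from datetime import datetime, timedelta


def failures_by_hour(timestamps: list[datetime]) -> list[tuple[int, int]]:
    """Count failures per hour of day, busiest hour first."""
    return Counter(t.hour for t in timestamps).most_common()


def entries_near(log_lines: list[str], incident: datetime,
                 window: timedelta = timedelta(minutes=5)) -> list[str]:
    """Return log lines stamped within `window` of the incident time.

    Assumes each line looks like '2006-09-26T14:03:10 message';
    adjust the parsing for the product's actual log format.
    """
    nearby = []
    for line in log_lines:
        stamp, _, _message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(when - incident) <= window:
            nearby.append(line)
    return nearby


if __name__ == "__main__":
    incidents = [
        datetime(2006, 9, 25, 14, 5),
        datetime(2006, 9, 26, 14, 40),
        datetime(2006, 9, 26, 9, 0),
    ]
    print(failures_by_hour(incidents))

    log = [
        "2006-09-26T12:00:00 service started",
        "2006-09-26T14:38:30 disk retry on controller 0",
    ]
    print(entries_near(log, incidents[1]))
```

The hourly counts suggest when to schedule testing, and the filtered log lines give a short list of entries worth reading closely.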
The main thing to remember when troubleshooting intermittent problems is that you're trying to find the series of events that trigger them. Once found, the problem becomes reproducible and can be solved in the normal way. You must be patient and systematic and follow your troubleshooting plan! Be aware that although most intermittent problems can be reproduced, some are completely random. These are usually hardware issues. If you suspect faulty hardware, change it if you can.
About the Author
Robert is a Senior Technologist currently under contract with Curtin University in Perth, Western Australia. He has over 25 years of experience in the Avionics and IT industry and has supported a wide variety of technology. Robert is also the author of the book, Secrets of Troubleshooting Systems, which describes the ins and outs of fault diagnosis in complex systems.