How to Troubleshoot Intermittent Problems



Posted: Tuesday, September 26, 2006

by
Perth Copywriting Services

One of the biggest challenges facing the service and support engineer is locating and resolving the Intermittent System Problem. By its nature, this type of problem does not usually present itself in a consistent manner. This means that you can't easily track it down using the Split and Test or Top Down problem location techniques because these methods require the problem to be reproducible.

Intermittent problems affect hardware, operating systems (both on disk and in the BIOS or CMOS), and software applications. I've even seen this problem surface in diagnostics software!

There are a few issues encountered when trying to identify and repair an intermittent fault:

  1. Where to start looking
  2. What to look for
  3. What procedures to follow
     
One of the most interesting facts about intermittent problems is that the majority of them don't occur completely at random. They might appear to, but there is usually a specific set of conditions that triggers them.

Consequently, many intermittent problems are reproducible!

Once you know the conditions that trigger the problem, you can locate and resolve it the usual way.
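
One way to hunt for triggering conditions is to compare notes from runs that failed against notes from runs that worked. The following is a minimal sketch of that idea, not a definitive method. The function name `candidate_triggers` and the condition labels ("large file", "network drive" and so on) are made up for illustration.

```python
def candidate_triggers(failing_runs, passing_runs):
    """Return conditions present in every failing run and in no passing run."""
    if not failing_runs:
        return set()
    # Conditions that were present every time the fault appeared.
    common = set.intersection(*(set(run) for run in failing_runs))
    # Conditions that were also present when everything worked.
    seen_when_ok = set().union(*(set(run) for run in passing_runs))
    return common - seen_when_ok


# Hypothetical observations written down during troubleshooting.
failing = [
    {"large file", "network drive", "low memory"},
    {"large file", "network drive", "printing"},
]
passing = [
    {"network drive", "printing"},
    {"low memory"},
]

suspects = candidate_triggers(failing, passing)
print(suspects)
```

With these notes, only "large file" survives as a suspect. The sketch is deliberately simple: a fault that needs two conditions at once, each of which is harmless alone, would be filtered out, so treat an empty result as a prompt to record more detail rather than as proof that the problem is random.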

Types of Intermittent Problems

There are two types of Intermittent Problems: Reproducible and Random.

Reproducible Intermittent Problems

The Reproducible Intermittent Problem is always triggered by the same set of conditions and can therefore be reproduced once the conditions are found. This type of problem only appears to be random because the triggering conditions don't occur regularly.

For example, a software application may fail seemingly at random if it has a buffer overflow issue. Perhaps 99% of the data stored to the buffer falls within tolerance, but a bug in the code allows the remaining data to exceed the buffer's storage capacity, so that 1 time in 100 the buffer overflows and crashes the application. Although this appears to occur randomly, it is reproducible under the right conditions. It is still an intermittent problem because it doesn't occur every time data is stored to the buffer.
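
The buffer example can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than real application code: `BUFFER_SIZE`, `store` and the sample records are all hypothetical, and because Python checks bounds, an `IndexError` stands in here for the crash that an unchecked overflow would cause in a language such as C.

```python
BUFFER_SIZE = 8  # hypothetical capacity, chosen only for illustration


def store(record: bytes) -> bytearray:
    """Copy a record into a fixed-size buffer without checking its length."""
    buffer = bytearray(BUFFER_SIZE)
    for i, byte in enumerate(record):
        buffer[i] = byte  # raises IndexError once i reaches BUFFER_SIZE
    return buffer


def find_failing_records(records):
    """Return every record that makes store() fail."""
    failing = []
    for record in records:
        try:
            store(record)
        except IndexError:
            failing.append(record)
    return failing


# 99 records fit in the buffer; one is a byte too long.
records = [b"ok-data"] * 99 + [b"too-long!"]
failing = find_failing_records(records)
print(len(failing), "of", len(records), "records failed:", failing)
```

Seen from outside, the application fails about 1 time in 100. Once the oversized record is identified, feeding it to `store` fails every time, which is what makes the problem reproducible.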

You could safely say that all intermittent software problems are reproducible. You just need to find the conditions that trigger them…

Random Intermittent Problems

Random Intermittent Problems are by far the most difficult to locate as they do occur completely at random. They are generally hardware faults where a component within the hardware device is starting to fail. This type of problem makes up only a very small percentage of all intermittent problems because, once the component has completely failed, the related problem usually becomes reproducible and easy to find.

Solution Strategies

The main problem you face is how to confirm and locate the fault. The symptoms will give you an idea of where the problem lies, but unfortunately there are no easy shortcuts when searching for it. However, there are things you can do to make the search less frustrating.

  1. Always have a plan

    You should always have a plan for how you'll approach the problem. It should include questions such as "What conditions could trigger the problem, and how would I test for them?"
     
  2. Make a list of possible causes

    This will usually take the form of a checklist. The most important thing here is to be thorough. Don't skip anything that may reveal the problem. Take your time when compiling your list. Try to think of everything that could possibly have anything to do with the problem you're trying to solve, no matter how unlikely it may be. You may come across some intermittent problems only once in your lifetime.
     
  3. Plan how you're going to test your possible causes

    When you have compiled a list of potential causes you'll then need to work out how you'll test each of them. Again, be thorough. Use all the tools you have available such as hardware and software diagnostics, software utilities and test plans.
     
  4. Be systematic in how you approach the problem

    Make sure you stick to your plan when starting to troubleshoot the problem. If you think of additional things to check during your troubleshooting, make sure you add them to your plan. Write down everything you do and see.

    Especially make sure you fully document any errors you see during the testing procedures and how you produced them. Although these may not point directly at the faulty component, a few of them may give you a clue as to what's causing them. They will also be useful if you have to contact third-party support for assistance.
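
The four steps above amount to keeping a written checklist of causes, tests and observations. A paper notebook works fine, but as a rough sketch, the same record could be kept in a small script. The `Check` class, its fields and the example causes below are all hypothetical and only show one possible shape for such a log.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Check:
    cause: str                 # what might be wrong
    test: str                  # how you plan to test for it
    result: str = "untested"   # "untested", "passed" or "failed"
    notes: list = field(default_factory=list)

    def record(self, result, note):
        """Store the outcome with a timestamped note of what was done and seen."""
        self.result = result
        self.notes.append(f"{datetime.now():%Y-%m-%d %H:%M} {note}")


# Steps 2 and 3: list the possible causes and how each will be tested.
plan = [
    Check("Faulty memory module", "Run memory diagnostics overnight"),
    Check("Loose network cable", "Reseat cable and monitor link errors"),
]

# Step 4: work through the plan, writing down everything you do and see.
plan[0].record("passed", "Diagnostics ran 8 hours with no errors")

# A new idea that comes up during troubleshooting is added to the plan.
plan.append(Check("Overheating under load", "Log temperatures during stress test"))

untested = [check.cause for check in plan if check.result == "untested"]
for check in plan:
    print(f"[{check.result}] {check.cause}: {check.test}")
```

The `untested` list shows at a glance what is still outstanding, and the timestamped notes are the kind of record that can be handed to a support contact if you need outside help.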
     
Testing Procedures

There are two different types of tests you can carry out on system components.

The Functional Test

The first is a functional test. This is a quick test that checks the hardware or software functionality. It may or may not reveal a problem, as it is usually just a basic test. A functional test, for example, would be unlikely to detect the buffer overflow described above because the test data would probably all be within the application's tolerances. Use this test first, but be prepared to retest if you don't find any problems.

The Comprehensive (Stress) Test

This is a full-on, no-holds-barred test. It often involves pushing data through the system of a type and volume you would rarely encounter in the field. The product diagnostics software normally includes an option to run this type of test. These tests often take a long time to run and, with some products, you have the option to repeat them continuously. This way you can leave a component running the tests overnight or even for a period of several days.
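
The difference between the two kinds of test can be shown with a small sketch. Everything here is an assumption made for illustration: `component_under_test` is a stand-in with a deliberately planted fault (it rejects payloads longer than 64 bytes), and the test functions are not taken from any real diagnostics package.

```python
import random


def component_under_test(payload: bytes) -> int:
    """Hypothetical component that fails on payloads longer than 64 bytes."""
    if len(payload) > 64:
        raise ValueError("payload too long")
    return len(payload)


def functional_test() -> bool:
    """Quick check with typical data; it never exercises the fault."""
    return component_under_test(b"typical input") == 13


def stress_test(iterations: int, seed: int) -> list:
    """Repeat the test with varied data and log every failure with its input size."""
    rng = random.Random(seed)  # fixed seed so a failing run can be repeated
    failures = []
    for i in range(iterations):
        size = rng.randint(0, 80)  # includes sizes rarely seen in the field
        payload = bytes(rng.getrandbits(8) for _ in range(size))
        try:
            component_under_test(payload)
        except Exception as exc:
            failures.append((i, size, repr(exc)))
    return failures


print("functional test passed:", functional_test())
failures = stress_test(iterations=1000, seed=1)
print("stress test failures:", len(failures))
```

The functional test passes because its data stays within tolerance, while the repeated test with a wider range of data turns up the fault and records the input size each time. Logging the seed and the failing sizes is a design choice that turns a failure found overnight into one that can be reproduced on demand.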

Other suggestions

Summary

The main thing to remember when troubleshooting intermittent problems is that you're trying to find the series of events that triggers them. Once found, the problem becomes reproducible and can be solved in the normal way. You must be patient and systematic and follow your troubleshooting plan! Be aware that although most intermittent problems can be reproduced, some are completely random. These are usually hardware issues. If you suspect faulty hardware, change it if you can.

About the Author

Robert is a Senior Technologist currently under contract with Curtin University in Perth, Western Australia. He has over 25 years of experience in the Avionics and IT industry and has supported a wide variety of technology. Robert is also the author of the book, Secrets of Troubleshooting Systems, which describes the ins and outs of fault diagnosis in complex systems.