Your troubleshooting rules, approach to troubleshooting?

Question

Do you have any general rules that you fall back on when you troubleshoot a difficult network/hardware/software problem?

Eg: "I isolate the source of the problem by testing a peripheral with a second computer" or "I remove as much hardware as is possible to power up the device, and then add back components one by one until I can reproduce the problem", etc.

maybe i should edit the title. i just know someone is going to answer "thanks! i'm proud of it" ;-) — username, May 15 '09 at 23:24

score 16 · Accepted Answer · edited Aug 21 '21 at 13:40

Just a list of points I wrote down for myself after fighting with a problem for a while:

What is your primary goal? State it clearly and concisely, preferably in one sentence. The goal should be specific, not general.
What is your problem?
Is there just one problem or many? If there are many, solve them one at a time.
Try to reproduce the problem under different conditions. Can it be reproduced under all conditions or not? Does that say anything about the nature of the problem?
If it is an urgent problem, is there a workaround? Try to find as many workarounds as possible.
Try to make as many guesses as possible on what is the cause of your problem.
Try to prove your guesses, experiment with the system.
Be consistent in what you're trying to do. Do one thing at a time.
Keep track of what you're doing, what you've already tried.
Do not deviate from your primary goal. Constantly check that you're still solving your main problem, not a different one.
Do not fixate either.

There was also a great list of debugging rules in PDF form, with examples and an explanation for each rule. I couldn't quickly find the PDF, but I think this is a poster of the list:

[image: poster of the list of debugging rules]

score 15 · Answer 2 · edited Oct 02 '10 at 13:36

1. If the problem is Internet-related, it's probably the DNS.
2. If the problem is hard to diagnose, it's probably the RAM.
3. If the problem is with a Windows workstation, it's probably quickest to reimage it.
4. If the problem is on a Friday, it's probably something serious.

answered May 16 '09 at 04:28 by Adam · edited Oct 02 '10 at 13:36 by Peter Mortensen

I wanted to downvote a joke post, but it's surprisingly accurate! – TessellatingHeckler Oct 22 '10 at 05:21
I liked #3; couldn't be more true. – Federer Jan 13 '15 at 21:03

Zoredache · Answer 3 · 2009-05-15T23:35:01.933

I like to fall back to the scientific method.

From Wikipedia (http://en.wikipedia.org/wiki/Scientific_method):

1. Define the question
2. Gather information and resources (observe)
3. Form hypothesis
4. Perform experiment and collect data
5. Analyze data
6. Interpret data and draw conclusions that serve as a starting point for new hypothesis
7. Document results

As a general rule, I always like to double-check my basic assumptions. Does it have power? Is it plugged in? Is the wiring good? It is very annoying to spend hours looking at a software issue when you have a loose cable.

I find it very important during the hypothesis creation phase to actually come up with as many possible causes of the problem as I can. Then I choose which ideas to test first based on how easy each one is to test and how probable it is.
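A rough way to picture that ordering is to score each hypothesis by how probable it seems divided by how long it takes to test. This is only a sketch: the `Hypothesis` class, the probability-per-minute score, and the example numbers are all assumptions made up for illustration, not something the answer prescribes.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    probability: float      # rough gut estimate between 0 and 1 (assumed)
    minutes_to_test: float  # rough estimate of how long the test takes (assumed)


def rank(hypotheses):
    """Order hypotheses so that likely, cheap-to-test causes come first."""
    return sorted(
        hypotheses,
        key=lambda h: h.probability / h.minutes_to_test,
        reverse=True,
    )


# Made-up example values, purely illustrative.
candidates = [
    Hypothesis("failing RAM", 0.1, 120),
    Hypothesis("misconfigured firewall rule", 0.4, 15),
    Hypothesis("loose network cable", 0.2, 1),
]

for hypothesis in rank(candidates):
    print(hypothesis.cause)
```

With these numbers the loose cable is checked first even though the firewall rule is considered more likely, because the cable takes a minute to rule out.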

It is also important to get help. Consult your coworkers, your vendor, or whoever is the most knowledgeable about the systems in question if you can. Don't spend lots of time spinning your wheels on a problem if there is someone available who can help you solve the issue.

O'Reilly's book Network Troubleshooting Tools has a good set of steps to follow that is very similar to the scientific method. I found the book very useful and strongly recommend it. It goes into a lot more detail and suggests many useful tools.

From Network Troubleshooting Tools

State your goal

Define the system

Identify possible outcomes

Identify and select what you will measure

If appropriate, identify test parameters and factors

Select tools

Establish measurement constraints

Review experimental design

Collect data

Analyze data

See Also:

3COM has a troubleshooting guide
Murphy's law - Anything that can possibly go wrong, does.
Occam's razor - All other things being equal, the simplest solution is the best.

Definitely. Albeit, step 7 is somewhat humorous. My doc usually ends up like "Yep, it's fixed. Now it works." – squillman, May 15 '09 at 23:31
I respect the scientific method, though I believe that before it comes into play there is a human factor that needs to be run through. For example, I have to consider the source of the report (the person reporting the issue)... and be careful not to assume he/she is a 'trustful' source (by trustful, I mean he/she will be a good resource for assisting me in defining the question, gathering information, and forming my first hypothesis). – l0c0b0x, May 16 '09 at 17:07

score 10 · Answer 4 · answered May 17 '09 at 16:25

(These highlights are paraphrased from the "Debugging" chapter of "The Practice of System and Network Administration")

Two things to know:

Know what the "fixed" version looks like. Preferably a command you can run that gives a certain output when things work. For example: I'm trying to figure out why SSH asks for a password when I've set up the keys properly (or so I thought). So my test is: "ssh servername uptime" and it should work without asking for a password.
Describe the problem at the right level. A user complaining that they can't ping a server should not send you off to run and fix the server. The person's job isn't to sit around and ping a machine all day. They want to get some kind of task done, like using the machine as their DNS server. Example: Once a user complained that they couldn't ping a machine halfway around the world. I spent the day tracking down sysadmins in that part of the company to find out what was wrong with that machine. It had been decommissioned, and they were in a panic because they thought maybe they had powered off the wrong machine. I contacted the user and said "besides needing to ping this machine, what would you like to be doing with it?". It turned out that he wanted to run a certain job on it, and if he had been following the proper procedure his tasks would have been automatically redirected to the replacement machine. I had wasted my entire day and the time of the local sysadmins. Another reason "I can't ping" isn't the right thing to be testing: firewalls are often configured to drop ping packets but permit other packets through. Test what you want to go through.
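The "know what fixed looks like" test can be wrapped in a small pass/fail helper so it is easy to rerun after every change. This is a minimal sketch, assuming the OpenSSH client is installed; `passes` is a hypothetical helper name, and `servername` is the placeholder host from the example above. The `BatchMode=yes` option makes ssh fail instead of prompting for a password, so the check cannot hang waiting for input.

```python
import subprocess


def passes(command, timeout=15):
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Command not found, not executable, or it took too long.
        return False
    return result.returncode == 0


# "servername" is a placeholder; BatchMode=yes disables password prompts,
# so this only succeeds when key-based login really works.
KEY_LOGIN_CHECK = ["ssh", "-o", "BatchMode=yes", "servername", "uptime"]

# Rerun after each change:
#     passes(KEY_LOGIN_CHECK)
```

The point of the design is that the check has exactly two outcomes, so there is no arguing about whether the problem is fixed.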

Two strategies:

Additive: Keep adding components until the problem starts. The last thing you added is the problem. Example: Web browsers can't talk to a server. Between the server and the user is a load balancer, a firewall, a cache, and the user's local web proxy. First try sending queries directly to the server, then through the LB to the server, then through the firewall to the LB to the server, etc. etc. each time adding one component.
Subtractive: Keep removing components until the problem goes away. The last thing you removed was the problem. Example: A machine with dozens of cards won't boot. Keep removing cards until the machine boots.
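Both strategies can be sketched as simple loops. The function names and the `works` callback below are hypothetical: `works` stands in for whatever real test you run (sending a query, trying to boot), and the faults in the examples are simulated.

```python
def first_breaking_component(components, works):
    """Additive: add components one at a time until the test fails.

    `works(active)` receives the components currently in place and returns
    True if the system works. Returns the component whose addition broke
    things, or None if everything works with all components added.
    """
    active = []
    for component in components:
        active.append(component)
        if not works(active):
            return component
    return None


def first_fixing_removal(components, works):
    """Subtractive: remove components one at a time until the test passes.

    Returns the component whose removal made the problem go away, or None
    if the problem persists with everything removed.
    """
    remaining = list(components)
    for component in components:
        remaining.remove(component)
        if works(remaining):
            return component
    return None


# Simulated faults, for illustration only.
path = ["server", "load balancer", "firewall", "cache", "local proxy"]
print(first_breaking_component(path, lambda active: "firewall" not in active))

cards = ["network card", "RAID card", "video card"]
print(first_fixing_removal(cards, lambda remaining: "RAID card" not in remaining))
```

Both loops assume a single culprit; with two faulty components they only find the first one, so the test needs rerunning after each fix.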

Two bits of dumb luck:

Forget everything I said. The problem is being caused by the last change made to the system. (this works 99% of the time... the problem is that 99% of the time you don't know what the last change actually was)
When all else fails, check for stupid things. http://whatexit.org/tal/mywritings/dumb-things-to-check.html Example: A crazy problem just couldn't be explained. Then we checked the configuration file: a user had edited it by copying it to a Windows box, editing it, then copying it back. It now had a ^M at the end of every line. We never noticed because our text editor silently hid this fact. Sadly, the software that read the configuration file turned those ^Ms into a non-break space which screwed up tons of other procedures.
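The ^M case is easy to check for directly, since an editor may hide the carriage returns. A minimal sketch (the function name is made up for illustration) that reads the file in binary mode so that no line-ending translation happens:

```python
def lines_with_carriage_return(path):
    """Return the 1-based numbers of lines ending in CRLF (displayed as ^M by some tools)."""
    numbers = []
    # Binary mode: text mode could silently translate the line endings.
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if line.endswith(b"\r\n"):
                numbers.append(number)
    return numbers
```

An empty list means no line in the file ends in a carriage return; any other result lists the lines worth a closer look.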

score 6 · Answer 5 · answered Oct 22 '10 at 06:04

Attitudes I try and hold:

Absolute confidence that cause and effect works and nothing is magic. Nothing is happening that is actually weird, only things which I don't understand.
Absolute confidence that if I keep pushing it, I will get it resolved (this may involve taking it to someone more knowledgeable, learning, asking for help, hard work, etc).
Grumbling about how a setup, program or scenario is badly designed or really stupid just does not help, so don't do it. (I find this hard, grumbling is fun).

These are attitudes that are helpful for me to hold - they stop me throwing my arms up in the air, declaring something "bizarre" and then giving up, or getting unhappy because it feels "unsolvable".

Ways I think about troubleshooting:

Systems have lots of parts. If they are connected together or configured randomly, they won't work as desired. There are one or two very specific configurations which will work - of all the millions of ways to pile bricks and metal, only a few are bridges and only one or two are good enough bridges. The cause could be a character in a text file or a failed server, but every part has to be right for the whole thing to be right. I need to be willing to be thorough and meticulous if needed. Systems cannot do "the show must go on".
You start out with an entire system like a map, you imagine a cloud of probability floating over the map representing "where the problem is" and your job is to use experience and find tests to push the probability away from some areas and towards others and to condense it down to points which are high probability problem locations, then attack those. This comes back to the cause and effect point - the problem is in the system, it is not magic. It is a problem which exists so it must exist somewhere.
Anything can be set up any way anyone wants. The only way we can define one behaviour as "OK" and another as "a problem" is because what someone is getting is not what they want. You must understand, clearly and specifically, what they want and what they are getting.

The process of troubleshooting:

What is the problem? Make sure you see it happening and can reproduce it yourself so there's no miscommunication. So often problems have been through several people on our helpdesk by the time they get to me, and still nobody can explain to me what the problem really is.
It's recursive bisection all over again - divide and conquer, binary search - you come up with a test that will prove if the problem is this side of the test, or that side, and make the test so it eliminates as much as possible. Repeat until solved.
Don't learn if you can avoid it - better to lock the database account and prove that the problem still happens when the database is not involved than to spend hours learning how the database is used.
It's way too easy to find myself thinking "I don't know what to do next". Notice when that happens and go back to coming up with tests which locate the problem.
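The bisection idea can be sketched as a plain binary search. This assumes the things being tested can be put in an order where everything before the culprit tests good and everything from the culprit onward tests bad (for example a chain of hops, or a sequence of changes). `first_bad` and `is_bad` are hypothetical names, and the fault is simulated.

```python
def first_bad(items, is_bad):
    """Binary search for the first item for which is_bad(item) is True.

    Assumes every item before the culprit is good and the culprit and every
    item after it are bad. Returns None if nothing tests bad.
    """
    if not items or not is_bad(items[-1]):
        return None
    low, high = 0, len(items) - 1
    while low < high:
        mid = (low + high) // 2
        if is_bad(items[mid]):
            high = mid        # culprit is at mid or earlier
        else:
            low = mid + 1     # culprit is after mid
    return items[low]


# Simulated example: 100 changes, and the problem first appears at change 37.
changes = list(range(1, 101))
print(first_bad(changes, lambda change: change >= 37))
```

Each test halves the remaining range, so 100 candidates need only a handful of tests instead of up to 100.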

The Internet isn't working? Check the problem and find it's a website they can't get to. Quick tests: is their internet connection working (yes), does the site load for me (no). Quick tests point to it being the site. By seeing that the problem happens for me too, I've quickly pushed the probability away from their PC, browser, DNS, user account, office firewall, etc.

So the site doesn't load, now what? That's not fixable yet, so look for places to carve the problem into a smaller one. Is the server on? Does it ping? Does DNS work? Yes. Does the service answer on port 80? No. Is the service running? No. Does it start? No. Does it give errors in the event log / logfiles? Yes! What do they say?

This is efficient and fast troubleshooting because it's relentlessly focused on narrowing down the scope of the problem. If I accepted their report that the internet isn't working, I would be misguided into thinking it a connection failure. If I accepted my first sighting that it doesn't load for them, I would waste time on their computer thinking it is at fault.
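Two of the quick tests in that walkthrough ("does DNS work?" and "does the service answer on port 80?") can be scripted with the standard socket module. This is a minimal sketch; the function names and the default timeout are assumptions, and a successful TCP connection only shows that something is listening, not that the service is healthy.

```python
import socket


def resolves(hostname):
    """Return True if the hostname can be resolved to an address."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return True


def port_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

If `resolves` is true but `port_answers` is false for port 80, the problem has been carved down to the server or service side, which matches the next questions in the walkthrough (is the service running, does it start, what do the logs say).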

Carve out chunks of "things it cannot be" as big as possible.

Understand the system. The more general knowledge I have about a system, the easier it gets. Where I have weak understanding, problems are more intimidating, more difficult, slower going, and more likely to end up with a workaround than a fix, or with a big dumb slow fix (reinstall) than a small, precise surgical fix.

score 6 · Answer 6 · answered May 16 '09 at 12:43

General practices I remember during the whole process:

Write everything I do down.
Make only one change at a time.
If possible, reverse the change before trying another unless definite progress is being made.

During troubleshooting, this is my basic methodology:

When the system is up and running well, before there's a problem, I try to learn to see what it's doing. Joe Richards explains why a lot better than I could in this short space.
I start with the simplest solution. For instance, no network connectivity? Check the physical layer. I can't tell you how many times intermittent connection problems weren't a server issue but a network cable that was half-in or one that had gone bad.
I try to capture all of the symptoms I can see from all the likely sources before I start making changes.
I run preliminary diagnostic tests. For instance, when I get told a server is down, the first thing I do is use ping and nbtstat (Windows) to verify that. The problem could be at the distant end (to borrow an old Air Force tech control saying).
I am not afraid to do the research. Google, support.microsoft.com, eventid.net and sites like that are your friends.
I am not afraid to ask for help from the community. That means not just sites like serverfault.com; I also keep in contact with a good assortment of folks I trust and respect on Twitter.
I evaluate the answers I'm finding against what I'm seeing. I don't assume that any one solution is the right one until I have weighed the evidence I'm seeing against what is reported in the solution.

score 4 · Answer 7 · answered May 16 '09 at 00:03

Generally I ask "What has changed that might have caused this problem?" Most issues are caused by changes to known good configurations. If you can isolate who made the change then you usually get your answer.
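On a single machine, one way to start answering "what has changed?" is to list recently modified files. This is only a sketch under stated assumptions: that the relevant configuration lives under one directory tree and that file modification times can be trusted. The function name and the 24-hour default are illustrative, not part of the answer.

```python
import os
import time


def recently_modified(root, hours=24.0):
    """Return paths under root modified within the last `hours`, newest first."""
    cutoff = time.time() - hours * 3600
    changed = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                changed.append((mtime, path))
    return [path for _mtime, path in sorted(changed, reverse=True)]
```

This will not show who made a change or catch changes made elsewhere (another server, a network device), so it narrows the question rather than answering it.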

answered May 16 '09 at 00:03 by PowerApp101

score 2 · Answer 8 · edited Oct 02 '10 at 13:38

I think it's a skill, not a science. There are times when you go down the wrong path but for the most part:

Having a good basic understanding of all the associated technologies (network, hardware, OS, software, development, etc.) will help you eliminate some of those "wrong paths".
Think basic - don't jump to the most complicated scenario just because it's in your head; perform your basic troubleshooting and let it lead you.

I once had my boss call me with a "senior" engineer on the phone. He was telling me that he had one server that could not connect, and he had tried switching the cable but still no joy. I could hear beeping in the background, like a UPS on batteries. I asked him if he could see activity on the switch; he said no. I asked him if the beeping was coming from the UPS; he said yes. I asked him if he could see any lights on at all in the rack; he said no... Look beyond your nose - it helps!

score 1 · Answer 9 · answered May 15 '09 at 23:22

I start by checking the obvious. Is there an error message explaining what the problem is? Is everything connected properly? I don't like wasting several hours troubleshooting something that could have been solved in a few minutes. I think it's possible to be too methodical. I've seen people waste an entire day reproducing a problem despite the fact I told them precisely what the problem was. That's not what I pay them for.

If the answer isn't obvious, line up some suspects and test those first. Only after you test the likely suspects should you test the unlikely suspects. Then you can be as scientific as you want.

answered May 15 '09 at 23:22 by Scott

hmm. i partly agree - or at least i think it's easy to follow someone else's rules without truly understanding how/when they're appropriate. Like high-school students who are forced to study math, but who wouldn't recognize a situation where they could use what they have learned in real life. But understanding the right time to apply the right rule can really be a boon. Eg: Google "HalfSplit method" for an example of a demonstrably efficient troubleshooting rule – username May 16 '09 at 00:41
Your method of ruling out the obvious isn't unscientific. You are just running through several iterations of the hypothesize and testing steps quickly. I strongly agree you should give priority to ideas you can test for quickly. – Zoredache May 16 '09 at 02:54
