Attitudes I try to hold:
- Absolute confidence that cause and effect works and nothing is magic. Nothing is happening that is actually weird, only things which I don't understand.
- Absolute confidence that if I keep pushing it, I will get it resolved (this may involve taking it to someone more knowledgeable, learning, asking for help, hard work, etc).
- Grumbling about how a setup, program or scenario is badly designed or really stupid just does not help, so don't do it. (I find this hard; grumbling is fun.)
These are attitudes that are helpful for me to hold - they stop me throwing my arms up in the air, declaring something "bizarre" and then giving up, or getting unhappy because it feels "unsolvable".
Ways I think about troubleshooting:
- Systems have lots of parts. If they are connected together or configured randomly then they won't work as desired. There are one or two very specific configurations which will work - of all the millions of ways to pile bricks and metal, only a few are bridges and only one or two are good enough bridges. The cause could be a character in a text file or a failed server, but every part has to be right for the whole thing to be right. I need to be willing to be thorough and meticulous if needed. Systems cannot do "the show must go on".
- You start out with the entire system laid out like a map. Imagine a cloud of probability floating over the map representing "where the problem is". Your job is to use experience and tests to push the probability away from some areas and towards others, condensing it down to a few points which are high-probability problem locations, then attack those. This comes back to the cause and effect point - the problem is in the system, it is not magic. It is a problem which exists, so it must exist somewhere.
- Anything can be set up any way anyone wants. The only way we can define one behaviour as "OK" and another as "a problem" is because what someone is getting is not what they want. You must understand what they want and what they are getting, clearly and specifically.
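The "cloud of probability" idea can be made concrete with a toy sketch. The suspect names and starting weights here are made up for illustration, and real troubleshooting is rarely this tidy, but it shows how one test that clears several suspects concentrates suspicion on what is left.

```python
from typing import Dict, Set

def rule_out(suspicion: Dict[str, float], cleared: Set[str]) -> Dict[str, float]:
    """Drop the suspects a test has cleared and rescale the rest to sum to 1."""
    remaining = {name: w for name, w in suspicion.items() if name not in cleared}
    total = sum(remaining.values())
    if total == 0:
        # Every suspect was cleared: the map of the system is missing something.
        raise ValueError("all suspects ruled out; the list of suspects is incomplete")
    return {name: w / total for name, w in remaining.items()}

# Hypothetical starting guesses (relative weights, not measured data).
suspicion = {
    "their PC": 3.0,
    "browser": 2.0,
    "DNS": 1.0,
    "office firewall": 1.0,
    "the site": 3.0,
}

# One test (it fails on a second machine too) clears the user's own kit.
after_test = rule_out(suspicion, {"their PC", "browser"})
```

The useful part is not the numbers but the habit: each test should remove as much of the cloud as possible.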
The process of troubleshooting:
- What is the problem? Make sure you see it happening and can reproduce it yourself so there's no miscommunication. So often a problem has been through several people in our helpdesk by the time it gets to me, and still nobody can explain to me what the problem really is.
- It's recursive bisection all over again - divide and conquer, binary search - you come up with a test that will prove if the problem is this side of the test, or that side, and make the test so it eliminates as much as possible. Repeat until solved.
- Don't learn if you can avoid it - better to lock the database account and prove that the problem still happens when the database is not involved than to spend hours learning how the database is used.
- It's way too easy to find myself thinking "I don't know what to do next". Notice when that happens and go back to coming up with tests which locate the problem.
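The bisection step can be sketched in a few lines, assuming the system can be laid out as an ordered chain of stages and that there is some test which says whether everything works up to and including a given stage. Both the stage names and the `works_through` test are hypothetical stand-ins for whatever checks are actually available.

```python
from typing import Callable, Optional, Sequence

def first_broken_stage(
    stages: Sequence[str],
    works_through: Callable[[int], bool],
) -> Optional[str]:
    """Binary-search an ordered chain for the first stage that fails.

    works_through(i) must return True if the chain works up to and
    including stages[i]. Assumes a single break point: once the chain
    fails at some stage, it also fails at every later stage.
    """
    if not stages or works_through(len(stages) - 1):
        return None  # nothing to find: the whole chain works
    lo, hi = 0, len(stages) - 1  # the first failing stage is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if works_through(mid):
            lo = mid + 1  # fault is after mid
        else:
            hi = mid      # fault is at mid or before it
    return stages[lo]

# Hypothetical chain, with a stand-in test that says stage 3 is broken.
stages = ["client PC", "office firewall", "DNS", "web server", "database"]
culprit = first_broken_stage(stages, lambda i: i < 3)
```

Each test halves the remaining suspects, so five stages need at most three or four tests and a thousand need about ten.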
The Internet isn't working? Check the problem, find it's a website they can't get to. Quick tests: is their internet connection working (yes), does the site load for me (no). Quick tests point to it being the site. By seeing the problem happens for me, I've pushed the probability quickly away from their PC, browser, DNS, user account, office firewall, etc.
So the site doesn't load, now what? That's not fixable yet, so look for places to carve the problem into a smaller one. Is the server on? Does it ping? Does DNS work? Yes. Does the service answer on port 80? No. Is the service running? No. Does it start? No. Does it give errors in the event log / logfiles? Yes! What do they say?
This is efficient and fast troubleshooting because it's relentlessly focused on narrowing down the scope of the problem. If I accepted their report that the internet isn't working, I would be misled into thinking it was a connection failure. If I accepted my first sighting, that it doesn't load for them, I would waste time on their computer thinking it was at fault.
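As a rough sketch, the first few questions in that chain can be scripted with nothing but the Python standard library. The helper names (`resolves`, `port_open`, `first_failure`, `web_checks`) are my own, and the checks only cover DNS and the TCP port; whether the service is running and what the logs say still have to be checked on the server itself.

```python
import socket
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]

def resolves(host: str) -> bool:
    """Does DNS (or the hosts file) give an address for this name?"""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Does anything accept a TCP connection on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first that fails."""
    for name, check in checks:
        if not check():
            return name
    return None

def web_checks(host: str) -> List[Check]:
    """Ordered checks for a plain HTTP site, cheapest and broadest first."""
    return [
        ("DNS resolves", lambda: resolves(host)),
        ("port 80 answers", lambda: port_open(host, 80)),
    ]

# Usage, with a hostname of your own choosing:
#   print(first_failure(web_checks("www.example.com")))
```

The first failing check is where to dig next; everything before it has been carved out as "things it cannot be".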
Carve out chunks of "things it cannot be" as big as possible.
Understand the system. The more general knowledge I have about a system, the easier it gets. Where I have weak understanding, problems are more intimidating, more difficult, slower going, and more likely to end up with a workaround than a fix, or with a big dumb slow fix (reinstall) than a small, precise surgical fix.