Users can't get to their e-mail, the CEO can't get to the company's home page, and your pager just went off with a "911" code. What do you do when everything blows up?
20 Answers
Stay Calm
Don't freak out. Breathe! (From the diaphragm, it helps.) If you've studied meditation, that can help too.
When faced with extreme stress your body will go into fight-or-flight mode, because it thinks it's in a life-or-death situation. At this time your body will actually pump less blood to some parts of your brain, lessening functions like reasoning. This effectively lowers your IQ as instinct, instead of rationality, begins to dominate your brain functions. If you've ever been in or witnessed a heated argument you may recognize these symptoms as people's emotions flare and rationality takes a holiday. Later, when people have had a chance to cool down, they are more likely to accept having made a mistake or having been wrong, and are more capable of seeing the other side. In the heat of the moment, less so.
Maintaining your composure and keeping your wits about you will keep your brain functioning at full capacity and ensure you make rational decisions based on evidence and reason rather than emotion and fear.
Triage
Efficient application of limited resources to achieve the greatest benefit at the lowest cost is supremely important here. Decide as early as possible which things have to be fixed RIGHT NOW, which can wait a little while (hours, days), and which can wait indefinitely. Also learn to recognize when something is unsalvageable and not worth saving. For example, if half the router has melted, you can't save it, even if it's your only one. Buy a new one and get it on site post-haste, or find something that can fill the gap temporarily.
Retain Situational Awareness
Don't allow your attention to be trapped by some interesting problem or by something you don't quite understand yet. Keep focused on the big picture and on getting the most important things working.
Use the Scientific Method
Form a hypothesis. Determine how you would test this hypothesis. Gather data to test the hypothesis. Look for disconfirming data as well. Refine your hypothesis and repeat the cycle as many times as necessary until you have enough confidence in your hypothesis to take action.
Be Pragmatic
Now is not the time for dogma. It's ok to take a few shortcuts here and there when recovering from disaster. This is essentially accruing technical debt. At many companies, catastrophic failure means catastrophic loss of revenue. It's better to get things running, even if on a shaky footing, than to dilly-dally and risk the livelihood of your company. As always, judgement is supremely important here. Sometimes it makes sense to prop up a box fan pointed at a server rack, sometimes it doesn't.
Look After Yourself
How long have you been working on this emergency? When was the last time you had a drink of water? When was the last time you ate? How long have you been awake? Don't burn yourself out just because there's an emergency. Take the time to keep hydrated, fed, and rested (in case it's a long, multi-day slog).
Recruit Help
There are almost certainly many talented folks in your company who are both motivated to help and capable of helping. Be wary of getting too many people running around and causing trouble for each other, though. Also be wary of annoying people by putting them through a "fire drill". Find people who already want to help, get them working on targeted tasks, and make sure people are communicating with each other.
Communicate
Communication is critical. Nothing is as scary as the unknown. When people know nothing other than that something is broken, an empty statement that it'll be back up in X hours is only mildly reassuring (even less reassuring after X hours have passed and things are still broken). The pressures at play can steer you toward giving overly optimistic WAG time estimates, but this is the wrong course. Don't just say you're working on it, and don't just say things will be fixed by X time. Be open, show your process, detail your progress and your setbacks. Provide insight into the problem, your process in tracking it down, and your plan for fixing things (though don't drown people in minutiae). Show that the problem is not intractable, show that things will be made right eventually, and show that there are competent people on the problem. These things are more reassuring than baseless time-schedule promises.
The first answer is stay calm! I learned the hard way that panicking often just makes things worse. Once that's achieved, the next thing is to actually ascertain what the problem is. Complaints from users and managers will be coming at you from all angles, telling you what THEY cannot do, but not what the problem is.
Once you know the problem you can start the plan to fix it and start giving your angry users a timescale!
This is a reactive plan. A true disaster recovery plan is already written and tested for every critical business process. – spoulson Apr 30 '09 at 13:43
spoulson: sure, but the first thing to do is figure out if you need to activate the plan or if flipping the circuit breaker will fix it all. – pjz Apr 30 '09 at 16:22
This is actually the best thing to do. PERFECT POST! After that, you have to be able to carry all the pressure on your back because, as said in a comment above, everyone will rush to your office to tell you that they can't get where they want. Most of the time users are really selfish in these moments and don't want to understand at all. They just want THEIR things to work and don't care about the rest... So I totally agree with your post! – Marc-Andre R. May 07 '09 at 18:52
+1 for distinguishing "the problem" from the symptoms. – bmb May 08 '09 at 23:19
Don't Panic.
Grab a towel and leave a message saying "So long, and thanks for all the fish". – Jauder Ho May 01 '09 at 01:26
Check the basics first. It seems silly, but check things like:
- Is the power on at the server facility? (if you host off-site)
- Is your hosting provider down?
I know that a lot of time can be wasted looking for a solution when the problem is upstream.
Yep. If it's all going down, check the datacentre and their support forums. If there are 30 people online when there are usually 3, it's hitting the fan. – Alister Bulman Apr 30 '09 at 08:59
Sorry, but this question is already perfectly answered in Favorite sysadmin cartoon:
I ping stuff. What happens after that varies greatly depending on the results of the ping.
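As a rough illustration of this ping-first habit, here is a minimal Python sketch. The helper names (`ping_command`, `is_reachable`, `sweep`) are made up for this example, and it assumes a `ping` binary on the PATH that takes `-c` (Unix-like) or `-n` (Windows) for the packet count.

```python
import platform
import subprocess


def ping_command(host, system=None):
    """Build a one-packet ping command for the current (or given) platform."""
    system = system or platform.system()
    # Assumption: Windows ping uses -n for the count, Unix-like systems use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def is_reachable(host, run=subprocess.run):
    """Return True if a single ping exits with status 0.

    Treat True as a hint only: on some platforms ping may exit 0 even
    when the reply is an "unreachable" message.
    """
    try:
        result = run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or no ping binary available.
        return False
    return result.returncode == 0


def sweep(hosts, check=is_reachable):
    """Ping each host and map its name to True/False."""
    return {host: check(host) for host in hosts}
```

Calling `sweep(["db01", "print-licence01"])` with your own host names (these two are placeholders) gives a quick picture of which boxes answer and which don't, and that picture decides what to look at next.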
Used this method today. Lots of PCs couldn't print. Tried to ping database server, OK. Tried to ping printer licence server, no response. Result = Server fault! – Swinders Apr 30 '09 at 08:58
Nice point ;) I do that many times a day before doing anything else. It's actually such a time saver :P – Marc-Andre R. May 07 '09 at 18:56
RTFLF - Read the Frakkin' Log File
(I can't take credit for this, it all goes to Scott Hanselman)
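For what it's worth, here is a minimal Python sketch of a first pass over a log: it keeps only the tail of the file and returns the lines that look like trouble. The function name `recent_errors`, the 500-line default, and the keyword pattern are all assumptions for illustration. Real logs need patterns tuned to the service that writes them.

```python
import re
from collections import deque

# Assumed keyword list; adjust to whatever your services actually log.
ERROR_PATTERN = re.compile(
    r"\b(error|fatal|critical|panic|fail(ed|ure)?)\b", re.IGNORECASE
)


def recent_errors(path, tail_lines=500, pattern=ERROR_PATTERN):
    """Return suspicious lines from the last `tail_lines` lines of a log file."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the end of the file in memory.
        tail = deque(handle, maxlen=tail_lines)
    return [line.rstrip("\n") for line in tail if pattern.search(line)]
```

This only narrows down where to start reading. It doesn't replace reading the surrounding lines for context.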
Don't try to fix anything yet.
Make sure you know exactly what the real, underlying problem is. Now start fixing things. If there are multiple things to fix, carefully consider which things can be delayed (hopefully until the next working day, at least!) and which absolutely must be fixed now.
But most importantly: once everything is working, ask why "everything blew up". What are you going to do to prevent this happening again? Are there any steps that would make the solution easier if it does happen again?
Check the cabling! I've lost hours checking other stuff when a simple Eth0 cable swap would have solved the problem...
Actually, a cable doesn't die for no reason. If it isn't properly stacked, wrapped, or otherwise protected, and anyone can play with it, then yes, a cable is likely to break. Otherwise, there's no reason. – Marc-Andre R. May 07 '09 at 19:01
Let people know that you're on it and, if possible, give them an estimate of when things will be back to normal.
As for actual troubleshooting, that obviously depends on what is wrong. I usually keep a collection of "check status" scripts for various services.
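A "check status" script can be as small as a TCP connect test. The following Python sketch is only one possible shape for such a script, not the poster's own. The service names, hosts, and ports in `SERVICES` are placeholders, and `port_open` and `check_all` are hypothetical helper names.

```python
import socket

# Placeholder inventory: replace with your own services, hosts, and ports.
SERVICES = {
    "web": ("www.example.com", 80),
    "mail": ("mail.example.com", 25),
}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_all(services, probe=port_open):
    """Probe every service and map its name to True (up) or False (down)."""
    return {name: probe(host, port) for name, (host, port) in services.items()}
```

An open port only shows that something is listening. A fuller script would also speak the protocol (for example, fetch a page or read the SMTP banner).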
This is an excellent point. Prevention is the key to avoiding a big disaster ;) – Marc-Andre R. May 07 '09 at 18:58
Make sure the backup of your resume is safe :) Then,
Find the commonalities. What's common to all the systems that are affected?
Find what's changed. You should have some formal change management going on in your organisation.
Where's the new guy... where's the boss...? Did one of them take a shortcut? ("It's just a quick server reboot, what could it possibly hurt?")
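If you keep even a rough map of what each system depends on, the "find the commonalities" step can be sketched with set arithmetic. The dependency map below is invented for illustration, and `shared_dependencies` and `suspects` are hypothetical helper names.

```python
def shared_dependencies(dependencies, affected):
    """Return the dependencies common to every affected system."""
    sets = [set(dependencies[name]) for name in affected]
    return set.intersection(*sets) if sets else set()


def suspects(dependencies, affected):
    """Shared dependencies of the affected systems that no healthy system uses."""
    affected = set(affected)
    common = shared_dependencies(dependencies, affected)
    healthy = set()
    for name, deps in dependencies.items():
        if name not in affected:
            healthy.update(deps)
    return common - healthy


# Invented example inventory.
dependencies = {
    "mail": ["switch-a", "dns", "san"],
    "web": ["switch-a", "dns", "lb"],
    "wiki": ["switch-b", "dns"],
}
print(suspects(dependencies, ["mail", "web"]))  # prints {'switch-a'}
```

Removing whatever the healthy systems also use is a heuristic, not proof. A shared component can fail partially, so treat the result as the first place to look.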
I like this troubleshooting list: "Simple Trouble Shooting Application Now Fixes Everything" =)
You should have contingency plans.
Essential systems should be designed with either automatic failover or a documented and tested recovery plan.
The more important the system, the more resilience you need to build in and the more automatic it should be.
If you don't have one, then it wasn't important, was it?
It's difficult to provide a specific set of actions from the statement alone. Your first move will be based on:
- Where you are
- How much information you are able to squeeze out of the person that contacted you
- What tools you have immediately at hand for troubleshooting (or information seeking)
- Your knowledge about the physical and logical paths for your network
- How much help you have (part of a team? or lonesome ninja?)
Obviously, you need to keep calm and alert about the issue at hand. Your experience with network troubleshooting will have taught you that this could very well be something trivial, like:
- A disconnected cable
- An unannounced maintenance (another tech 'fixing' things)
- Your CEO overreacting about the company being completely doomed after their laptop's wireless connectivity is lost because they microwaved a cheese pizza.
Having said that, it could also be something serious in the categories of:
- Physical Transport (connectivity)
- Hardware (router\switch\server)
- Storage (inaccessible\compromised\deleted)
- Software (service misconfigured\attacked\offline)
The key component is how much YOU KNOW about the issue. What's your reference point? (from what perspective is 'the system down'?).
Check DNS.
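A quick way to do this from Python, sketched with the standard library's `socket.getaddrinfo`: it uses whatever resolver the local machine is configured with, so it tells you what this host sees, not what the rest of the world sees. The helper name `resolve` is made up for this example.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name did not resolve (or the resolver itself is unreachable).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

An empty list for a name that normally resolves, or an address you don't recognize, is a strong hint that the problem is DNS and not the service behind it.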
Start simple and work towards the absurd.
Power?
Ethernet?
Program running?
...
Aliens?
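The "start simple" ordering can be sketched in Python as a list of checks run from the cheapest to the most exotic, stopping at the first one that fails. The check functions here are hypothetical stand-ins that return fixed values. Real ones would probe power, link state, process lists, and so on.

```python
def first_failure(checks):
    """Run (name, check) pairs in order and return the name of the first failing check."""
    for name, check in checks:
        if not check():
            return name
    return None  # everything passed


# Hypothetical stand-ins, ordered from simple to absurd.
checks = [
    ("power", lambda: True),
    ("ethernet", lambda: True),
    ("program running", lambda: False),
    ("aliens", lambda: True),
]
print(first_failure(checks))  # prints "program running"
```

Stopping at the first failure keeps you from chasing exotic causes while a basic one is still broken.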