Questions tagged [disaster-recovery]

Disaster recovery and preparedness is an unfortunate aspect of systems administration. This tag should be used for help with planning, implementation and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can take what was a great day and turn it into a living nightmare when you are responsible for keeping systems others rely on available.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before you can properly prepare for a disaster, you need to understand the risks, bottlenecks and other critical components of the overall system, e.g. who controls the power, internet, etc. at your site. Knowing which aspects of disaster recovery are within your control is essential when planning; if no one on staff can fix the power, HVAC, etc., make sure that the contact info for someone who can is written down somewhere. Having this information available before a disaster occurs will help keep everyone calm, cool and on-task when something actually does happen.

Once the risks are assessed and a plan is created, print out physical copies, email it, and make sure everyone with admin-level access to the systems/datacenter has read it and is familiar with it. The best plan in the world is worthless if it is on a system that is down and cannot be easily restored without following the plan. After everyone is familiar with the plan, practice when possible; a full rehearsal may not be realistic in many situations, but take advantage of planned downtimes or natural outages to go through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't Panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized; use checklists and keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant; update your plan as the environment changes.
  7. Check your backups; make sure they happen at regular intervals and that the data contained therein is still good.
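
As an illustration of item 7, here is a minimal Python sketch of an automated backup check. It assumes each backup is a single file and that a SHA-256 checksum was recorded when the backup was written; the function names, the example path and the 24-hour threshold are hypothetical and not part of any particular backup tool.

```python
import hashlib
import time
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(
    path: Path,
    expected_sha256: str,
    max_age_hours: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means both checks passed.

    expected_sha256 is assumed to have been recorded when the backup was made.
    now is a Unix timestamp and defaults to the current time.
    """
    if not path.is_file():
        return [f"{path}: backup file is missing"]

    problems = []
    current = time.time() if now is None else now

    # Check 1: the backup was written recently enough.
    age_hours = (current - path.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: backup is {age_hours:.1f} hours old")

    # Check 2: the file still matches the checksum recorded at backup time.
    if sha256_of(path) != expected_sha256:
        problems.append(f"{path}: checksum does not match the recorded value")

    return problems
```

A scheduled job could call, for example, `check_backup(Path("/backups/db.tar.gz"), recorded_digest, max_age_hours=24)` and alert on a non-empty result. A matching checksum only shows that the file has not changed since the checksum was recorded; it does not prove the backup can be restored, so periodic test restores are still needed.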
358 questions
151
votes
9 answers

Monday morning mistake: sudo rm -rf --no-preserve-root /

Please note: The answers and comments to this question contain content from another, similar question that has received a lot of attention from outside media but turned out to be a hoax question in some kind of viral marketing scheme. As we don't…
Jonas Bylov
  • 1,623
  • 3
  • 11
  • 5
122
votes
13 answers

Engineers are using explosives to remove hard rock outside our office building. What countermeasures should we take?

Our building is located approx. 100 meters from the explosive charges. They happen several times per day, and really shake the entire building a lot. This is going to go on for many days and the blasts are supposed to get stronger. Our server rooms…
Chris Dale
  • 1,553
  • 2
  • 12
  • 22
46
votes
6 answers

How to backup GPG?

What are the critical files I need to backup from GPG? I guess my private key would qualify of course, but what else?
jldupont
  • 1,779
  • 4
  • 23
  • 27
40
votes
20 answers

What's your checklist for when everything blows up?

Users can't get to their e-mail, the CEO can't get to the company's home page, and your pager just went off with a "911" code. What do you do when everything blows up?
Jon Galloway
  • 1,506
  • 1
  • 17
  • 20
37
votes
10 answers

Unmount a nfs mount where the nfs server has disappeared

Server A used to be an NFS server. Server B was mounting an export of that. Everything was fine. Then A died. Just switched off. Gone. Vanished. However that folder is still mounted on B. I obviously can't cd into it or anything. However umount…
Amandasaurus
  • 30,211
  • 62
  • 184
  • 246
35
votes
7 answers

What things do you look for when picking a server hosting company?

We are going through an RFP process of changing hosting companies for most of our servers (~10 fairly powerful workhorses and database servers). When the existing company was picked I wasn't at the company, nor have I worked with hosting companies…
ProfessionalAmateur
  • 917
  • 5
  • 17
  • 26
32
votes
4 answers

My server room has flooded

We recently went through a hurricane and our server room became flooded. Hooray for insurance. Anyway, I need to save as much data off one of the hard drives as possible. Yes, it was submerged for the better part of two days. Do I need to open…
29
votes
11 answers

Disaster recovery plan development best practices or resources?

I have been tasked with leading a project regarding updating an old and somewhat one-sided disaster recovery plan. For now we're just looking at getting the IT side of DR sorted out. The last time they did this they set their scope by making up a…
Laura Thomas
  • 2,825
  • 1
  • 26
  • 24
26
votes
5 answers

BBWC: in theory a good idea but has one ever saved your data?

I'm familiar with what a BBWC (battery-backed write cache) is intended to do - and previously used them in my servers even with good UPS. There are obviously failures it does not provide protection for. I'm curious to understand whether it actually…
symcbean
  • 19,931
  • 1
  • 29
  • 49
26
votes
2 answers

Retrieving an RSA key from a running instance of Apache?

I created an RSA keypair for an SSL certificate and stored the private key in /etc/ssl/private/server.key. Unfortunately this was the only copy of the private key that I had. Then I accidentally overwrote the file on disk (yes, I know). Apache is…
Nathan Osman
  • 2,705
  • 7
  • 31
  • 46
19
votes
9 answers

Architecture for highly available MySQL with automatic failover in physically diverse locations

I have been researching high availability (HA) solutions for MySQL between data centers. For servers located in the same physical environment, I have preferred dual master with heartbeat (floating VIP) using an active passive approach. The…
Warner
  • 23,440
  • 2
  • 57
  • 69
17
votes
9 answers

Documentation As-A-Manual vs. Documentation As-A-Checklist

I've had discussions in the past with other people in my department about documentation, specifically, level-of-detail and requirements. In their view, documentation is a simple checklist of Y things to do when X things go wrong. I disagree. I…
Avery Payne
  • 14,326
  • 1
  • 48
  • 87
15
votes
7 answers

Setting up a new backup scheme

I'm in the process of designing my first ever backup scheme. I'm completely new to managing data backup, and there are some concepts that I don't totally understand. Here's what I've got so far, and what equipment I'll be using. There are only three…
Citizen Chin
  • 221
  • 2
  • 5
15
votes
6 answers

How to recover from a drive failure in a RAID 5 configuration?

This morning a drive failed on our database server. The drive array (3 disks) is set up in a RAID 5 configuration. While we wait for a drive replacement we are preparing a recovery strategy. Users are continuing to work on the system, albeit very…
Philip Fourie
  • 537
  • 2
  • 6
  • 13
14
votes
4 answers

IT lead does not have a backup, DR plan in writing

This is a general management question to IT managers out there. We are a small firm with about 4 servers in our colo cabinet. No full-time IT manager. But we do have one person on monthly contract and I am having a terrible time getting him to…
Alex
  • 259
  • 1
  • 9