Questions tagged [disaster-recovery]

Disaster recovery and preparedness is an unfortunate aspect of systems administration. This tag should be used for help with planning, implementation and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can take what was a great day and turn it into a living nightmare when you are responsible for keeping systems others rely on available.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before one can properly prepare for a disaster it is necessary to understand the risks, bottlenecks and other critical components of the overall system, e.g. who controls the power, internet, etc. at your site. Understanding which aspects of disaster recovery are within one's control is very important when planning; if there is not someone on staff who can fix the power, HVAC, etc., make sure that the contact info for someone who can is written down somewhere. Having this information available before a disaster occurs will help to keep everyone calm, cool and on-task when something actually does happen.

Once the risks are assessed and a plan is created, print out physical copies, email it, and make sure everyone with admin-level access to the systems/datacenter has read it and is familiar with it. The best plan in the world is worthless if it is on a system that is down and cannot be easily restored without following the plan. After everyone is familiar with the plan, practice when possible; in many situations a full rehearsal may not be realistic, but if possible take advantage of planned downtimes or natural outages to go through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't Panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized, use checklists, keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant; update your plan as the environment changes.
  7. Check your backups; make sure they happen at regular intervals and that the data contained therein is still good.
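
As an illustration of point 7, the routine check could be sketched roughly as follows. This is a minimal sketch, not a complete backup monitor: the function names (`sha256_of`, `check_backup`), the 24-hour freshness window, and the idea of keeping a known-good SHA-256 alongside each backup are all assumptions made for the example.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path, expected_sha256, max_age_hours=24, now=None):
    """Return a list of problems found with one backup file (empty list = OK).

    expected_sha256 is assumed to have been recorded when the backup was
    written; max_age_hours is a hypothetical freshness threshold.
    """
    if not os.path.isfile(path):
        return ["missing: " + path]
    problems = []
    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        problems.append("stale: %.1f hours old" % age_hours)
    if os.path.getsize(path) == 0:
        problems.append("empty file")
    if sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

A matching checksum only shows that the file has not changed since the digest was recorded; it does not prove the backup can be restored, so periodic test restores are still needed.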
358 questions
12 votes, 3 answers

Battery Backed Write Cache

I recently got some U server price quotes and some of them include BBWC: What exactly does it do? Is it just for RAID configurations? If there is a power malfunction, isn't the data loss inevitable? Are there any performance improvements from it…
Dani
12 votes, 4 answers

How do I backup my TRAC installations?

We use separate TRAC instances as our ticket system for many projects and need to have them moved off site several times a day for disaster recovery. What is the best way to make this happen? Is there something similar to svnsync for subversion?
Mike Schall
12 votes, 3 answers

How to actually use mysql slave as soon the master is failover or got burnt

I have MySQL master-slave replication that works fine; I googled the whole net and MySQL site to find the standard procedure to make use of the replication but found nothing. It is as if admins are happy to have replication on, but when the time…
Jawad Al Shaikh
11 votes, 12 answers

What's the first thing you check when an untouched unix server starts going berserk?

So you have this neatly setup unix server and it's super fast and works swell and everything is great for months, and suddenly all kinds of weird errors start showing up for a variety of different services and none of them make a lot of sense on…
kch
11 votes, 5 answers

High server availability for a small business

After having a bit of scare with a server that wouldn't come up one morning, the higher ups have decided that the business needs a high availability / fail over setup. We have 5 main servers (4x Linux, 1x OpenBSD) all of which need to be running for…
9 votes, 3 answers

Database accidentally deleted with a bash script

Edit: a follow-up question: Restore mongoDB by --repair and WiredTiger. My developer committed a huge mistake and we cannot find our Mongo database anywhere in the server. He logged into the server, and saved the following shell under…
8 votes, 3 answers

Backing up VirtualBox VMs

Does anyone have a good complete strategy for backing up a bunch of virtual machines running under VirtualBox? I intend to run a handful of virtual machines on a single hardware platform and back them up nightly to external disks, which will be…
8 votes, 3 answers

Active Directory disaster recovery with DPM

I have a sort of catch-22 question here. Suppose I'm using Microsoft System Center Data Protection Manager (2010 or 2012, it works the same way) to backup, amongst various other things, my Active Directory environment (as in "the System State of my…
Massimo
8 votes, 1 answer

Recover data from SCSI hard disk

We've got an old server with SCSI hard disk. The server crashed last week and it isn't exactly known what hardware component is damaged. Since the server is due to be retired anyway we don't want to repair it but just restore the data from the SCSI…
Tom
8 votes, 1 answer

Recovery strategy for Master-Master replication

I have implemented a HA solution for mysql based on master-master replication. There is a mechanism on the front end part which guarantees that only one db will be read/written to at a given time (i.e. we only use replication for HA). I have…
8 votes, 1 answer

Does one failed drive + one single bad sector destroy an entire RAID 5?

During planning my RAID setup on a Synology Disk Station I've done a lot of reading about various RAID types, being this a great reading: RAID levels and the importance of URE (Unrecoverable Read Error). However, one thing remains unclear to…
adamsfamily
7 votes, 1 answer

How do I configure a stretch cluster without shared storage between two sites?

I am trying to redesign our IT infrastructure and seeking help in implementing DR solution for our company. I see that as 2 data centers in active-passive mode with the data replication. Currently we have two Windows Servers 2016 at the primary…
katyn12
7 votes, 2 answers

How to recover data from an Exchange 2013 database after a complete Active Directory loss?

Scenario: a single Exchange 2013 server in a Windows Server 2003 AD domain; one DC malfunctioned months ago and was dismissed (without proper demotion, no less); the other DC died yesterday and there are no available backups. Simply put, that AD is…
Massimo
7 votes, 2 answers

Hadoop HDFS Backup & DR Strategy

We are preparing to implement our first Hadoop cluster. As such we are starting out small with a four node setup. (1 master node, and 3 worker nodes) Each node will have 6TB of storage. (6 x 1TB disks) We went with a SuperMicro 4-node chassis so…
Matt Keller
7 votes, 3 answers

If DNS Failover is not recommended, what is?

As a followup question to his very popular question: Why is DNS failover not recommended?, I think it was agreed that DNS failover is not 100% reliable due to caching. However the highest voted answer did not really discuss what is the better…