Questions tagged [disaster-recovery]

Disaster recovery and preparedness are unfortunate aspects of systems administration. Use this tag for help with planning, implementation, and best practices related to recovering from a catastrophic event on a server or in a datacenter environment.

Recovering from an unplanned, catastrophic outage is a painful process whether you are managing a single server or an entire datacenter. Roof leaks, broken water lines, power outages and any number of other events can take what was a great day and turn it into a living nightmare when you are responsible for keeping systems others rely on available.

The key to recovering from any disaster is preparedness. Knowing the steps required to bring the network and systems back online is critical. Before you can properly prepare for a disaster, you need to understand the risks, bottlenecks, and other critical components of the overall system, e.g. who controls the power, internet, etc. at your site. It is just as important to know which aspects of disaster recovery are within your control: if no one on staff can fix the power, HVAC, etc., make sure the contact info for someone who can is written down somewhere. Having this information on hand before a disaster occurs will help keep everyone calm, cool, and on task when something actually does happen.

Once the risks are assessed and a plan is created, print out physical copies, email it, and make sure everyone with admin-level access to the systems or datacenter has read it and is familiar with it. The best plan in the world is worthless if it is stored on a system that is down and cannot be restored without following the plan. After everyone is familiar with the plan, practice it when possible. A full rehearsal may not be realistic in many situations, but take advantage of planned downtime or natural outages to walk through the recovery plan and refine it.

In summary, when a disaster happens:

  1. Don't Panic! Panic turns a debacle into a catastrophe every time.
  2. Plan ahead, understand the risks, and know what is within your control.
  3. Follow the plan but be flexible; a recovery plan is more of a jazz tune than a military march.
  4. Stay calm and organized: use checklists and keep notes.
  5. If you are working in a team or group, communicate and collaborate.
  6. Be vigilant: update your plan as the environment changes.
  7. Check your backups: make sure they happen at regular intervals and that the data they contain is still good.
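
As an illustration of the last point, here is a minimal Python sketch of an automated backup check. It is not part of the original tag wiki. It assumes that a SHA-256 checksum was recorded when each backup was written and that backups are expected at least once a day; the names `MAX_AGE_SECONDS`, `sha256_of`, and `check_backup` are hypothetical and chosen only for this example.

```python
import hashlib
import time
from pathlib import Path
from typing import List, Optional

# Assumed policy for this sketch: a backup older than one day counts as stale.
MAX_AGE_SECONDS = 24 * 60 * 60


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large backups do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(path: Path, expected_sha256: str,
                 now: Optional[float] = None) -> List[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    if not path.is_file():
        return [f"{path}: backup file is missing"]

    problems = []
    now = time.time() if now is None else now

    # Freshness check: compare the file's modification time with the policy.
    age = now - path.stat().st_mtime
    if age > MAX_AGE_SECONDS:
        problems.append(
            f"{path}: last modified {age / 3600:.1f} hours ago, older than allowed"
        )

    # Integrity check: compare against the checksum recorded at backup time.
    if sha256_of(path) != expected_sha256:
        problems.append(f"{path}: checksum does not match the recorded value")

    return problems
```

A matching checksum only shows that the file has not changed since the checksum was recorded. It does not prove the backup can be restored, so a periodic test restore is still worth doing.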
358 questions
1 answer

desperate attempt to head off RAID 0 failure, can i copy a disk with dd?

A colleague has been using a 10TB RAID 0 array to edit a film. He suddenly realized it's 98% full. The array has started seriously acting up, but cooperates grudgingly if you shut down all processes that might index this array and interact only with…
S. Imp
2 answers

Cloud definition met?

Note: this question has been reviewed after suggestions that it could be opinion-based. We recently suffered an outage due to a popular big provider's datacenter fire. Our public cloud instances hosted in that datacenter disappeared from the hosting panel. The…
1 answer

Disaster recovery process document for Google cloud Virtual Machine

Hello, we have been directed by auditors to provide a disaster recovery document for a Google Cloud Virtual Machine. Is this provided by Google? If so, how can I get it? I would appreciate resources I can use to create one for my virtual machine on…
0 answers

How to make backup for the Local Server

I have a critical server that runs the local company system and holds system backups, company files, and business documents of critical importance. I'm planning to create a network-attached server, however this server whether its…
1 answer

Proxmox is blocked from fully booting by ifupdown2 start job

I have a Proxmox server with ifupdown2 installed. Recently, when trying to reboot the entire server, Proxmox failed to get to the console; it gets stuck on the replication runner failing and the ifupdown2 service not starting up. I have gained access to the server…
0 answers

Can't reconnect VPS Essential OVH after removing set_hostname from cloud-init

I have a VPS Essential at OVH with Ubuntu 18.04 LTS installed. I just removed the set_hostname module from the cloud_init_modules in /etc/cloud/cloud.cfg and rebooted the system, and now I cannot connect to it. This is the line I removed, but from other…
4 answers

Exchange 2007 - Mailbox Database Recovery

Exchange 2007 edb: Can we restore an Exchange edb (First storage group\mailbox database.edb) to another Exchange server? Do I just copy the edb to the new Exchange server, delete the first storage group\mailbox database.edb, and replace it with this…
1 answer

Azure VM off-site backups via CLI

I'm currently backing up Azure VMs via Azure CLI: create resource group: az group create -n backup-resource-group -l uksouth create recovery services vault: az backup vault create --resource-group backup-resource-group --name backup --location…
5 answers

Disaster recovery options for my lone server running W2K3 std

What is the best disaster recovery option for my machine running W2K3 std edition? I have already imaged my machine using Clonezilla and I have also taken a backup using Windows Automated System Recovery tool. However, I am worried neither of these…
1 answer

Does Barman need ssh connection setup when installed on the same host as the PostgreSQL host?

I'm trying to configure Barman (pgbarman) to work alongside the PostgreSQL database on the same host. I don't want to install Barman on a separate host, but I can't find any documentation about such an approach. All of the resources on the internet…
2 answers

Disaster Recovery for VMWare ESX

We need to formulate a disaster recovery plan for a VMware installation. The components are two ESX hosts and a NAS unit. We're wondering what everyone here uses for their farm. We have a few ideas but wanted to compare with other system admins. The…
1 answer

Disaster recovery. MDADM/LVM2 Some advance but stuck on final mount

We made a stupid upgrade on a running server using the wrong repositories, and the system became totally unbootable. On the system, a SLES 11, we used an openSUSE repository to upgrade, and everything went horribly wrong. It boots now only in (repair…
1 answer

MySQL recovery procedure

After the server died, I have the "old" HD (it is external now) with the sock file and data files in the mysql directory. Where should I look for information on how to restore the data in a freshly installed MySQL? What will happen if I replace files and folders brutal…