
I have been tasked with leading a project to update an old and somewhat one-sided disaster recovery plan. For now we're just looking at getting the IT side of DR sorted out. The last time they did this, they set their scope by making up a single disaster (the data center flooded) and planning for it to the exclusion of all other disaster types. I would like to take a more well-rounded approach. I know this is a solved problem; other organizations have written DR plans.

Our plan is to take our IT DR plan forward and say, "Hey, this is what we want in a DR plan for IT; does it mesh with what the rest of the University is doing? Are there restored-service priorities you'd like changed?" We have a pretty good idea what the rest of the plan is, and we're expecting this to go over well.

What I am looking for is guidance on how to scope a DR plan and what questions I should be thinking about. Do you have favorite resources, books, or training related to DR plan development?

Laura Thomas

11 Answers

Make sure you have an emergency contact roster, aka a recall roster.

It should look like a tree and show who contacts whom. At the end of a branch, the last person should call the first and report anyone who could not be contacted.

(This can be co-ordinated through HR, and used for any type of disaster)
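
If it helps to keep the roster in a machine-readable form, here is a minimal sketch in Python of a call tree you could maintain and print for the offline copy. All names and phone numbers below are invented placeholders:

    # Minimal recall-roster sketch: each person calls the people nested
    # under them; the last person on a branch reports back to the top.
    recall_roster = {
        "name": "IT Director", "phone": "555-0100",
        "calls": [
            {"name": "Network Manager", "phone": "555-0101",
             "calls": [
                 {"name": "Network Tech A", "phone": "555-0102", "calls": []},
                 {"name": "Network Tech B", "phone": "555-0103", "calls": []},
             ]},
            {"name": "Sysadmin Lead", "phone": "555-0104",
             "calls": [
                 {"name": "Sysadmin A", "phone": "555-0105", "calls": []},
             ]},
        ],
    }

    def print_roster(person, depth=0):
        """Print the tree so it can be kept as a one-page offline document."""
        print("  " * depth + f"{person['name']} - {person['phone']}")
        for contact in person["calls"]:
            print_roster(contact, depth + 1)

    print_roster(recall_roster)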

Joseph Kern
    We had been thinking of, at the very least, a list of all faculty, staff, and students placed offsite daily. Having a tree structure for faculty and staff is a great idea. – Laura Thomas Jun 18 '09 at 18:14

An excellent source of information is the Disaster Recovery Journal.

Community resources available include the current draft of their Generally Accepted Practices (GAP) document, which provides an excellent outline of the process and deliverables that constitute a solid business continuity plan and process. Also available are several white papers covering various DR/BC topics.

The process seems daunting, but if approached systematically with a good outline of where you would like to end up (like the DRJ GAP document), you can ensure that you optimize the time invested and maximize the value of the end product.

I find their quarterly publication to be interesting and informative as well.

jnaab

We could create a nice wiki from this post once everyone has added their own ideas. I understand there's a bunch of guidance out there to follow, but some of us have specific priorities when it comes to recovery. To start, here's mine:

Make sure you have off-line/remote documentation of your network
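
As an illustration only, a nightly job along these lines could archive that documentation and push a copy offsite; both paths are placeholders, not a recommendation of any particular layout:

    # Hypothetical nightly job: archive the network documentation and copy
    # the archive to an offsite/remote mount. Both paths are placeholders.
    import datetime
    import pathlib
    import shutil

    DOCS_DIR = pathlib.Path("/srv/network-docs")        # assumed local docs location
    OFFSITE_DIR = pathlib.Path("/mnt/offsite/dr-docs")  # assumed offsite mount

    stamp = datetime.date.today().isoformat()
    archive = shutil.make_archive(f"/tmp/network-docs-{stamp}", "gztar", DOCS_DIR)

    OFFSITE_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copy2(archive, OFFSITE_DIR)
    print(f"Copied {archive} to {OFFSITE_DIR}")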

l0c0b0x

With DR the basic things are your RTOs (Recovery Time Objectives) and RPOs (Recovery Point Objectives), which roughly translate as "how long can we afford to spend getting it back, and how much data can we afford to lose". In an ideal world the answers would be "none and none", but a DR scenario is an exceptional circumstance. These really should be driven by your customers; since you're starting from the IT angle you can make best guesses, but be prepared to adjust up or down as required. Aiming for as close to "none and none" as you can reasonably get is good, but you'll need to be able to recognise when the point of diminishing returns kicks in.

These two factors might be different at different times of the year, and different on different systems.
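
To make that concrete, here is a toy sketch (all figures invented, not recommendations) of recording RTO/RPO targets per system and sanity-checking that the backup interval can actually meet each RPO:

    # Toy example: per-system RTO/RPO targets in hours, plus the current
    # backup interval. A backup interval longer than the RPO means the
    # objective cannot be met. All numbers are invented placeholders.
    targets = {
        # system:             (rto_h, rpo_h, backup_interval_h)
        "student email":       (24,    4,     1),
        "learning mgmt sys":   (8,     1,     4),   # interval > RPO -> problem
        "campus website":      (4,     24,    24),
    }

    for system, (rto, rpo, interval) in targets.items():
        status = "OK" if interval <= rpo else "BACKUPS TOO INFREQUENT"
        print(f"{system}: RTO={rto}h, RPO={rpo}h, backup every {interval}h -> {status}")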

I like the more well-rounded approach; it's tempting to list out the events that can lead to a DR scenario, but these really belong more to a risk analysis/mitigation exercise. With DR the incident has already happened, and specifics of what it was are less relevant (except perhaps in terms of affecting availability of DR facilities). If you lose a server you need to get it back, irrespective of whether it was hit by lightning, accidentally formatted, or whatever. An approach focussed around the scale and spread of the disaster is more likely to yield results.

One approach to use on customers, if you find that they're reluctant to get involved, is to ask them DR questions from a non-IT angle. Asking what their plans are if all their paper files go up in flames is an example here. This can help with getting them more involved in the broader DR thing, and can feed useful info into your own plans.

Finally, testing your plan regularly is crucial to success. It's no good having a beautiful DR plan that looks great on paper but doesn't meet its objectives.

Maximus Minimus

Actually, the "single incident" development model is a good idea as the first step. One reason is that it makes the planning exercise more realistic and focused. Plan for the flood, all the way. Then suppose a different incident (say, a long-term power outage), apply that plan to it, and fix what breaks. After a few iterations, the plan should be relatively robust.

Some thoughts:

  • Be sure to account for unavailable people. If there is a flood, you can't assume that all relevant staff are available. Someone might be on vacation, or injured, or dealing with their family.
  • Plan for communication problems and weaknesses. Have multiple numbers and multiple modes.
  • The DR plan needs a chain of command. Knowing who makes decisions is critical.
  • The plan needs to be widely distributed, including offsite and off the grid. It needs to be accessible during the disaster!

tomjedrz

Where I work, I've been involved in running a large-scale DR test in each of the last two years. We've discovered that testing our services, people and processes in "realistic" situations has been useful. Some lessons learned (perhaps obvious), in the hope you find them useful:

  • Untested services, despite what they've written in their DR documentation, usually have implicit, catastrophe-inducing dependencies. Shaking them out with a realistic test or two is a useful and measurable output of a DR preparation process.
  • Untested people tend to think that their systems are okay and they'll "know what to do" in a disaster situation. Shaking them up with a realistic test or two is great.
  • Untested processes fall apart rapidly in actual emergency situations. In particular, complex escalation processes focused mainly on informing upper management break in spectacular ways. Lightweight processes focused on the needs of operations staff and other responders, central sources of information about the unfolding emergency, explicit transfer of responsibility and 'everyday' emergency response procedures work best.

I guess what I'm getting at is that you should try not to make everything about your DR planning process theoretical. Push for permission to actually break things and thus get hard data on your organization's preparedness. That will require some serious support from management, of course, but it can be wonderfully focusing for the business to spend a couple of days really rehearsing for the worst.

Cian


There are several standards from the British Standards Institution (BSI) that focus on continuity management and disaster recovery.

  • BS 25999-1:2006 Business continuity management, Part 1: Code of practice
  • BS 25999-2:2007 Business continuity management. Specification
  • BS 25777:2008 Information and communications technology continuity management. Code of practice
chmeee

It may seem obvious, but to go along with the offsite documentation above, make sure you have offsite (preferably out of the region) backups. This could be an online storage service or a place to take tapes to.

I say preferably out of the region because I come from an area where we don't have many natural disasters annually, but if/when we do have one, it is on a regional scale with mass destruction (earthquakes, volcanoes). It's all good to have your backup in a safety deposit box at the bank, until your bank is under liquid hot magma (/Dr. Evil Voice).
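
As one illustration of the "online storage service" option, here is a sketch assuming an S3-compatible service and the boto3 library; the bucket name, region, and file path are placeholders, and credentials are assumed to be configured in the environment:

    # Minimal sketch: push a nightly backup to an out-of-region object store.
    # Bucket, region, and path are placeholders, not real resources.
    import datetime
    import boto3

    backup_file = "/backups/nightly/full-backup.tar.gz"   # assumed backup artifact
    bucket = "example-university-dr-backups"               # placeholder bucket
    key = f"{datetime.date.today().isoformat()}/full-backup.tar.gz"

    s3 = boto3.client("s3", region_name="us-west-2")       # pick a region far from home
    s3.upload_file(backup_file, bucket, key)
    print(f"Uploaded {backup_file} to s3://{bucket}/{key}")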

Something that I have read about is agencies sharing the cost of maintaining a hot site for when the big one does hit. They enact plans for restoring both companies' mission-critical systems to the hot site using virtualization and such, and then share staffing at a make-sure-all-the-lights-are-blinking level. Just a thought.

RascalKing

For books, there's Disaster Recovery Planning by Jon William Toigo, now in its 3rd edition, with a 4th-edition blook (blog + book) on the horizon.

pgs

Laura,

Here is a link from SQLServerPedia that covers the basics of DR.

http://sqlserverpedia.com/blog/sql-server-backup-and-restore/disaster-recovery-basics-tutorial/

Santosh Chandavaram

Also read up on "Business Continuity"

freiheit