Answers for Nick - Keep in mind this methodology is for low-cost small-business use, purchasing name-brand pre-built systems for workstations. It's a scenario that makes use of resources that would otherwise be wasted. When users leave for the day their workstations are rebooted into the cluster for automated build and testing. The backup method I put forth uses the extra disk space in each workstation, spreading redundant copies across multiple machines.
...Joe, what do you mean with live system? The production servers?
Yes. RAID is for reducing downtime, so it belongs on a 24/7 production system. It has much less value for a backup system that only needs to run during the backup data transfer, or for workstations that only "need" to be on during the day.
...So in the option you describe, the plan is to journal the public data (encrypted) on each workstation?
Yes. It could be public shared data or cross-workstation data. Journal/snapshot the changes hourly on the RAID system between backup transfers to another medium, which usually happen twice a day, noon and nightly. (Keep as much journaled backup as possible on the production system, up to 80% of disk space; beyond that, performance may take a hit.) This way users can easily recover overwritten or deleted files without talking to a sysadmin: they go to their /username/date/time folder on the RAID production system, use standard diff tools, have access to all of the day's snapshots, etc.
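The hourly journal/snapshot idea above can be sketched with hard links, so each snapshot only costs the space of the files that actually changed since the last one. This is a minimal illustration under assumptions, not our actual scripts: the `hourly_snapshot` function and its path layout are invented for the example, and it assumes a filesystem that supports hard links.

```python
import filecmp
import os
import shutil
from datetime import datetime

def hourly_snapshot(source, backup_root, prev_snapshot=None):
    """Create a date/time snapshot of `source` under `backup_root`.

    Files unchanged since `prev_snapshot` are hard-linked instead of
    copied, so an hourly snapshot costs only the changed files' space.
    Returns the path of the new snapshot directory.
    """
    now = datetime.now()
    snap_dir = os.path.join(backup_root,
                            now.strftime("%Y-%m-%d"),   # the /date/ part
                            now.strftime("%H%M%S"))     # the /time/ part
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        dest_dir = os.path.join(snap_dir, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_dir, name)
            prev = (os.path.join(prev_snapshot, rel, name)
                    if prev_snapshot else None)
            if prev and os.path.exists(prev) and \
                    filecmp.cmp(src, prev, shallow=False):
                os.link(prev, dst)      # unchanged: hard link, no extra space
            else:
                shutil.copy2(src, dst)  # new or changed: real copy
    return snap_dir
```

In practice the same effect is what `rsync --link-dest` gives you; the sketch just shows why a day of hourly snapshots doesn't need a day's worth of disk.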
Encryption is in case a workstation is stolen and/or to protect against "prying eyes". We hire good developers, so we trust them not to try to decrypt it. They could damage the business in many other ways anyway; trust is required.
...Do those snapshots go to the system with 5 external disks daily, or do you take the daily backup off-site on one of the 5 disks?
Traveling data is always on tape; tape survives shock. Disk is faster for seeking, which is why we prefer disks for the "journal" backup. Tapes hold full or incremental backups, usually with no journals/snapshots. For our user base, most data recovery happens during the day: "I need the file the way it was before lunch." "I just deleted the wrong file."
For restores from previous days, one version per day is usually sufficient granularity. If more journaling is needed, the backup schedule is adjusted, or a revision control system is implemented and the revision tree is backed up.
Five disks is an arbitrary number chosen to show the relative cost against a tape-only system. Five separate disks with copies of the same data have much higher redundancy than any small-business RAID system. If the workstations have adequate space, one dedicated backup disk may be sufficient (given that multiple copies are on workstations and tape).
At a set point in time, data is transferred off the production server's journaled backup partition to a backup system with external drive(s) connected, making 2-5 copies: one on internal disk, one on external disk, and one to tape. The workstations are backed up to the backup systems, then each workstation receives a copy of the shared production system's backup before shutting down. At no time are there fewer than three physical copies of backed-up data. Whether to keep 3 copies, 5 copies, etc. is a redundancy question that needs to be modeled for each business and each type of data. You might want 5 copies of invoices, 7 copies of contracts, only 2 copies of a standard graphic, and a single copy of the current test build executables, etc.
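The "how many copies" question can be modeled crudely. Here is a toy sketch; the `p_data_loss` function and the 5% per-copy failure figure are assumptions for illustration, and it treats copy failures as independent, which co-located copies are not (fire or theft takes out neighbors together), so treat the numbers as optimistic lower bounds.

```python
def p_data_loss(copies: int, p_copy_fail: float) -> float:
    """Chance that every copy is unreadable at restore time,
    assuming each copy fails independently with probability
    `p_copy_fail` (optimistic for copies in the same building)."""
    return p_copy_fail ** copies

# hypothetical 5% chance that any one copy is unreadable when needed
for n in (2, 3, 5, 7):
    print(f"{n} copies -> p(loss) = {p_data_loss(n, 0.05):.2e}")
```

Even this toy model shows why the copy count should vary by data type: going from 2 to 5 copies buys orders of magnitude, which is worth it for invoices and contracts but wasted on a rebuildable test binary.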
...Also, are the snapshots on each workstation identical, or do they all sum up to the complete public data?
Either. It depends on available space and needs. Our purchased systems always come with disks much larger than needed for the average user (developers may make use of the extra space, but the receptionist has no need for a 500 GB+ disk).
...What do you think of those external storage hubs like linksysbycisco.com/US/en/…?
Don't know. We prefer machines that can be put to another use: backup server today, someone's workstation tomorrow, offloading copies of virtual machines during a major upgrade for quick failover, etc. That's one of the reasons for the external disk: to keep all workstations as similar as possible, so the "backup server" has the same 500 GB+ disk that every workstation has. It's the same physical machine, purchased in sets, so over time there will be differences in CPU, memory and disk based on the deal du jour. Machines are allocated based on performance needs, and swapping in a new machine to increase memory takes less overall sysadmin time than installing a memory chip in a perfectly running machine. If we keep CPU and video (AMD64, Nvidia) relatively consistent, machine swaps are painless.
The production server uses two RAID cards, one running 10k RPM SCSI drives and another running 7200 RPM SCSI drives, for maximum performance. A $60 SATA terabyte drive used for backup holds as much as thousands of dollars' worth of SCSI drives, RAID controllers, hot-swap rack case, etc. Development servers are usually adequate with SATA RAID: more space but less performance. Since there are fewer simultaneous users, the performance difference is usually negligible.
In simple terms -
- Production system - active shared data and OS on RAID "primary data partition"
- Production system - hourly journaled snapshots since the last backup on RAID "backup data partition"
- Workstation system - active data and OS on non-RAID "primary data partition"
- Workstation system - backup data on non-RAID "backup data partition"
Average workstations are purchased with 500 GB+ drives and use ~40 GB max for multi-boot Windows/Linux/BSD/OpenSolaris partitions. The rest is the backup partition, which contains backup copies of the other workstations' OSes, the production server's OS backup, and the production server's journaled and/or incremental data backups.
If any two machines in the building die, recovery takes minutes. There are at least three physical copies on site of each OS, and usually we have enough unused workstation + external drive space to keep a week or two of incremental backups from the production server and at least two copies of the last full backup.
We can lose the RAID system, the tape and two workstations without losing any data, and be up and running within minutes (albeit without the RAID until it's repaired). The data is accessible "instantly". This has saved hours during failures, which always seem to happen at the worst possible business time: power supplies invariably fail right before an important sales meeting/demo, and RAID systems always seem to fail in the morning, never on a Friday evening when you could fix them and be back up by Monday morning.
The docs describing the backup process are company property. I'll try to re-write them for public viewing with diagrams and use cases. I've used this general methodology for many years now, and it has saved time and data when standard tape-only systems failed. I've seen failures on IBM, Compaq, HP and Dell systems using DLT, LTO, etc. A common failure mode is no errors during the backup, but corrupted data when you try to restore. Always test restores. That's one of the reasons we use an online journal backup, which can easily be tested daily. Since the users get used to it, we have never gone more than a week without someone using the journaled backups, and we almost never touch the tapes. The tapes are in case the building burns down.
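"Always test restores" can be partially automated by checksumming a restored tree against the live one and flagging anything missing or corrupted. A sketch under assumptions: the helper names are invented for the example, and it assumes both trees are mounted locally.

```python
import hashlib
import os

def checksum_tree(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # read in chunks so large files don't sit in memory
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums

def verify_restore(original_root, restored_root):
    """Return the relative paths that are missing or corrupted
    in the restored tree compared to the original."""
    orig = checksum_tree(original_root)
    rest = checksum_tree(restored_root)
    return {path for path, digest in orig.items()
            if rest.get(path) != digest}
```

Run after a test restore to a scratch directory, an empty result set means the tape (or journal) actually gave back what was written; this is the kind of check that catches the "backup reported success, restore is garbage" failure before you need the data.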