Best practices for backup on a massively parallel grid system

Question

I work in the research group of a large company. We do a lot of work on a grid processing system with many nodes (more than 200; I'm not sure exactly how many) and several hard drives, holding more than 1000 TB of data.

Most of this data can be reproduced, but that takes time. A lot of the data is code, which is stored in separate RCS repos that can have their own backup, but the working copies are, of course, on the normal user drives.

Can someone point me at a best-practices document, or something about how most companies go about protecting this much data?

Thanks

Brian - it looks like you haven't accepted many answers to your questions? That's how this site is supposed to work; make sure you read through the FAQ and get a feel for the reputation system. — mfinni, Jun 21 '12 at 14:04
I understand the system. I haven't accepted any answers because I've never gotten more than one answer on any of my questions, and it wasn't satisfactory... — Brian Postow, Jun 21 '12 at 16:32

score 3 · Answer 1 · answered Jun 20 '12 at 17:10

1. Hire a backup admin or engineer.
2. Give him or her your requirements and budget. (This may be an iterative process.)
3. Do what he or she says.

There's a lot to designing an effective backup system for your business needs. You might snapshot the data to other disks and then mirror it off-site (if you have another site), or send it to tape, or just send it to tape directly from your nodes. There may be concurrency issues with data backed up at different times; perhaps your application needs to export or quiesce first? We don't know, because you didn't tell us. There are a lot of technical questions and issues.

And the first thing that needs to be addressed is your actual business needs: what's your RTO (how long can you be down until your data is restored) and your RPO (how much data can you afford to lose between backup runs)? Does this need to be part of a DR or business continuity plan, or, if the building burns down, do you just not care about your data anymore?
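
To make the RTO and RPO questions concrete, here is a minimal Python sketch of the kind of check a backup engineer would run early on. Only the data size (about 1000 TB) comes from the question; the restore throughput, backup interval, and both targets are made-up placeholders, not recommendations, and the function names are hypothetical.

```python
def restore_hours(data_tb: float, throughput_gb_per_s: float) -> float:
    """Hours to restore data_tb terabytes at a sustained throughput (decimal units)."""
    seconds = (data_tb * 1000) / throughput_gb_per_s  # 1 TB = 1000 GB
    return seconds / 3600


def meets_targets(data_tb, throughput_gb_per_s, backup_interval_h, rto_h, rpo_h):
    """Return (rto_ok, rpo_ok) for a proposed backup design."""
    # RTO: a full restore has to finish within the tolerated downtime.
    rto_ok = restore_hours(data_tb, throughput_gb_per_s) <= rto_h
    # RPO: in the worst case, everything written since the last run is lost.
    rpo_ok = backup_interval_h <= rpo_h
    return rto_ok, rpo_ok


if __name__ == "__main__":
    # Assumed values: 2 GB/s restore rate, nightly backups, 72 h RTO, 24 h RPO.
    print(f"Full restore: {restore_hours(1000, 2.0):.1f} h")
    print(meets_targets(1000, 2.0, backup_interval_h=24, rto_h=72, rpo_h=24))
```

Plugging in your own numbers shows quickly whether a full restore is even feasible within your RTO, or whether the design has to prioritize a subset of the data.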

mfinni


I was looking more for what kind of things other companies do... Because my higher-ups are going to say "budget = $0", I'm trying to figure out what the budget SHOULD be... – Brian Postow Jun 21 '12 at 16:30
What "other companies do" is bound by their RPO/RTO and other requirements - so unless you describe your needs better, you're not going to get specific answers that are relevant to you. There is most definitely no single "best practices" document for backups. – mfinni Jun 21 '12 at 17:01
Example - how long would it take you to recreate the data in your environment today, as you mention in your question? Say it's 12 hours. Imagine that (for example) recovering that data from nearline disk would take 2 hours. Was the investment in disk and backup software (and your time) worth the 10 hours saved? If you might have to do it monthly, would it be worth the 120 hours saved every year? – mfinni Jun 21 '12 at 17:04
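
The arithmetic in the comment above can be sketched in a few lines of Python. The 12-hour and 2-hour figures are the hypothetical ones from the comment; the hourly cost and annual backup cost in the break-even check are placeholders you would replace with your own numbers, and the function names are illustrative only.

```python
def hours_saved_per_year(recreate_h: float, restore_h: float, incidents_per_year: int) -> float:
    """Hours saved per year by restoring from backup instead of recreating the data."""
    return (recreate_h - restore_h) * incidents_per_year


def backup_pays_off(recreate_h, restore_h, incidents_per_year,
                    cost_per_hour, annual_backup_cost) -> bool:
    """True if the value of the time saved covers the yearly cost of the backup system."""
    saved = hours_saved_per_year(recreate_h, restore_h, incidents_per_year)
    return saved * cost_per_hour >= annual_backup_cost


if __name__ == "__main__":
    print(hours_saved_per_year(12, 2, 1))    # one incident per year
    print(hours_saved_per_year(12, 2, 12))   # one incident per month
    # Assumed: downtime costs 500 per hour, backups cost 40000 per year.
    print(backup_pays_off(12, 2, 12, cost_per_hour=500, annual_backup_cost=40000))
```

This is one way to turn "budget = $0" into a number: estimate what an hour of lost grid time costs, how often you expect to need a restore, and see what annual spend that justifies.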
