32

My organization recently bought a storage system. It has 1.5 petabytes of capacity with RAID 6, and there is an online synced mirror in a physically different location.

The system allows rollback and file recovery, by default for up to 30 days, but this can be increased.

There is an ongoing discussion about whether we need some kind of extra backup for data living only on the storage.

The system has a very good level of redundancy, including geographical redundancy, and it allows rollback to some extent, which means we can recover old or accidentally deleted data for up to the defined period (30 days by default).

Given this scenario does it still make sense to have a "traditional" backup? By traditional, I mean a dedicated backup system, with snapshots that we can retrieve in case something goes wrong.

Do we really need it? Am I missing something? Am I just thinking in the traditional way and being overzealous?

nsn
  • If it also allows you to replicate the snapshots off to another device then you can overcome the problems that Sven mentions in his answer. – Drifter104 Oct 06 '15 at 09:53
  • 4
    Definitely related, but perhaps not an outright duplicate due to geographic separation and snapshot rollback capability: [Why is RAID not a backup?](http://serverfault.com/q/2888/58408) – user Oct 06 '15 at 10:58
  • As long as you also remove the "delete" key from every keyboard in the place, you're golden ;-) – Tom Newton Oct 06 '15 at 19:40
  • @TomNewton naaa.. people now use the mouse. You can still right click -> delete :) - Anyway, I think you missed the part where I say the system supports file recovery for 30 days. – nsn Oct 06 '15 at 19:43
  • 1
    Certainly better than not having that. I would still prefer that the backups live on a medium away from live "people mistakes". Still, you know the answer to your question, but it involves putting a price on your data. Good luck. – Tom Newton Oct 06 '15 at 19:50
  • 7
    Does your "rollback" capability also cover the changes to volumes? For example, will it be able to recover if somebody removes all volumes? – vhu Oct 06 '15 at 20:07
  • Can you provide the actual type of device and storage system you're using? – ewwhite Oct 07 '15 at 12:48
  • @vhu that's an interesting question that I can't answer. I will check. – nsn Oct 08 '15 at 07:27

5 Answers

40

What you describe is essentially a geographically distributed RAID, and RAID was never a backup.

Online sync usually means everything you do on the primary storage gets immediately replicated to the backup system, including operations like the deletion of (all) snapshots and/or volumes by an attacker or simply an admin error.
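To make the point concrete, here is a toy model, not any vendor's actual behaviour. `MirroredStore` and its methods are hypothetical names made up for illustration. Every operation on the primary is applied to the mirror at once, so a mistaken delete removes both copies, while a detached point-in-time copy is unaffected.

```python
import copy


class MirroredStore:
    """Toy model of synchronous replication (hypothetical, illustration only)."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, name, data):
        # Every write lands on both sites.
        self.primary[name] = data
        self.mirror[name] = data

    def delete(self, name):
        # Deletions are replicated just as faithfully as writes.
        self.primary.pop(name, None)
        self.mirror.pop(name, None)

    def take_backup(self):
        # A detached copy that later operations on the store cannot touch.
        return copy.deepcopy(self.primary)


store = MirroredStore()
store.write("payroll.db", b"important")
backup = store.take_backup()

store.delete("payroll.db")  # admin error or attacker

print("payroll.db" in store.primary)
print("payroll.db" in store.mirror)
print("payroll.db" in backup)
```

After the delete, only the detached copy still holds the file, which is the property a backup is supposed to provide.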

Sven
  • 3
    Or, since both storage systems probably use the same OS, a software bug could destroy the data. It's not probable (an admin error is more likely), but it's possible. – Sunzi Oct 06 '15 at 11:20
  • 8
    True. The goal is that no one will be able to manage the automated snapshots. That should give some level of resilience against mistakes. Of course, one can also delete a backup by mistake. – nsn Oct 06 '15 at 12:10
  • 2
    @nsn there are many other correlated failures such as bugs in the device software or bugs in your management scripts. Without a backup somewhere else you are entrusting your job to the vendor... Are you willing to do that? Also quantify the damage in case of loss. Maybe the answer depends on how valuable the data is. Is the company gone without it? – usr Oct 07 '15 at 09:05
  • 2
    @nsn *"Of course one can also delete a backup by mistake."* Yes, but it becomes substantially more difficult when the backup is taken offline and placed in secure, offsite storage, for example. – Rob Moir Oct 07 '15 at 11:51
7

The 30-day rollback is a great capability, but what if "critically-important-file-xyz" became corrupt or damaged and this was not detected until 31+ days later? This situation is the difference between backup and archival schedules, and your description does not mention the latter. Archives are usually stored on very low-cost tape. There is also no information on whether the business has regulatory or other requirements to retain data for longer than 30 days, which is frequently the case.

If this is not the case for your situation, then you should be good.

200_success
  • 3
    Yes, true. The 30 is just the default; we can set other values. Anyway, offline storage also costs money and doesn't last forever. There will always be a day n+1 – nsn Oct 06 '15 at 22:01
  • 2
    I like to have rolling 30 days, plus monthly for the last year, plus a yearly. I've had a number of files (that were important and old) vanish and not be detected within the rolling time period. The yearly backups can be life savers. – Brian Knoblauch Oct 07 '15 at 13:52
  • @BrianKnoblauch: Yes, that sort of scheme is a good idea, for either online snapshots or offline backups. – Ben Voigt Oct 07 '15 at 19:47
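The tiered retention scheme from the comments above (rolling 30 days, monthly for the last year, plus yearly) could be sketched roughly as follows. This is only one possible reading of it: the function name `snapshots_to_keep`, its parameters, and the choice of "first snapshot of the month/year" as the one to retain are all assumptions made for illustration.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, today, daily_days=30, monthly_months=12):
    """Return the set of snapshot dates to retain.

    Assumed policy: every snapshot from the last `daily_days` days, the first
    snapshot of each month for the last `monthly_months` months, and the first
    snapshot of every year.
    """
    keep = set()
    first_of_month = {}
    first_of_year = {}
    for d in sorted(snapshot_dates):
        # Remember the earliest snapshot seen in each month and year.
        first_of_month.setdefault((d.year, d.month), d)
        first_of_year.setdefault(d.year, d)
        # Rolling window: keep everything recent.
        if 0 <= (today - d).days < daily_days:
            keep.add(d)

    # Monthly tier: one snapshot per month within the monthly horizon.
    for (year, month), d in first_of_month.items():
        months_ago = (today.year - year) * 12 + (today.month - month)
        if 0 <= months_ago < monthly_months:
            keep.add(d)

    # Yearly tier: one snapshot per year, kept indefinitely.
    keep.update(first_of_year.values())
    return keep


today = date(2015, 10, 7)
snapshots = [today - timedelta(days=n) for n in range(800)]  # one per day
kept = snapshots_to_keep(snapshots, today)
print(len(snapshots), "snapshots taken,", len(kept), "retained")
```

The same selection logic applies whether the copies are online snapshots or offline backups; only the storage medium differs.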
6

Having geographically separated machines both having the data is good.

What happens when you have multiple failures involving both or all of your sites? A fire at one, theft of the servers at the other? Or there is a problem with the line between them, then the primary location's server goes out, and the HD controller goes ape and writes junk? Or some insider performs malicious acts on both? Or the FBI confiscates your servers at both locations because of suspected (you would never, but maybe you are co-hosted in a datacenter with schmucks). Or... I am reminded of several high-profile "cloud" outages where everything was redundant and analyzed to the nth degree, but still, things can go wrong. I'll grant you these are all unlikely, but you've acknowledged that unlikely things can happen.

So, it comes down to this: how important or valuable is that data? What will the organization do if it ends up gone?

  • 3
    If you have two locations and you lose both then you've probably also lost your backups. Most of this answer is an argument for replication across more than two sites, not an argument in favour of backing up. – Ben Oct 07 '15 at 06:52
  • 2
    That goes on forever. Each time you add a level of redundancy you can always expect it to fail (either geographical, or just disks). If you have n redundant disks you can always ask "what if n+1 break?". You can have a fire in your server room and in your backup room too. An inside job can also attack both. There are no 100% failsafe systems. The question here is whether such a setup can be equivalent to a "traditional" server + backup – nsn Oct 07 '15 at 07:43
  • 1
    I think @nsn makes a great point, but I also think that the lesson from many of these answers is that having your backup exist on a separate technological infrastructure from your storage media is a good idea, because it makes it much harder for a technological failure to propagate, and harder for a malicious actor to infect both (but merely harder). We regularly see bugs in redundant systems that cause failure cascades. Having a different solution/vendor involved helps. This hedging still goes on, but I consider that level of technological separation to be reasonable caution in most cases. – Nick Oct 07 '15 at 19:36
  • @Nick, I think you have a very valid comment. I would make it an answer. – nsn Oct 07 '15 at 21:04
4

The question here seems to be about just how disconnected and geographically distinct a replicated copy of your data needs to be before it's a backup and not high availability/redundancy infrastructure. My gut is that you're close, but still need a backup.

To bring together (cherry-pick) some thoughts in the other answers and comments, you can go really far down the path of "well, X technology doesn't cover Y disaster scenario, so it's not a backup," and at some point you need to decide what's reasonable for you, which seems to be why you're asking. My feeling on this, and I think the feeling of many of the commenters, is that your backup needs to exist on a separate technological infrastructure from your in-use data so that failures, accidents, and malicious actions either can't propagate or have a much higher hurdle to cross. An example given in the comments is someone deleting the volumes, which is a valid, not pie-in-the-sky scenario in my opinion.

Additionally, here is a real-world example from my work. The university I work for (though thankfully I don't manage this infrastructure) has some serious high-availability virtualization infrastructure that supports a lot of the campus facilities. It's at multiple sites, but is all running on one vendor's platform. An obscure bug cropped up one day that caused a failure cascade: it first took down a single server, then when the load shifted, it took out the rest of that site, and then when the load shifted again, it took out the other sites hosting that infrastructure. (I believe they've resolved this issue since then.) The data wasn't lost in this case, but it's feasible to imagine a scenario involving your data where it was.

You want your backup to be immune to all of that, and even accessible while that infrastructure is down. If the data is unavailable for a week while your RAID rebuilds, being able to recover business-critical documents from backup is nice (though not required). If your RAID disappears and that loss then replicates to your other site, you'll really want that backup to be from a separate vendor or on some isolated media like tape.

All this said, I'll repeat that your backup should be on a separate infrastructure from your data. There are many levels of isolation here, but I think anything connected through direct replication is too close to be a backup. You'll want something in addition.

Nick
1

Assumption: the storage system will be used by many applications.

I think you will do much better with a separate backup system.

RAID and mirroring are not backups, but a built-in rollback feature can replace a traditional backup system.

BUT:

I prefer the recovery policies to be application/data based and not storage based because:

  1. applications have different requirements for recovery and acceptable data loss (some of them imposed by various regulations: read-only media, encryption, keeping the last X years, etc.),
  2. some applications (Oracle, MSSQL) have (very) good built-in backup and recovery tools, which are the recommended way to do backup and recovery (as an Oracle DBA, I prefer to do all my Oracle-related backups with rman),
  3. growth: your space usage can grow much more quickly than you expect; this system can accommodate 30 days of rollback data now, but that is not guaranteed in the future,
  4. cost: after several years of growth, using bigger tapes to accommodate the backup/recovery policies will cost less than buying new, bigger disks to keep the same rollback window as now.
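The growth point (item 3) can be illustrated with some back-of-the-envelope arithmetic. This is a deliberately simplified model, not how any particular array accounts for snapshot space: it assumes each day of rollback history costs a fixed fraction of the live data. The function name `rollback_window_days`, the 2% daily change rate, and the growth figures are all made-up assumptions; only the 1.5 PB capacity comes from the question.

```python
def rollback_window_days(capacity_tb, live_tb, daily_change_fraction):
    """Days of snapshot history that fit in the space not used by live data.

    Simplified assumption: each day of history costs
    live_tb * daily_change_fraction terabytes.
    """
    free_tb = capacity_tb - live_tb
    if free_tb <= 0:
        return 0.0
    return free_tb / (live_tb * daily_change_fraction)


CAPACITY_TB = 1500   # the 1.5 PB from the question
DAILY_CHANGE = 0.02  # assumed: 2% of live data changes per day

for live_tb in (600, 900, 1200):  # assumed growth path
    days = rollback_window_days(CAPACITY_TB, live_tb, DAILY_CHANGE)
    print(f"{live_tb} TB live -> about {days:.0f} days of rollback")
```

Under these assumed numbers the window shrinks from about 75 days at 600 TB of live data to under two weeks at 1200 TB, so a 30-day window that fits today may not fit after a few years of growth.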
valentin