11

Our shop relies very heavily on NetApp Volume Snapshots for backups. We use traditional agent-based tape backups for some of our data, but by and large we rely on Snapshots for the majority of our systems. Furthermore, we do not have a rigorous change control policy or any centralized configuration management, so all of our servers, regardless of whether the data their services provide is backed up, would need to be rebuilt from bare metal (and without any real documentation). Naturally, this makes snapshots a very attractive proposition for management because we can just recover the entire server, user data and configuration included. We use NetApp's Virtual Storage Console for making snapshots of our NFS-based VMware datastores and NetApp's SnapDrive for raw device mapped (physical) LUNs that are presented directly to guests. We SnapMirror critical snapshots offsite to another Filer, and we regularly test our restore process.

I can't help but feel uncomfortable with our reliance on snapshots as backups. To me, for a technology to be considered sufficient as a backup strategy, it needs to meet the following criteria:

  • The backup needs to be atomic. That is to say, the backup cannot rely on anything else for its recovery.
  • The backup needs to be separated from the system it is a backup of (out of band).
  • The backup needs to be copied or transported to a remote site (off site).


NetApp Snapshots

It is my understanding that NetApp Snapshots work under a Redirect-On-Write (RoW) methodology. The WAFL file layout uses a set of pointers (metadata?) that reference each block of storage wherever it might be. To make a snapshot, the system just takes a copy of a volume's metadata and stores it in that volume's reserved space. Any writes (creations/changes/deletions) are redirected to new blocks. This is supposed to be the special sauce that makes NetApp's WAFL so great, because you don't have to read the old data, write it out to the reserved space, and then write your new data over the old, as Copy-On-Write snapshots do.
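The mechanism described above can be sketched as a toy model (a deliberate simplification for illustration, not NetApp's actual WAFL implementation): writes always land in fresh blocks, and a snapshot is nothing more than a frozen copy of the pointer table.

```python
# Toy model of redirect-on-write (RoW) snapshots. This is an
# illustration of the concept only, not NetApp's implementation.

class Volume:
    def __init__(self):
        self.blocks = {}      # physical block number -> data
        self.inode = {}       # logical block -> physical block (the "pointers")
        self.snapshots = {}   # snapshot name -> frozen copy of the pointer table
        self.next_free = 0

    def write(self, logical, data):
        # Writes always go to a fresh block; old blocks are never
        # overwritten, so any snapshot pointing at them stays valid.
        self.blocks[self.next_free] = data
        self.inode[logical] = self.next_free
        self.next_free += 1

    def snapshot(self, name):
        # A snapshot is just a copy of the pointer table -- no data
        # blocks are read or rewritten (unlike copy-on-write).
        self.snapshots[name] = dict(self.inode)

    def read(self, logical, snap=None):
        table = self.snapshots[snap] if snap else self.inode
        return self.blocks[table[logical]]

vol = Volume()
vol.write(0, "v1")
vol.snapshot("hourly.0")
vol.write(0, "v2")                   # redirected to a new block
print(vol.read(0))                   # "v2" -- live data
print(vol.read(0, "hourly.0"))       # "v1" -- snapshot still sees the old block
```

Note that the snapshot never copies a data block: it is only valid as long as the blocks it points to still exist on the same volume.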


I fully admit I might not understand exactly how NetApp Volume Snapshots work, but if my understanding is more or less correct, NetApp Snapshots fail to meet my criteria for backups.

  • They are not atomic. The "snapshot" is really just a set of pointers to the original data. If the original data is no longer there, the metadata is useless.
  • The snapshot is not separated from the system. If someone deletes the wrong volume I lose the snapshot. If the NetApp Filer explodes into tiny little kittens I lose the backup. I can use SnapMirror to move my snapshots to another Filer but again, it's just moving the metadata not the actual blocks. If I lose the original volume, I can't see how a snapshot copied to another Filer is going to help.



Can someone explain how NetApp Snapshots can be considered backups? I'm looking for Good Subjective answers, so please support your position with facts, references and experience. If my understanding of the underlying technology is incorrect, please explain where and why that changes my conclusion. If your shop relies on NetApp Snapshots as backups, please include enough contextual information so that people can get a sense of what kind of recovery policy you have to meet.

  • You might also get some useful insights / best practices from the toasters admins mailing list at http://www.teaparty.net/mailman/listinfo/toasters . (Disclaimer: I run the list.) – MadHatter Apr 13 '14 at 05:34
  • I strongly believe that backup must be both off-site and offline. A malicious attacker can't launch an electronic attack that erases a tape in a lock box. You're making an attacker invoke kinetic means once you take backups offline. – Evan Anderson Apr 13 '14 at 13:15
  • As you stated in the question itself, you already realize that snapshots are not a copy of the data. That's why SnapMirror is needed. So why are you asking about snapshots rather than whether snapshot+SnapMirror is a valid backup mechanism? – 200_success Apr 13 '14 at 17:44
  • You often take backups of things that aren't mirrored. Nonprod environments, for example. They take a long time to rebuild, but won't bring the business down if you lose them. – Basil Apr 13 '14 at 19:56

3 Answers

15

Backups serve two functions.

  • First and foremost, they're there to allow you to recover your data if it becomes unavailable. In this sense, snapshots are not backups. If you lose data on the filer (volume deletion, storage corruption, firmware error, etc.), all snapshots for that data are gone as well.
  • Secondly, and far more commonly, backups are used to correct for routine things like accidental deletions. In this use case, snapshots are backups. They're arguably one of the best ways to provide this kind of recovery, because they make the earlier versions of the data available directly to the users or their OS as a .snapshot hidden directory that they can directly read their file from.

No retention policy

That said, while we have snapshots and use them extensively, we still do nightly incrementals on Netbackup to tape or Data Domain. The reason is that snapshots cannot reliably uphold a retention policy. If you tell users that they will be able to restore at daily granularity for a week and then at weekly granularity for a month, you can't keep that promise with snapshots.

On a NetApp volume with snapshots, deleted data contained in a snapshot occupies "snap reserve" space. If the volume isn't full and you've configured it this way, snapshots can also push past that snapshot reserve and occupy some of the unused data space. If the volume fills up, though, all snapshots except those backed by data in the reserved space will get deleted. Deletion of snapshots is determined only by available snapshot space, and if the filer needs to delete snapshots that are required for your retention policy, it will.

Consider this situation:

  • A full volume with regular snapshots and a 2 week retention requirement.
  • Assume half of the reserve in use for snapshots based on the normal rate of change.
  • Someone deletes a lot of data (more than the snapshot reserve), drastically increasing the rate of change, temporarily.

At this point, your snapshot reserve is completely used, as is as much of the data free space you've allowed OnTap to use for snapshots, but you haven't lost any snapshots yet. As soon as someone fills the volume back up with data, though, you'll lose all the snapshots contained in the data section, which will push your recovery point back to the time just after the large deletion.
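The failure mode above can be sketched with a rough simulation. The numbers and the oldest-first eviction policy here are illustrative assumptions, not OnTap's real space accounting, but the core behavior — eviction driven purely by space, never by retention — is the point:

```python
# Rough, illustrative simulation of space-driven snapshot autodelete.
# Numbers and policy are assumptions for illustration, not OnTap's
# actual accounting.

from collections import deque

capacity = 100        # blocks available to hold snapshot-only data
snapshots = deque()   # (day taken, blocks held), oldest first

def take_snapshot(day, blocks_held):
    snapshots.append((day, blocks_held))
    # Eviction is driven purely by available space, never by the
    # retention policy you promised your users:
    while sum(b for _, b in snapshots) > capacity:
        snapshots.popleft()

for day in range(1, 15):   # two weeks of normal churn
    take_snapshot(day, 5)
take_snapshot(15, 60)      # large deletion: this snapshot must hold it all
take_snapshot(16, 40)      # volume refilled with data: heavy churn again

print("oldest surviving snapshot: day", snapshots[0][0])
# Only days 15-16 survive; the promised 14-day recovery window is gone.
```

After the big deletion and the refill, the recovery point has silently moved to just after the deletion, exactly as described above.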

Summary

NetApp snapshots don't cover you against real data loss. An errant volume deletion or data loss on the filer itself will require you to rebuild that data.

They are a very simple and elegant way to allow for simple routine restores, but they aren't reliable enough that they replace a real backup solution. Most of the time, they'll make routine restores simple and painless, but when they're not available, you are exposed.

Basil
  • `Deletion of snapshots is determined only by available snapshot space, and if it needs to delete snapshots that are required for your retention policy` - This is something I didn't even consider. Excellent point. –  Apr 15 '14 at 16:10
  • You want to have some fun? Try doing snapshots on a snapmirrored volume for flexclones of the target. Then try using 100% of the non-reserve space on the source. It works until the snapshot backing that flexclone gets deleted on the source volume, at which point replication *stops*. – Basil Apr 15 '14 at 16:21
  • While I agree with you for the most part, I'd probably correct you on your first point. Remember the 3-2-1 backup rule and that the 2 stands for across two different media. Snapshots fit well as one of your three copies and perhaps your most common restore scenario. They aren't your off-media copy or your offsite copy. So, I'd say snapshots serve as backups but aren't sufficient as your ONLY backups or whole backup strategy. I think this is what you were getting at, but I feel like this is slightly more nuanced. – abegosum Mar 24 '15 at 15:11
  • Nice distinction between the two (comparably important) functions of backups, which can be more tersely referred to as *disaster recovery* and *moron recovery*, respectively. – MadHatter Jul 03 '19 at 08:01
8

They are a backup, yes. I've personally used them in place of daily incrementals before, but we still did weekly fulls to tape.

They protect quite well against user or admin errors and problems that originate outside the NetApp itself (i.e., on the systems accessing the volumes).

They do not protect from catastrophic hardware failures of the netapp itself. My understanding is that SnapMirror does copy all of the data (in the snapshot) to the other filer[1], so SnapMirroring to another filer should protect that dataset from catastrophic failure of a single filer.

The one major problem, of course, is that if somebody managing the netapp deletes the volume, then all the snapshots go with it. SnapMirror to another filer should adequately protect against that.

If all your NetApp filers are in the same data center, then you don't have anything covering a major disaster, the way that tape backups shipped offsite would give you.

You'll get better backups of your VMs and any databases (or database-like things) if you use the appropriate SnapManager agent, which will coordinate quiescing the data briefly as the snapshot is taken. If a given VM and its data is contained entirely within a single NetApp volume, then the snapshot of that VM should be crash-consistent. That is, it should be just as good as if you pulled the plug on a server and imaged the drive, which would typically mean filesystem checks and the database equivalents. If a database's data is split between LUNs, it seems like there's a significant risk of data corruption.

If it were me, I'd set up all databases to do regular backups to local disk, and set those jobs to keep a copy or two. That gives you a much better guarantee of recoverability.
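As a sketch of that approach — illustrative only, using Python's built-in sqlite3 as a stand-in for a real RDBMS; a production job would use the database's own dump tooling (mysqldump, pg_dump, SQL Server native backups, etc.) — a backup job that keeps a copy or two might look like this:

```python
# Illustrative sketch: take a consistent online copy of a database to
# local disk and keep only the newest few copies. sqlite3 stands in
# for a real RDBMS here; the rotation logic is the point.

import os
import sqlite3
import tempfile

def backup_with_rotation(db_path, backup_dir, keep=2):
    """Copy the live database and keep only the newest `keep` copies."""
    os.makedirs(backup_dir, exist_ok=True)
    existing = sorted(f for f in os.listdir(backup_dir) if f.endswith(".bak"))
    next_id = int(existing[-1][7:11]) + 1 if existing else 1
    dest = os.path.join(backup_dir, "backup-%04d.bak" % next_id)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest)
    with dst:
        src.backup(dst)   # online, transaction-consistent copy
    src.close()
    dst.close()
    # Rotate: drop the oldest copies beyond the retention count.
    for old in (existing + [os.path.basename(dest)])[:-keep]:
        os.remove(os.path.join(backup_dir, old))

# Demo: three nightly runs against a scratch database, keeping two copies.
tmp = tempfile.mkdtemp()
db_path = os.path.join(tmp, "app.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
conn.close()

bdir = os.path.join(tmp, "backups")
for _ in range(3):
    backup_with_rotation(db_path, bdir, keep=2)
remaining = sorted(os.listdir(bdir))
print(remaining)   # only the two newest copies survive
```

The dump files on local disk then get swept up by whatever snapshots or tape backups cover that volume, giving a recoverable copy even if the live database files are crash-inconsistent.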

[1] http://www.netapp.com/us/system/pdf-reader.aspx?m=snapmirror.pdf&cc=us

freiheit
  • +1 for mentioning SnapMirroring to another filer; people do seem to be overlooking that functionality. – MadHatter Apr 13 '14 at 16:15
  • Snapmirroring to another filer won't protect you from snapshot autodelete shortening your recovery point, though. It does protect against volume deletions and filer loss. – Basil Apr 13 '14 at 23:38
2

You should go read @Basil's excellent answer right now, but here are my two cents:

Snapshots are not application aware

Just because you take a snapshot of the underlying storage volume does not mean the data on that volume is recoverable. MS SQL is a great example of this: you need to make sure your database is transaction-consistent before you snapshot the storage it is using; otherwise, as @freiheit mentioned, you are no better off than recovering from a hard-down failure. DBAs love using different LUNs for different parts of SQL to better utilize the storage system: temp databases on fast storage, system databases on slower storage, read-only or archived data on bulk storage, and working data somewhere in between. If you are just snapshotting those volumes independently, it is highly unlikely you will be able to recover your database.
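A toy model makes that failure window concrete. This is an illustration of the ordering problem only — not NetApp or SQL Server code — with two dictionaries standing in for the data LUN and the log LUN:

```python
# Toy illustration of why snapshotting a database's LUNs independently
# can leave an unrecoverable backup: the data "LUN" and the log "LUN"
# are captured at different instants, so the two halves disagree.

data_lun = {}   # stands in for the data-file LUN
log_lun = {}    # stands in for the transaction-log LUN

def commit(txn_id, value):
    # A committed transaction touches both volumes.
    data_lun[txn_id] = value
    log_lun[txn_id] = "committed"

commit(1, "a")
snap_data = dict(data_lun)   # data LUN snapshotted first...
commit(2, "b")               # ...a transaction commits in between...
snap_log = dict(log_lun)     # ...then the log LUN is snapshotted

# After restore, the log claims txn 2 committed, but the restored data
# LUN has no trace of it: the backup set is not self-consistent.
inconsistent = (2 in snap_log) and (2 not in snap_data)
print("backup self-consistent:", not inconsistent)
```

Quiescing the database before the snapshot — which is what the SnapManager tools coordinate — closes exactly this window by ensuring no transaction lands between the two captures.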

NetApp supplies a number of Snap tools to make snapshots application aware. SnapManager for SQL provides that awareness. In the Microsoft ecosystem I believe there are also SnapManager tools for Exchange and SharePoint. SnapDrive does not have this application awareness. It just provides a convenient method to manage storage within the guest.

If you are storing all your IIS data and configuration on LUNs and snapshotting those LUNs directly, you cannot guarantee that data is recoverable. Ask me how I know...


Multiple storage types can have different snapshot schedules

If you are presenting storage to your servers in different ways, this can complicate your snapshot and recovery picture. NetApp's ONTAP is a multi-protocol offering, and it is very possible you are using more than one method or storage type for a particular server. In our shop, some of our servers get their C:\ drive over an NFS-based datastore and their "storage" drives over raw device mapped LUNs. We were taking snapshots of the RDM LUNs but not of the NFS-based datastores. This made recovering the server difficult.


Snapshots do not have a guaranteed retention policy

Again, @Basil really covers this well, but it's worth reiterating. It is possible to fill up your Snap Reserve in such a way that Snapshot Autodelete removes snapshots that have not naturally aged to deletion. This can be really bad if you or your customers are expecting three weeks of snapshots to be available.


Snapshots are inline

This is the drawback of integrated storage... it's, well... integrated. Your snapshots reside on the same platform you are backing up. If the volume or the Filer it sits on disappears, so does your backup. You can mitigate this somewhat by copying the snapshots to another Filer using SnapMirror; contrary to what I erroneously stated in my question, the SnapMirror copy is a full copy of the data, not just the metadata.


Snapshots enable bad operational practices to continue

One thing that I have noticed is that snapshots enable managers and customers to continue terrible operations behavior. In our environment we have very poor documentation and configuration management practices. This means that most servers start with the same base (a template or an image) but are then configured manually by different groups of people. As they continue their life, the servers diverge further and further from the template in ways that are generally not documented or implemented with configuration management.

And then come snapshots! We don't need to step back and address some of our fundamental operational practices because we can just snapshot all our servers! And we can use SnapMirror to move those snapshots off-site so we can use them as backups!

I think this is the wrong lesson to learn here. A better lesson is that the configuration management framework, even if it is as simple as a changelog, should be backed up for the purposes of bare-metal restore. Snapshots are a fantastic tool, but I can see there is a temptation to be overly reliant on them to the detriment of important fundamentals.