Our shop relies very heavily on NetApp Volume Snapshots for backups. We use traditional agent-based tape backups for some of our data, but by and large we rely on Snapshots for the majority of our systems. Furthermore, we do not have a rigorous change control policy or any centralized configuration management, so all of our servers, regardless of whether the data their services provide is backed up, would need to be rebuilt from bare metal (and without any real documentation). Naturally, this makes snapshots a very attractive proposition for management because we can just recover the entire server, user data and configuration included. We use NetApp's Virtual Storage Console for making snapshots of our NFS-based VMware datastores and NetApp's SnapDrive for raw device mapped (physical) LUNs that are presented directly to guests. We SnapMirror critical snapshots offsite to another Filer. Naturally, we regularly test our restore process.
I can't help but feel uncomfortable with our reliance on snapshots as backups. To me, for a technology to be considered sufficient as a backup strategy, it needs to meet the following criteria:
- The backup needs to be atomic. That is to say, the backup cannot rely on anything else for its recovery.
- The backup needs to be separated from the system it is a backup of (out of band).
- The backup needs to be copied or transported to a remote site (off site).
It is my understanding that NetApp Snapshots work under a Redirect-On-Write (RoW) methodology. The WAFL file layout uses a set of pointers (metadata?) that reference each block of storage, wherever it might be. To make a snapshot, the system just takes a copy of a volume's metadata and stores it in that volume's reserved space. Any subsequent writes (creations/changes/deletions) are redirected to new blocks. This is supposed to be the special sauce that makes NetApp's WAFL so great: unlike Copy-On-Write snapshots, you don't have to read the old data, write it to the reserved space, and then write your new data over the old.
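To make my mental model concrete, here is a minimal Python sketch of redirect-on-write as I understand it. This is a toy, not NetApp's implementation: `ToyVolume` and all of its attributes are names I made up for illustration, and real WAFL metadata is certainly more involved.

```python
class ToyVolume:
    """Toy redirect-on-write volume: a block store plus a pointer map.

    Hypothetical model for illustration only; not how WAFL is implemented.
    """

    def __init__(self):
        self.blocks = {}      # physical block id -> data
        self.active = {}      # logical block number -> physical block id
        self.snapshots = {}   # snapshot name -> frozen copy of the pointer map
        self._next_id = 0

    def write(self, lbn, data):
        # Redirect-on-write: never overwrite in place. Allocate a new
        # physical block and repoint the active map at it.
        pid = self._next_id
        self._next_id += 1
        self.blocks[pid] = data
        self.active[lbn] = pid

    def snapshot(self, name):
        # Taking a snapshot copies only the pointers, never the data.
        self.snapshots[name] = dict(self.active)

    def read(self, lbn, snapshot=None):
        pointers = self.active if snapshot is None else self.snapshots[snapshot]
        return self.blocks[pointers[lbn]]


vol = ToyVolume()
vol.write(0, "config v1")
vol.snapshot("nightly")
vol.write(0, "config v2")   # redirected to a new block; the old block is untouched

print(vol.read(0))                      # config v2
print(vol.read(0, snapshot="nightly"))  # config v1
```

In this model the snapshot costs one pointer copy and no data movement, which is the appeal. It also shows what worries me: the snapshot holds no data of its own, only references into `vol.blocks`.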
I fully admit I might not understand exactly how NetApp Volume Snapshots work, but if my understanding is more or less correct, NetApp Snapshots fail to meet my criteria for backups.
- They are not atomic. The "snapshot" is really just a set of pointers to the original data. If the original data is no longer there, the metadata is useless.
- The snapshot is not separated from the system. If someone deletes the wrong volume I lose the snapshot. If the NetApp Filer explodes into tiny little kittens I lose the backup. I can use SnapMirror to move my snapshots to another Filer but again, it's just moving the metadata not the actual blocks. If I lose the original volume, I can't see how a snapshot copied to another Filer is going to help.
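The concern in the two points above can be sketched in a few lines of Python, assuming a snapshot really is nothing more than a pointer map. Everything here (`copy_pointers_only`, `copy_with_blocks`, `restore`) is hypothetical and says nothing about what SnapMirror actually transfers; which of the two copies SnapMirror resembles is part of what I'm asking.

```python
def copy_pointers_only(snapshot):
    # Hypothetical replication that ships only the metadata.
    return dict(snapshot)


def copy_with_blocks(snapshot, blocks):
    # Hypothetical replication that resolves each pointer and ships the data.
    return {lbn: blocks[pid] for lbn, pid in snapshot.items()}


def restore(pointer_map, blocks):
    # Restoring from pointers requires the referenced blocks to still exist.
    return {lbn: blocks[pid] for lbn, pid in pointer_map.items()}


source_blocks = {0: "user data", 1: "server config"}  # physical id -> data
snapshot = {0: 0, 1: 1}                               # logical block -> physical id

pointer_copy = copy_pointers_only(snapshot)
full_copy = copy_with_blocks(snapshot, source_blocks)

source_blocks.clear()  # someone deletes the wrong volume

try:
    restore(pointer_copy, source_blocks)
    pointer_copy_restorable = True
except KeyError:
    # The pointers survive, but they point at blocks that are gone.
    pointer_copy_restorable = False

print(pointer_copy_restorable)  # False
print(full_copy)                # {0: 'user data', 1: 'server config'}
```

Only the copy that carried the blocks with it survives the loss of the source, which is what I mean by a backup being atomic and out of band.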
Can someone explain how NetApp Snapshots can be considered backups? I'm looking for Good Subjective answers, so please support your position with facts, references and experience. If my understanding of the underlying technology is incorrect, please explain where and why that changes my conclusion. If your shop relies on NetApp Snapshots as backups, please include enough contextual information so that people can get a sense of what kind of recovery policy you have to meet.