7

I always hear people say "test your backups", but I have no idea how that is done in practice when you have to deal with complex infrastructures.

For personal backups it's easy to rely on something like checksums, because all you have to recover is your own files (pictures, zip files, documents, etc.). So if you compare the checksums every once in a while, you can be sure that your files are still readable and intact. You can then restore them manually if you need to.
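
For example, the kind of periodic check I mean can be as simple as this sketch (the backup location and manifest format here are just examples I made up):

```python
import hashlib
from pathlib import Path

BACKUP_ROOT = Path("/backups/personal")    # example location
MANIFEST = BACKUP_ROOT / "sha256sums.txt"  # one "<hash>  <relative path>" per line

def sha256(path: Path) -> str:
    """Stream the file through SHA-256 so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# Re-check every file against the hash recorded when it was backed up.
for line in MANIFEST.read_text().splitlines():
    expected, name = line.split("  ", 1)
    if sha256(BACKUP_ROOT / name) != expected:
        print(f"CORRUPTED: {name}")
```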

But when you need to back up data in businesses, companies, or organizations, you have to deal with complex infrastructures: large offices, several servers, multiple machines, users with different roles, lots of configurations, more files, etc. You also need to avoid downtime, so I suppose the whole environment should be restorable quickly and automatically. The question is: how do you make sure the whole backup process works and keeps working (generating valid backups that can actually be restored) in such a complex infrastructure? Testing with full restores doesn't seem feasible unless you have a "copy" of the whole company to use as a testing field (like a huge empty office with lots of machines to simulate a restore). So I suppose you need to test the backups in other ways, but I have no idea what this whole testing process could be.

So what I'm asking is how backups are tested in large-enough companies, in practice, including all the necessary steps (what is usually backed up, what is usually not backed up, where it is backed up, how it is tested, how often, and what machines, software, and people are usually involved in the process).

reed
  • This sounds more like a disaster recovery plan that needs to be in place. I am not sure if this is the right area to ask this question in to be honest. – Jeroen Nov 23 '20 at 22:04
  • I see many questions in this question! – elsadek Nov 24 '20 at 06:33
  • "I always hear people say "test your backups"" – Actually, that's the boring part. It is much more important to *test your restore*. I know of a company who had the most beautiful, well-tested backups, and then one day when they had to do a full-system restore, they realized that the firmware of the very expensive tape robot was *juuuuust* incompatible enough with the firmware of the very expensive SCSI HBA in their very expensive midrange server that booting from tape did not work, which is however, *exactly* how full-restore is done in this system. – Jörg W Mittag Nov 24 '20 at 10:23

2 Answers

3

It looks like you are combining a few different concepts into one: copy validation, testing of restoration procedures, and fully restoring the entire enterprise. It also looks like you are conflating the difficulty of restoring data files with the difficulty of restoring entire infrastructures. We need to pull this question apart.

Checksums are for validating the integrity of the copy; they're not a "test". Validation is done at copy time to guarantee a faithful copy.
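
As a minimal sketch (not any particular product's mechanism), copy-time validation amounts to hashing the source, hashing the destination, and refusing to accept the copy until the two match:

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def validated_copy(src: Path, dst: Path) -> None:
    """Copy a file and verify the destination hash matches the source hash."""
    shutil.copy2(src, dst)
    if sha256(src) != sha256(dst):
        raise IOError(f"copy of {src} to {dst} failed integrity validation")
```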

Infrastructure complexity does not make verification more difficult. It does create a complicated environment to test the restoration process. But this is easily accomplished using virtual environments, or dev/test infrastructure. I often use the virtual dev environment to restore core services and their dependencies. In one org, we would restore core VMs nightly from the previous backup/snapshot, run automated tests, and delete the VMs.
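
To sketch what that nightly loop can look like (the `restore-vm` / `destroy-vm` commands and the health-check URLs below are hypothetical stand-ins for whatever your backup and hypervisor tooling actually provides; only the orchestration pattern is the point):

```python
import subprocess
import urllib.request

# Hypothetical CLI commands standing in for your backup/hypervisor tooling.
RESTORE_CMD = ["restore-vm", "--from-latest-backup"]
DESTROY_CMD = ["destroy-vm"]

CORE_VMS = {
    # VM name -> health-check URL inside the isolated test network (examples).
    "dc01": "http://10.0.99.10:8080/health",
    "app01": "http://10.0.99.20:8080/health",
}

def test_restores() -> bool:
    ok = True
    for vm, health_url in CORE_VMS.items():
        try:
            # Restore the VM from last night's backup into the test network.
            subprocess.run(RESTORE_CMD + [vm], check=True)
            # A trivial automated test: the restored service answers a health probe.
            with urllib.request.urlopen(health_url, timeout=60) as resp:
                ok &= (resp.status == 200)
        except Exception as exc:
            print(f"restore test FAILED for {vm}: {exc}")
            ok = False
        finally:
            # Always tear the copy down; the restore test must not leak into prod.
            subprocess.run(DESTROY_CMD + [vm], check=False)
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if test_restores() else 1)
```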

But for full enterprise restoration, that's a different matter entirely. I know of no org that tests an end-to-end restore of the entire enterprise infrastructure. At best, I have participated in tabletop exercises to map out dependencies and then we validated parts of the process. It is simply not feasible to test a whole-enterprise process, if for no other reason than the many risks that it can present.

So when people say "test your backups", what they mean is that you have tested that you can restore what you've backed up. It is best if you can test the end-to-end process and restore the entire enterprise, but that's not expected because the costs, complexities, and risks get very high.


The reasons for testing are numerous and born of a thousand horror stories.

Like:

  • The new tech who was taught the backup tape handling procedure by his predecessor but mixed up the process for packing up tapes for off-site storage with the process for degaussing the returned tapes. The poor guy was degaussing the new backup tapes for months.
  • The tech who would flip the "copy protect" switch on the tapes before inserting them in the tape machine. The machine was old and didn't alert on this problem. The company had blank backup tapes going back years.
  • The tech who bought a cheap USB drive for backups, but the USB drive was fraudulent and kept overwriting old bytes to make room for new ones. It reported the saved files in the file system, but the files were not actually there.
  • The company who kept a very old tape backup system in operation, but the system was so degraded that the equipment was no longer able to perform reads from media, only writes. And the equipment used a proprietary format, so although the data was likely on the tapes, the company had no way of accessing it.

For all of these stories, a simple and regular test restoration would have highlighted a problem before the saved data was needed.

schroeder
  • Such nice horror stories from real life! +1 – Esa Jokinen Nov 24 '20 at 12:04
  • Fun fact: for the last story, the system was literally held together with duct tape, and they set up an alert on Ebay to catch if anyone was selling original hardware/parts. – schroeder Nov 24 '20 at 13:44
  • Could you expand a bit on the part about "testing the restoration process, easily accomplished using virtual environments"? You also said you test restoration of core services, so I suppose there's also stuff that doesn't get tested, and I'd be interested to know what is usually left out of restoration tests. Thanks. – reed Nov 24 '20 at 14:11
  • @reed It's risk-based. If I can rebuild a system from scratch without a lot of impact, I'm not going to invest in processes to test restoring it. Backups for those systems become a convenience. If my website is only used as a landing page for searches and I don't depend on it for business, I'm not going to put infrastructure around it. I could pay someone on Fiverr to whip up a new one for $100. My finance system, though, that's going to be tested, audited, re-tested, and tabletopped. – schroeder Nov 24 '20 at 14:18
  • @reed I have often installed a hypervisor with a small virtual network where I can test restore server backups and any dependencies (that's why the network). So, I can back up a whole server, image it, pump it over to the virtual environment as a VM copy, spin it up and test it out. Then I can destroy the copy. Virtual environments are a huge improvement on what I used to do, which was to have bare metal systems and hire junior techs to restore servers on rotation to test backups (and train the techs) – schroeder Nov 24 '20 at 14:22
  • Those poor junior techs – unemployed & replaced by virtualization. :'( – Esa Jokinen Nov 24 '20 at 15:33
  • @EsaJokinen yep - now they learn virtualisation, not configuring OS from bare metal :) – schroeder Nov 24 '20 at 17:09
-2

Large corporate infrastructures will usually have some form of automation involved in the backup and verification. This automation decides, across the different backup tasks, which task should run first at any given moment. The decision usually involves the current backup speed, the size of the data being backed up, how often the file is edited, how important the file or data is, and what time of day it is.
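
A purely illustrative sketch of such a scheduler might look like this (the task attributes and weights are invented for the example; no real product's logic is implied):

```python
from dataclasses import dataclass

@dataclass
class BackupTask:
    name: str
    size_gb: float        # size of the data to back up
    edits_per_day: float  # how often the data changes
    importance: int       # 1 (low) .. 5 (critical)

def priority(task: BackupTask, off_peak: bool) -> float:
    """Toy scoring heuristic: frequently edited, important data first;
    large transfers are penalized during peak hours. Weights are invented."""
    score = task.importance * 10 + task.edits_per_day
    if not off_peak:
        score -= task.size_gb  # defer big transfers until off-hours
    return score

tasks = [
    BackupTask("finance-db", 500, 200, 5),
    BackupTask("file-share", 2000, 50, 3),
    BackupTask("landing-page", 1, 0.1, 1),
]
for t in sorted(tasks, key=lambda t: priority(t, off_peak=False), reverse=True):
    print(t.name)
```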

Backups are run very selectively during peak usage hours, with most of the work done during off-hours such as night; this greatly reduces the latency impact. As for syncing edits to files, there is usually a delay of one to two weeks for edited non-core or less critical files or data, whereas edited core files and data are usually synced at intervals of three to five days. This too reduces the latency of the backup process. During backup and syncing, a hash of each file is stored.

Verification, on the other hand, only happens when the backup is used to restore data: matching the stored hash against the hash calculated from the backed-up data shows whether the backup has been tampered with or not.
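
A minimal sketch of that store-at-backup, check-at-restore pattern (the manifest layout is invented for illustration):

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("backup-manifest.json")  # example manifest location

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def record(path: Path) -> None:
    """At backup time: store the file's hash in the manifest."""
    manifest = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    manifest[str(path)] = sha256(path)
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify(path: Path) -> bool:
    """At restore time: compare the stored hash against the current one."""
    manifest = json.loads(MANIFEST.read_text())
    return manifest[str(path)] == sha256(path)
```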

Amol Soneji
  • This doesn't answer the question, which is about **testing** the backups. Obviously verification during the recovery doesn't help at all if the backup didn't work in the first place. – Esa Jokinen Nov 24 '20 at 05:15