3

I have a NAS with files totalling about 2TB. I suspect I could shrink this because there are possibly duplicate files. I plan on making a second backup by sending what I have to Google Drive. My concern is how I should verify my backups. I suppose I could take a few days to verify everything by hand, but is there a better, faster way to confirm my whole backup wasn't tampered with?

Should I be using a different backup method that would allow for verifying backups, instead of just running 'rclone' on all of my files?

EDIT: Adding a bit of info about my system. My NAS is a Raspberry Pi 3B+ running Arch Linux ARM (AArch64).

EDIT 2: My concern is that if my local backup drives failed (maybe someone ruined them on purpose), I would have no way to be 100% sure all of my stuff (photos, music, etc.) is as I remember it.

RansuDoragon
  • Can you perhaps be a bit more specific about your concerns? – Conor Mancone May 21 '18 at 22:48
  • What level of tampering are you worried about (accidental, normal adversary, nation state)? – jrtapsell May 21 '18 at 22:53
  • I added a bit to my original post. To add to that: with all the scary levels of computer-assisted edits (i.e. deep fakes), I want to make sure that the stuff I back up, mainly photos, cannot be edited. I don't want to go through my files one day and see something that makes me question whether I actually made it. Or if I ever had to show my backup to prove something, I don't want it to contain an incriminating file that someone else managed to put there without my knowledge. – RansuDoragon May 21 '18 at 22:59
  • You can set the backup folders to append-only perms, so that only creation and appending work. This prevents overriding archives, even with surreptitious perms. – dandavis May 21 '18 at 23:22

1 Answer

9

Cryptographic hash lists

From a theoretical standpoint, verifying the integrity of data that will be put on an untrusted medium (such as a remote backup server or an unattended storage device) requires you keep something on a secure medium which is trusted and cannot be tampered with. That something allows you to verify that the rest of the data has not been tampered with. In other words, you are moving trust from the data itself to some auxiliary data (hashes, a cryptographic signature, etc). This is desirable because that auxiliary data is very small and can be stored locally on most mediums. In the case of a drive backup, you could use a cryptographic hash, such as SHA-256, on each file:

find /path/to/backup -type f -exec sha256sum -- {} + > hashlist.txt

The resulting text file contains a list of files and their SHA-256 hashes. The file itself can either be stored locally on a trusted machine, or cryptographically signed (e.g. with GnuPG or signify) and kept in the backup itself. This will prevent anyone with write access to the backup (but not write access to the hash list) from modifying, adding, or deleting files without being detected.
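
To re-check the files later, a minimal sketch (assuming the same /path/to/backup mount point that was used when hashlist.txt was generated):

# Flags any modified or missing file listed in hashlist.txt
sha256sum --check hashlist.txt

Note that this catches modified and deleted files; detecting newly added files additionally requires comparing a fresh file listing against the old one.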

Digital signatures

If you are storing the backup in a large filesystem image or tarball, you can simply hash that instead, which will give you the additional benefit of preserving directories and metadata:

tar cpf - /path/to/backup | tee backup.tar | sha256sum > backup.tar.sha256

If you want to store the hash list in the backup itself, it's necessary to ensure that the hash list cannot be tampered with without detection. The most common way to do this is to use GPG. Once you have created a signing key, you can sign the file, whether the file is a hash list or an entire tarball. The signature can be stored alongside the backups and you can safely verify the backups using the corresponding public key. Just make sure you keep a local copy of the public key and do not allow the private key to fall into the hands of your adversaries. This technique has the advantage of being compatible with a number of security-oriented smart cards.
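
As a concrete sketch with GnuPG (reusing the hashlist.txt from above; the .sig filename is arbitrary):

# Create a detached signature with your private key
gpg --output hashlist.txt.sig --detach-sign hashlist.txt
# Later, verify it with the corresponding public key
gpg --verify hashlist.txt.sig hashlist.txt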

Secure backup utilities

There are some tools designed specifically for backups which provide integrity and confidentiality. Duplicity, for example, supports sending client-side encrypted incremental backups to a large number of online data hosts, including Google Drive. It encrypts and signs backups using a public and private keypair, though it also supports using a symmetric key (plain passphrase) instead. Because incremental backups are supported, only the initial backup of a large drive will take substantial time. Subsequent backups will only transmit the files that have changed:

duplicity /path/to/backup gdocs://user[:password]@other.host/some_dir 

Restoring from backup verifies the data, and data can be manually verified as well. As with GPG, you need to keep the keypair on a trusted medium (a local drive or a smart card). Read access to the private key allows forging valid signatures, as does write access to the public key.
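
If I recall Duplicity's interface correctly, verification and restore look roughly like this (a sketch; /path/to/restore is just a placeholder):

# Compare the remote backup against the local files and report differences
duplicity verify gdocs://user[:password]@other.host/some_dir /path/to/backup
# Restore into a separate directory when needed
duplicity restore gdocs://user[:password]@other.host/some_dir /path/to/restore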

This is what I use myself.

Things to be aware of

In order to ensure that the hash values are faithful representations of the contents of the files being hashed, it is necessary to use a strong hash function. Avoid using MD5 or SHA-1 as they are both broken (in that it is possible to create two files with differing contents but the same hash). The SHA-256 algorithm is the most popular, but SHA-3-256 and BLAKE2b are also acceptable.
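
For instance, GNU coreutils ships b2sum for BLAKE2b, and recent OpenSSL builds can compute SHA3-256 (availability of these tools on your system is an assumption):

# BLAKE2b hash list, analogous to the sha256sum example above
find /path/to/backup -type f -exec b2sum -- {} + > hashlist.blake2
# Single-file SHA3-256 digest via OpenSSL
openssl dgst -sha3-256 backup.tar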

If you are generating a keypair (because you are using GnuPG to sign a hash list or archive, or a backup utility that requires a keypair like Duplicity), you should use a strong algorithm. The industry standard is RSA for encryption and signing, with a keysize of 2048 or larger.
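
With GnuPG 2.1 or later, generating such a key might look like this (a sketch; the user ID is a placeholder):

# Create an RSA 4096-bit key restricted to signing
gpg --quick-generate-key "Backup Signing <you@example.com>" rsa4096 sign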

In a comment, you mention that you want to be able to use this to prove that incriminating files were not added with your knowledge. This is not likely to fly in court, because it would be trivial for you to upload a new, incriminating file and not sign it in order to claim that it was not added by you. All this will allow you to do is trust the backup as much as you trust the local hash list (or signing key).

forest
  • The problem I see is that if you use public-private keys then you can't include them in the backups. BTW, if I'm not mistaken, encryption (of the keys or of the whole backups) wouldn't guarantee integrity, would it? So in the end the most reliable and straightforward solution to me seems to just use secure checksums (like sha256) along with the files, and the integrity of the checksum files can be verified by comparing different copies on different places (so the checksum file on your PC must match the one uploaded on Google drive, etc). – reed May 22 '18 at 10:21
  • Encryption does not guarantee integrity, but Duplicity also signs the files. And you _could_ put the private key in the backup, if you encrypted it... – forest May 22 '18 at 13:25
  • I think I'm going to use Duplicity instead of rclone for offsite "cloud" backups now, using encryption and signing. Thank you for the suggestion and the well-explained answer above. At least I have some peace of mind that my files are less likely to have been tampered with using these backup methods. – RansuDoragon May 22 '18 at 14:02