2

This question follows this one, regarding encrypting files individually in order to upload them to a cloud service.

--- Scenario:

  1. I have my folder full of unencrypted files

  2. Via script I make a shadow copy of all of them

  3. Via script I encrypt them one by one with recursive GPG encrypt commands, using my own public key, the --symmetric option, and a dedicated passphrase

  4. Via syncing app I upload them

  5. Via script I delete the shadow copy (not sure about this one: how could I later compare unencrypted and encrypted files in order to figure out which ones need replacing?)

  6. By the time I re-run the procedure, some of my original files will have changed. Ideally, only those will need to be uploaded. I repeat steps 2. and 3. so that comparison with the encrypted files on the cloud (via the sync app) can happen.
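Steps 2 and 3 could be sketched in Python rather than batch (a sketch only: the recipient key ID is a placeholder, gpg must be on the PATH, and combining --encrypt with --symmetric means either the private key or the passphrase can decrypt):

```python
import subprocess
from pathlib import Path

def gpg_encrypt_cmd(src: Path, dest: Path, recipient: str) -> list[str]:
    # Build the gpg invocation for one file: public-key encryption to
    # `recipient` plus --symmetric with a passphrase prompted by gpg.
    return [
        "gpg", "--batch", "--yes",
        "--encrypt", "--symmetric",
        "--recipient", recipient,   # placeholder key ID
        "--output", str(dest),
        str(src),
    ]

def encrypt_tree(src_root: Path, dest_root: Path, recipient: str) -> None:
    # Recursively encrypt every file, mirroring the directory layout.
    for src in src_root.rglob("*"):
        if src.is_file():
            dest = dest_root / src.relative_to(src_root)
            dest = dest.with_name(dest.name + ".gpg")
            dest.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(gpg_encrypt_cmd(src, dest, recipient), check=True)
```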

--- Major Problem:

Considering that two copies of the same files encrypted with GPG will never look the same (see answers to this question), how can I achieve comparison between the encrypted files?

Or should my procedure be completely different?

nico
  • 341
  • 1
  • 2
  • 9
  • @Arminius: thanks I had missed that one. Edited question and title. Apparently I have a bigger problem than I thought. – nico Feb 10 '17 at 07:32

1 Answer

2

You can't compare encrypted file contents if you're using gpg: each run generates a fresh random session key (and salt, with --symmetric), so encrypting the same file twice yields different ciphertexts.

You have two ways forward that I can see:

  1. Build a hash (say, SHA-256) of each file before you encrypt it and store it somewhere. Compare the hashes instead of the files. This might become a performance bottleneck if your files are very large; you can check whole trees by building hashes of hashes, much like git does. I have a Python backup script that does this to detect changes in a large file base; you can have the script if you're interested. You'd have to add an encryption step to it, though.
  2. Look at the modification time of the file (every filesystem stores this along with the file content) to find out whether one file is newer than another. This is incredibly fast, but you need to take care to get a working system: for instance, the encrypted copies get fresh timestamps, so you must record the originals' modification times before encrypting.

However, no matter which way you go, the following problem (from your first comment) will haunt you:

But I'm still confused as to when the comparison of timestamps should happen, and between which groups of files. [the emphasis is mine]

This is a fairly difficult problem buried in every sync scenario: if you have two sets of files (e.g. two file trees), you need a way to figure out which files in tree A should be compared to which files in tree B. This isn't a problem when you just modify file contents, but what happens when you add a few files, rename others and delete still others? You basically need an algorithm to determine the edit distance, and a tree editor (to determine and apply a small, or the smallest, set of operations that will turn tree A into tree B). I believe this is an O(n^2) problem, and it's an algorithms question, not a security question (so you should ask it on one of the Stack Exchange sister sites).
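Given two path-to-hash manifests, a simple (non-minimal) delta can be computed with set operations; detecting renames by matching hashes is a rough heuristic I'm sketching here, not a full edit-distance solution:

```python
def tree_delta(old: dict[str, str], new: dict[str, str]):
    # old/new map relative path -> content hash (one manifest per run).
    added    = set(new) - set(old)
    removed  = set(old) - set(new)
    modified = {p for p in set(old) & set(new) if old[p] != new[p]}
    # Heuristic: a rename shows up as one removed and one added path
    # that share the same content hash.
    old_by_hash = {old[p]: p for p in removed}
    renamed = {(old_by_hash[new[p]], p)
               for p in added if new[p] in old_by_hash}
    return added, removed, modified, renamed
```

This is O(n) in the number of files, but it only pairs exact-content matches; near-duplicate or edited-then-renamed files still land in added/removed, which is where the hard edit-distance problem begins.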

Also, if you want your solution to work for file trees, not just a single folder, I doubt you can solve this with a small batch script; like I said, I did something comparable for backup purposes and my script has grown to considerable size (thousands of lines).

Out of Band
  • 9,150
  • 1
  • 21
  • 30
  • I was thinking of going with option 2 which seems easier to achieve with simple batch scripting. But I'm still confused as to when the comparison of timestamps should happen, and between which groups of files. I have uploaded encrypted files. Some of the unencrypted files have changed. Supposing I still have the encrypted files locally, do I compare the timestamps of these two groups? But I don't think this is possible since GPG creates a new file...? – nico Feb 10 '17 at 07:45
  • 1
    Hashing is more secure and reliable. And if you threw away the originals, the timestamps will be messed up indeed. Like Pascal said, you need to take care to get this right. What I'd do is make a simple file with "last modified" timestamps that you update on every scan, with something like this: http://stackoverflow.com/questions/1520643/windows-batch-file-check-if-file-has-been-modified – J.A.K. Feb 10 '17 at 08:38