I'm building a system that passes around what are essentially uncompressed TARs of Linux userlands, including some additional private files. An original file (X) is sent to a worker, and the worker produces a new file (Y) which has a high probability of being similar to X. The workers will have limited storage available, so I want to avoid requiring the worker to keep a copy of X. I can, however, easily compute a message digest (e.g. SHA512(X)) of X when the worker receives it.

To save upload bandwidth, I would like to allow a worker to request a pre-computed rdiff signature of X from a central store using SHA512(X) as a key, compute an rdiff delta against Y, and upload the delta to a server where an rdiff patch operation is applied to X to derive Y. (So far, much like traditional rsync.)
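A minimal sketch of the lookup step, assuming the worker hashes X as it streams in and the central store is a simple key-value mapping. The names `sha512_of_stream`, `signature_store` and `fetch_signature` are hypothetical and only illustrate the idea:

```python
import hashlib
import io

def sha512_of_stream(stream, chunk_size=1 << 20) -> str:
    """Hash a file-like object chunk by chunk, so X never has to be kept around."""
    h = hashlib.sha512()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

# Hypothetical central store: SHA512(X) hex digest -> precomputed rdiff signature bytes.
signature_store = {}

def fetch_signature(store, digest):
    """Return the stored signature for this digest, or None if it is unknown."""
    return store.get(digest)

# Example: the server registers a signature, the worker looks it up by digest.
x = b"contents of the original tar"
digest = sha512_of_stream(io.BytesIO(x))
signature_store[digest] = b"<signature bytes produced elsewhere>"
found = fetch_signature(signature_store, digest)
```

The worker only needs to remember the digest, not X itself.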

However, I would like to avoid tracking which workers have which original files while also not trusting all workers with all files.

rdiff computes a signature by computing, for each block of the original file, a pair of hashes: a rolling checksum (adler32) and a "strong" hash (blake2 or md4). It transmits a delta by sending back references to the original blocks that stayed the same, along with any additional data.
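To make the structure concrete, here is a rough sketch of a per-block signature. It is not rdiff's actual on-disk format: `zlib.adler32` stands in for the rolling checksum, BLAKE2b stands in for the strong hash, and the block size and digest length are assumptions picked for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 2048  # assumed for illustration only

def block_signature(data: bytes, block_size: int = BLOCK_SIZE):
    """Return one (weak, strong) pair per block of `data`."""
    sig = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        weak = zlib.adler32(block)  # stand-in for the rolling checksum
        strong = hashlib.blake2b(block, digest_size=32).digest()  # stand-in strong hash
        sig.append((weak, strong))
    return sig

# 5000 bytes -> two full blocks plus one short trailing block.
sig = block_signature(b"a" * 5000)
```

The signature holds only hashes of each block, never the block contents themselves.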

So can I assume that:

  1. the rdiff signature can confirm guessed content, but does not make it any easier to guess the contents of a block in X containing private data?
  2. the worker providing a delta and SHA512(Y) proves that the worker originally had X if `rdiff patch X delta_Y` yields Y, provided that SHA512(X) != SHA512(Y)?
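To make both questions concrete, here is a toy model of the scheme. It is a sketch under stated assumptions, not librsync: it uses only a strong hash (no rolling checksum), a tiny 8-byte block size, and hypothetical helper names.

```python
import hashlib

BLOCK_SIZE = 8  # tiny, hypothetical block size so the example is easy to follow

def strong_hash(block: bytes) -> bytes:
    return hashlib.blake2b(block, digest_size=32).digest()

def signature(data: bytes) -> list:
    """One strong hash per block of `data`."""
    return [strong_hash(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]

def confirms_guess(sig, guess: bytes) -> bool:
    """Question 1: does a guessed block hash to some entry in the signature?"""
    return strong_hash(guess) in set(sig)

def make_delta(sig, new: bytes) -> list:
    """Build a delta from the signature and Y only; X is never read here."""
    index = {h: n for n, h in enumerate(sig)}
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        block = new[pos:pos + BLOCK_SIZE]
        n = index.get(strong_hash(block)) if len(block) == BLOCK_SIZE else None
        if n is not None:
            if literal:  # flush pending literal bytes before the copy
                delta.append(("literal", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", n))
            pos += BLOCK_SIZE
        else:
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta) -> bytes:
    """Server side: rebuild Y from X and the delta."""
    out = bytearray()
    for op, arg in delta:
        if op == "copy":
            out += old[arg * BLOCK_SIZE:(arg + 1) * BLOCK_SIZE]
        else:
            out += arg
    return bytes(out)

X = b"AAAAAAAABBBBBBBBsecret!!"
Y = b"AAAAAAAAnewdata!secret!!"
sig = signature(X)
delta = make_delta(sig, Y)
```

In this sketch `make_delta` reads only the signature and Y, never X, which is relevant to question 2.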
tjdett
  • The rsync algorithm uses a variable block size for the signature, depending on the size of the file. I don't recall the specifics, but check what it does for small files; it may expose this approach to birthday attacks. In general it optimizes for performance, not secrecy, so there may be other weaknesses. – Jonah Benton Aug 29 '16 at 12:01
  • The need this approach is meant to serve is not clear, in particular the choice of a method that mixes shared/common data with private data. This can introduce other weaknesses. It may be worth reframing the question around the need rather than around a solution, as other approaches may emerge. – Jonah Benton Aug 29 '16 at 12:05

0 Answers