1

I am writing a cloud backup system, and want to use a checksum to know if a file has been modified, and accordingly sync it with the server.

This question shows that xxHash is super fast, and this shows that it is significantly worse in terms of collisions (albeit this uses 64bit and I plan on using 128bit xxHash) is it safe to use xxHash for this use case? should I be worried about modifications not syncing in an edge case due to collisions? should I use other algorithms like cityhash or murmur or stick to MD5 or CRC?

Note: My input files can be of any size, and I need decent performance if a large input file (many GB large) is given.

t348575
  • 113
  • 2
  • 1
    Considering that more and more systems do provide SHA-1 hardware implementations (e.g. Intel CPUs ) I would also consider SHA-1 if 160bit is not too large for you. BTW: If you want a really fast hashing algorithm combine a custom hash algorithm with a merkle tree. That allows you to perform hashing in parallel, which will speed-up hash generation for large files on multi-core CPUs and you could even use it for block based detection of changes (only upload changed sections). – Robert May 13 '22 at 17:06

1 Answers1

2

You need to consider what you need out of a hash function and what the consequences of a collision are. If a collision means that the file is not backed up when it should be, then the user will experience data loss, and it's probably not going to be acceptable to allow any collisions at all.

In such a case, you want a fast cryptographic hash. Where hardware permits it, SHA-256 is extremely fast and is cryptographically secure, which means finding collisions is functionally impossible. However, older Intel machines don't support hardware acceleration for SHA-256, so I'd personally recommend BLAKE2b, which is extremely fast even in plain C and also cryptographically secure. If you have a suitable implementation, you could even use BLAKE3, which is even faster and cryptographically secure, although it is less conservative, so it isn't my preferred choice.

You should not use MD5 or SHA-1 for this purpose. Both have known collisions and aren't secure, and BLAKE2b (as well as BLAKE3) is faster than both. It's grossly irresponsible to use MD5 or SHA-1 in new software.

bk2204
  • 7,828
  • 16
  • 15