For small files, hashing is fine, but with huge ones you can easily find that md5sum is CPU-bound. Is there any hashing algorithm able to scale out across multiple cores? Any workarounds? Ideas? Anything? :)
- _Huge ones_ is plural and can be scaled out to multiple cores by hashing more than one file at a time. One way to do that in the shell is using [GNU Parallel](http://www.gnu.org/software/parallel/) – Brian Jun 26 '16 at 16:27
- Did you try fciv.exe on Windows to see whether it is better suited to a multicore CPU? (https://support.microsoft.com/en-ca/kb/841290) – yagmoth555 Jun 29 '16 at 12:54
- @yagmoth555, nope. I rarely use Windows; mostly I don't use it at all, I'd say. From the description it looks unlikely that it scales on SMP. – poige Jun 29 '16 at 13:37
- In my experience, hashing is bound by disk I/O, at least on a desktop. Besides, a big task usually involves many files, so several files can be hashed in parallel. – mmv-ru Jun 29 '16 at 00:08
7 Answers
My own best solution at the moment is:
parallel --block=512M --pipepart -a …HUGEFILE… --progress --recend '' \
-k -j …NUMofProcessesSay4… md5sum | md5sum
It should be noted that:
- The resulting MD5 hash isn't of the file itself but of the MD5s of its parts; still, it lets you compare whether a replica is identical to the origin (see the sketch after this list).
- It also doesn't perform very well, especially when you use a pipe and not a file as input (`parallel`'s `--pipepart`, as I found out, doesn't support disk partitions).
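For illustration, a hedged sketch of how that hash-of-hashes can be used to compare a source and a replica; the host names, path and job count are placeholders, and the options have to match on both sides or the final digests won't be comparable:

    # Same pipeline on both hosts; block size, --recend and -j must be identical.
    ssh src "parallel --block=512M --pipepart -a /data/HUGEFILE --recend '' -k -j4 md5sum | md5sum" > src.md5
    ssh dst "parallel --block=512M --pipepart -a /data/HUGEFILE --recend '' -k -j4 md5sum | md5sum" > dst.md5
    diff src.md5 dst.md5 && echo "replica matches origin"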
So I'd love to hear about other workarounds as well.
- Git version of GNU Parallel now supports block devices (on GNU/Linux at least). Thanks for the idea. – Ole Tange Jul 04 '16 at 02:40
- This should have been put as part of the question rather than as an answer. – mc0e Jul 05 '16 at 21:19
Unfortunately, MD5 is a linear process whose state depends on all previous input. In other words, you can't truly parallelize it. Moreover, I'm not aware of any real hashing algorithm that does not operate in this manner.
What you can do (and, based on your answer, you are doing) is split the source file and concurrently calculate each chunk's md5sum.
If you can't or won't do that, you have to use a faster hash function such as xxHash, CityHash or SpookyHash.
Another idea (maybe applicable to your intended usage): if you need something faster than MD5 (albeit single-threaded), you can use CRC32 (which is hardware-accelerated on recent CPUs) for a first fast pass, resorting to MD5/SHA1 for a second pass on seemingly identical files.
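To illustrate that two-pass approach, here is a rough shell sketch; the `/data` directory and the use of `cksum` as the stand-in fast checksum (rather than a hardware-accelerated CRC32) are assumptions, not part of the original suggestion:

    # Pass 1: fast checksum of every file, several in parallel.
    find /data -type f -print0 | xargs -0 -P"$(nproc)" -n1 cksum > pass1.txt

    # Pass 2: only files sharing a (checksum, size) pair can be identical,
    # so confirm just those candidates with md5sum.
    awk 'NR==FNR { c[$1" "$2]++; next } c[$1" "$2] > 1' pass1.txt pass1.txt \
      | cut -d' ' -f3- \
      | xargs -d '\n' -P"$(nproc)" -n1 md5sum > pass2.txt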
- The most valuable part of your answer (the rest is just boring repetition) is the list of possibly faster hashes. Thanks anyway. – poige Jun 29 '16 at 13:39
- BTW: http://stackoverflow.com/questions/10726325/can-md5-be-broken-up-to-run-across-multiple-cores-threads#comment13932768_10726325 – poige Jun 29 '16 at 14:07
- @poige Did you read the other comments in the thread you linked? In the context of single-input hashing, MD5 cannot be parallelized, simply because it is a linear process (i.e. the current state depends on previous inputs). – shodanshok Jun 29 '16 at 14:55
- @shodanshok The issue with the second approach is that, for valid matching of a large directory with many large files, if all the files are identical you are adding a lot of overhead and still running md5sum on every file. – Jedi Jul 02 '16 at 16:54
- @Jedi Sure, it is only a quick workaround for the non-parallelizable nature of MD5. If many files are identical, you may end up with a slowdown. – shodanshok Jul 02 '16 at 18:51
There's pretty much no getting around processing the entire file. MD4 or CRC32 are probably your best bets for a widely deployed and fast algorithm (though CRC32 is going to be far less effective than MD4).
Testing various implementations of your algorithm of choice will help. If you can find a well-tested asm implementation, it will likely improve upon the performance of its C/C++ cousins.
If you don't really care about interoperability, hashing across multiple cores is easily doable by splitting the file into chunks (doesn't need to be done on-disk, you'd just start reading from specific offsets) and processing each chunk separately (this will result in serious disk thrashing though, degrading the performance, especially for mechanical disks). You'll end up with separate hashes for each chunk (though this has other advantages, like pointing you toward the broken chunk) but you could always hash them together for one final value.
This Gist could be a good start for something in Python.
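For what it's worth, here is a minimal shell sketch of the chunks-at-offsets idea, assuming GNU coreutils; the file name and chunk size are placeholders:

    #!/bin/sh
    FILE=HUGEFILE                          # placeholder input file
    BS=$((512 * 1024 * 1024))              # 512 MiB per chunk
    SIZE=$(stat -c%s "$FILE")
    CHUNKS=$(( (SIZE + BS - 1) / BS ))
    TMP=$(mktemp -d)

    i=0
    while [ "$i" -lt "$CHUNKS" ]; do
        # Each dd starts reading at offset i*BS, so no on-disk split is needed.
        dd if="$FILE" bs="$BS" skip="$i" count=1 2>/dev/null \
            | md5sum > "$TMP/$i.md5" &
        i=$((i + 1))
    done
    wait

    # Concatenate the per-chunk digests in order and hash them once more
    # to get a single final value.
    i=0
    while [ "$i" -lt "$CHUNKS" ]; do cat "$TMP/$i.md5"; i=$((i + 1)); done | md5sum
    rm -r "$TMP"

A real implementation would cap the number of concurrent chunks (say, to the core count) to limit memory use and the disk thrashing mentioned above.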
- `split` does indeed come to one's mind, but alas, when we talk about huge files (as we do), it's not an option. – poige Jun 26 '16 at 13:24
- @poige As I said, you wouldn't do it on disk, just start hashing the file from specific offsets and stop at the start of the next chunk. – Gary Jun 26 '16 at 13:27
- @poige From your question I can't guess why that approach would be impractical. Perhaps there's a constraint you forgot to include? – Gary Jun 26 '16 at 14:33
- I didn't say it's impractical; but it's not practical because you don't have anything in your answer that can be used right away. For example, this is a practical answer: http://serverfault.com/questions/488486/debian-port-80-is-blocked-but-i-dont-know-by-what/488491#488491 – poige Jun 26 '16 at 14:46
Most of the answers here have addressed the linear nature of most hashing algorithms. Although I'm sure there exist some truly scalable hashing algorithms, an easier solution is simply to split the data up into smaller pieces and hash each piece individually.
Consider the BitTorrent approach: When a Torrent is created, all of the files are split into 'blocks', each block individually hashed, and each of those hashes recorded in the .torrent file. This is what allows a peer to incrementally verify incoming data, without having to wait for the entire file to finish downloading first. Errors can also be corrected on a per-block basis, instead of requiring re-transmission of the entire file. Aside from the logistical benefits, this approach also allows hashing to scale across multiple cores - if 8 cores are available, 8 blocks can be simultaneously hashed.
If you engineer your verification process to work on some subset of the data, e.g. blocks of some fixed size, you can hash each block on a separate core, thus eliminating a large amount of delay in the pipeline. Obviously, this approach has a small time/memory trade-off: each additional instance of hashing has some overhead associated with it, mostly in the form of memory, although this is minimal unless you're running hundreds of instances.
I'm working on a tree hashing project, which is designed for exactly this problem: off-the-shelf parallel hashing of large files. It works now, though it hasn't been reviewed, and there's a good chance that changes from review will result in changes to the final digest. That said, it's very fast: https://github.com/oconnor663/bao
You can use md5deep for this and hashdeep for other hashes. It supports multi-threading with the `-j` flag. By default it will create a hashing thread for each core. It also has a flag to break files into pieces before hashing, but it will not use multiple threads on a single file. I've used this to get SHA-256 hashes of half a million files and it worked great. It also has a recursive flag, which makes handling large directory trees easier.
Here is the man page: http://md5deep.sourceforge.net/md5deep.html and the git repo: https://github.com/jessek/hashdeep
The package name in Ubuntu and Debian is md5deep, and it includes hashdeep.
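For example (the directory is a placeholder, and the size syntax for the piecewise flag is an assumption; check the man page linked above):

    # Recursively hash a tree; by default md5deep starts one hashing thread per core.
    md5deep -r /data/tree > md5s.txt

    # Cap the thread count explicitly with -j:
    md5deep -r -j4 /data/tree > md5s.txt

    # Piecewise mode (the "break files into pieces" flag mentioned above);
    # the "512m" size syntax is an assumption, check the man page.
    md5deep -p 512m -r /data/tree > piecewise.txt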
- From the man page I'd expect `-j` to support a thread per each file given, not per its parts. – poige Jul 03 '16 at 21:28
It's easy to design a hashing algorithm that is scalable over multiple cores; it's just that the best-known hashing algorithms tend to be designed specifically to prevent that, so that tasks like finding hash collisions are made as slow as possible.
Hashing functions that don't force serial processing may suit you, but that depends on what properties you expect of your hashing function. As such, I don't think you've given enough information for a good recommendation to be made.
As others have suggested, you can construct a hashing function as the hash of the concatenated hashes of each of the blocks of some given size in the original. So long as the block size is large enough to make it difficult to reverse the hashes of individual blocks, this is likely to work well enough for most purposes. How big that should be depends on how predictable the content of those blocks is. If you are able to estimate the entropy, and choose a block size such that you get 128+ bits of entropy per block, that should be sufficient for most purposes (and overkill for many where security is not the primary concern).
From a security point of view, you are concerned about the degree of entropy at the block level, because otherwise finding a collision for a single block is enough to enable a malicious actor to substitute part of the content and get the same final hash.
It's perhaps worth noting that having a fixed block size means that the main weakness of MD5 is irrelevant: the attacker cannot append extra data to the block.
If your needs are about preventing naturally occurring hash collisions as opposed to malicious ones, then you can no doubt afford to use a much faster checksum function. Cryptographically secure hashes are typically designed to be slow to calculate.
A function from the Skein family using the optional hash-tree mode might suit you. Then again, CRC32 might be all you need.
- I could throw in some stuff about the controversy around SHA-3 (i.e. Keccak) if you like, but perhaps you should just read about it on Wikipedia. – mc0e Jul 04 '16 at 07:59