Hashing a large file on an embedded system

Question

I'm trying to come up with a way to verify the authenticity of a file that I'm downloading from a server to an embedded system. I'm thinking of using a hash (SHA256 preferably). My main concern is that the file size might be too large to load it into a buffer and pass it to the hashing function.

When searching online, I came across this answer (https://crypto.stackexchange.com/a/19805) that talks about computing a hash tree instead of a hash for the whole file. Is this a valid approach? Does anyone have any advice for doing this in a different way?

score 7 · Answer 1 · answered Feb 12 '18 at 21:57

As long as you receive the file in order from beginning to end, you can calculate the hash incrementally. Any cryptography API that's suitable for embedded systems (and most of the ones that aren't) lets you calculate hash by passing successive chunks of the input. Each time you've received a chunk, feed the input to the hash calculation and append it to a file. When the transmission is over, finalize the hash calculation, and erase the file if the hash is incorrect.

Note that a hash can only verify the integrity of a file, not the authenticity. This means that the device needs to know the hash value in advance, so you need to transmit the hash through another channel. This is typically done by using a slow, but trusted channel to transmit the hash (e.g. end-to-end TLS to a trusted server), and a channel with more bandwidth but less security for the file itself (e.g. a broadcast mechanism).

A hash tree is only useful if you need to be able to verify parts of the file independently without having to retrieve the whole file. If you're getting the file sequentially and fully, it isn't useful.

My library(mbedtls) did support streaming. I was able to use that feature to generate the hash! — rorschach, Feb 13 '18 at 18:44

score 2 · Answer 2 · edited May 20 '21 at 10:49

The memory requirements for SHA-2 are independent of the size of the message.

The way most cryptographic hashes work isn't exactly intuitive. All it is is a function that starts with an initial value and takes a single input of its block size, which permutes the digest. There is no internal state for the hash, and each subsequent block simply permutes the digest a bit more. A given digest, plus a given input block, will always result in the same new digest. This is one of the ideas behind the Merkle–Damgård construction, which MD5 and SHA-2 use. The hashing operation on each input block is individual and atomic, without respect to the entire message (although the final padding block encodes the total length of the message being hashed). Hashing a file simply involves doing this repeatedly, starting with an initial value specified by the standard, and ending when there are no more blocks to be hashed. You only need to store three things to compute a SHA-2 digest:

The current digest, or the initial value. This is the size of the hash.
The next input block (512 bits or 1024 bits, depending on the hash size).
The internal state during the hashing operation, which is the same for each block.

From FIPS-180-4 § 5.2, the message is parsed by breaking it up into individual blocks:

5.2    Parsing the message
The message and its padding must be parsed into N m-bit blocks.

5.2.1  SHA-1, SHA-224 and SHA-256
For SHA-1, SHA-224 and SHA-256, the message and its padding are parsed into N
512-bit blocks, M⁽¹⁾, M⁽²⁾,..., M^(N). Since the 512 bits of the input block may be
expressed as sixteen 32-bit words, the first 32 bits of message block i are
denoted M₀⁽ⁱ⁾, the next 32 bits are M₁⁽ⁱ⁾, and so on up to M₁₅⁽ⁱ⁾.

5.2.2  SHA-334, SHA-512, SHA-512/224 and SHA-512/256
For SHA-384, SHA-512, SHA-512/224 and SHA-512/256, the message and its padding are
parsed into N1024-bit blocks, M⁽¹⁾, M⁽²⁾,..., M^(N). Since the 1024 bits of the input
block may be expressed as sixteen 64-bit words, the first 64 bits of message block
i are denoted M₀⁽ⁱ⁾, the next 64 bits are M₁⁽ⁱ⁾, and so on up to M₁₅⁽ⁱ⁾.

For SHA-1, SHA-256, and SHA-512, the internal state size is as large as the digest (the other hashes like SHA-224 and SHA-512/224 are simply truncated versions of larger versions, with different starting values). This means, for your use case with SHA-256, you only need to be able to store 256 bits (32 bytes) of data to compute the digest of data of any size, from 0 bits to 2⁶⁴ - 1 bits (this limit is due to the padding used, which encodes the size of the message in a 64-bit block). Various optimizations may increase the memory requirements, but not enough to be of a concern for an embedded system.

Some newer hash functions have an internal state larger than their output digest size in order to avoid length extension attacks, where, without knowing m and having only its digest, you can calculate the digest of m', where m' begins with the contents of m. SHA-3 (regardless of digest size) for example has a larger internal state of 1600 bits to prevent this. It ensures that any single hashing operation is dependent on the internal state, which is not revealed at the end when the final digest is released.

score 1 · Answer 3 · answered Feb 12 '18 at 21:38

1

Virtually every hash function out there, and certainly every hash function worth using (yes, that includes SHA-256), can be used in a streaming fashion. You don't need to give it the entire file to be hashed in a single chunk; instead, you can feed it the file in small pieces and get the final hash value when you're done.

answered Feb 12 '18 at 21:38

Mark

34,390
9
85
134

In fact, you can't give it in a single chunk. SHA256 requires you give it inputs in 64 byte blocks. – forest Feb 13 '18 at 02:48

Hashing a large file on an embedded system

3 Answers3

Linked