Rolling hash

A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters.

One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below. Another popular application is the rsync program, which uses a checksum based on Mark Adler's Adler-32 as its rolling hash. Low Bandwidth Network Filesystem (LBFS) uses a Rabin fingerprint as its rolling hash. FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.

At best, rolling hash values are pairwise independent[1] or strongly universal. They cannot be 3-wise independent, for example.

Polynomial rolling hash

The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:

$$H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + \cdots + c_k a^{0},$$

where $a$ is a constant, and $c_1, \ldots, c_k$ are the input characters (but this function is not a Rabin fingerprint, see below).

In order to avoid manipulating huge $H$ values, all math is done modulo $n$. The choice of $a$ and $n$ is critical to get good hashing; see linear congruential generator for more discussion.

Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum $H$ by $a$. Shifting all characters by one position to the right requires dividing the entire sum $H$ by $a$. Note that in modulo arithmetic, $a$ can be chosen to have a multiplicative inverse $a^{-1}$ by which $H$ can be multiplied to get the result of the division without actually performing a division.
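
As a rough illustration of these updates, the following Python sketch evaluates the polynomial hash c_1·a^(k-1) + ... + c_k·a^0 (mod n) directly and then slides it one character at a time. The base a = 256, the prime modulus n = 1,000,000,007, and all function names are assumptions chosen for the example, not values prescribed by the algorithm.

```python
def poly_hash(window, a=256, n=1_000_000_007):
    """Direct evaluation of c_1*a^(k-1) + ... + c_k*a^0 (mod n) via Horner's rule."""
    h = 0
    for c in window:
        h = (h * a + c) % n
    return h


def roll(h, old, new, k, a=256, n=1_000_000_007):
    """Slide the window by one: drop `old` (the first character), append `new`."""
    top = pow(a, k - 1, n)       # weight a^(k-1) of the first character
    h = (h - old * top) % n      # subtract the first term
    return (h * a + new) % n     # shift left by one position, add the new last term


def drop_last(h, last, a=256, n=1_000_000_007):
    """Undo an append: subtract the last term, then 'divide' by a via its inverse.

    pow(a, -1, n) needs Python 3.8+ and requires a and n to be coprime.
    """
    return ((h - last) * pow(a, -1, n)) % n


def rolling_hashes(data, k, a=256, n=1_000_000_007):
    """Yield the hash of every length-k window of the bytes object `data`."""
    if len(data) < k:
        return
    h = poly_hash(data[:k], a, n)
    yield h
    for i in range(k, len(data)):
        h = roll(h, data[i - k], data[i], k, a, n)
        yield h
```

Each call to roll costs a constant number of modular operations once a^(k-1) mod n is known; a real implementation would compute that power once rather than on every call.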

Rabin fingerprint

The Rabin fingerprint is another hash, which also interprets the input as a polynomial, but over the Galois field GF(2). Instead of seeing the input as a polynomial of bytes, it is seen as a polynomial of bits, and all arithmetic is done in GF(2) (similarly to CRC32). The hash is the result of the division of that polynomial by an irreducible polynomial over GF(2). It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.
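
The idea can be sketched in Python by storing a GF(2) polynomial as an integer whose bits are the coefficients, so that addition and subtraction are both XOR. The sketch below is a toy: it assumes the degree-8 polynomial x^8 + x^4 + x^3 + x + 1 (0x11B, the irreducible polynomial used by AES), whereas practical fingerprints use much higher degrees and precomputed tables. The function names are hypothetical.

```python
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2) (toy choice)


def gf2_mod(x, poly=POLY):
    """Remainder of the GF(2) polynomial x divided by poly."""
    deg = poly.bit_length() - 1
    while x.bit_length() > deg:
        # Cancel the leading term of x with a shifted copy of poly.
        x ^= poly << (x.bit_length() - 1 - deg)
    return x


def fingerprint(window):
    """Fingerprint of a byte string read as one long polynomial of bits."""
    f = 0
    for b in window:
        f = gf2_mod((f << 8) ^ b)   # multiply by x^8, add the next byte
    return f


def roll(f, old, new, k):
    """Update the fingerprint of a k-byte window using only the leaving
    byte `old` and the entering byte `new`."""
    # After the shift, the leaving byte carries weight x^(8k); XOR removes it.
    return gf2_mod((f << 8) ^ new ^ (old << (8 * k)))
```

Because reduction modulo the polynomial is linear over GF(2), removing the leaving byte is a single XOR of its reduced contribution, which is what makes the fingerprint usable as a rolling hash.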

Because the Rabin fingerprint shares its author with the Rabin–Karp string search algorithm, which is often explained with a different, simpler rolling hash, and because that simpler rolling hash is also a polynomial, the two rolling hashes are often mistaken for each other. The backup software restic uses a Rabin fingerprint for splitting files, with blob size varying between 512 bytes and 8 MiB.[2]

Cyclic polynomial

Hashing by cyclic polynomial[3] (sometimes called Buzhash) is also simple, but it has the benefit of avoiding multiplications, using barrel shifts instead. It is a form of tabulation hashing: it presumes that there is some hash function $h$ from characters to integers in the interval $[0, 2^L)$. This hash function might be simply an array or a hash table mapping characters to random integers. Let the function $s$ be a cyclic binary rotation (or circular shift): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., $s(101) = 011$. Let $\oplus$ be the bitwise exclusive or. The hash values are defined as

$$H = s^{k-1}(h(c_1)) \oplus s^{k-2}(h(c_2)) \oplus \cdots \oplus s(h(c_{k-1})) \oplus h(c_k),$$

where the multiplications by powers of two can be implemented by binary shifts. The result is a number in $[0, 2^L)$.

Computing the hash values in a rolling fashion is done as follows. Let $H$ be the previous hash value. Rotate $H$ once: $s(H)$. If $c_1$ is the character to be removed, rotate it $k$ times: $s^{k}(h(c_1))$. Then simply set

$$H \leftarrow s(H) \oplus s^{k}(h(c_1)) \oplus h(c_{k+1}),$$

where $c_{k+1}$ is the new character.

Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first $L-k+1$ bits. That is, take the result $H$ and dismiss any $k-1$ consecutive bits.[1] In practice, this can be achieved by an integer division $H \rightarrow H \div 2^{k-1}$.
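
The direct definition, the rolling update, and the truncation step can be sketched in Python as follows. The width L = 32, the seeded random table standing in for the character hash h, and the function names are all assumptions made for this example.

```python
import random

L = 32                                   # hash width in bits (assumed)
_rng = random.Random(12345)              # fixed seed: reproducible table
TABLE = [_rng.getrandbits(L) for _ in range(256)]  # h: byte -> L-bit integer


def rotl(x, r=1, width=L):
    """Cyclic left rotation of a width-bit integer by r positions (s applied r times)."""
    r %= width
    mask = (1 << width) - 1
    return ((x << r) | (x >> (width - r))) & mask


def buzhash(window):
    """Direct evaluation: s^(k-1)(h(c_1)) xor ... xor s(h(c_(k-1))) xor h(c_k)."""
    h = 0
    for c in window:
        h = rotl(h) ^ TABLE[c]
    return h


def buzhash_roll(h, old, new, k):
    """Rotate H once, cancel the leaving byte rotated k times, add the new byte."""
    return rotl(h) ^ rotl(TABLE[old], k) ^ TABLE[new]


def truncate(h, k):
    """Keep the top L-k+1 bits by integer division by 2^(k-1); assumes k <= L."""
    return h >> (k - 1)
```

For example, with a 3-bit width, rotl(0b101, 1, 3) gives 0b011, matching the rotation example in the text.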

Content-based slicing using a rolling hash

One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when only the changed chunks of a large file should be sent over a network: with fixed-size chunks, inserting a single byte at the front of the file shifts every chunk boundary, so every chunk appears to have changed even though only the first "chunk" has actually been modified.

The simplest approach to calculating the dynamic chunks is to compute the rolling hash and declare a chunk boundary wherever it matches a pattern (for example, the lower N bits are all zero). This approach ensures that any change in the file affects only its current and possibly the next chunk, but nothing else.

When the boundaries are known, the chunks need to be compared by their hash values to detect which one was modified and needs transfer across the network.[4] The backup software Attic uses a Buzhash algorithm with a customizable chunk size range for splitting file streams.[5]
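
A minimal Python sketch of this scheme follows. It assumes a polynomial rolling hash over the last k = 16 bytes (base 256, modulus 1,000,000,007) and a boundary wherever the low 6 bits of the hash are zero; these parameters and the function name are illustrative choices, not taken from any particular tool.

```python
import random


def chunk(data, k=16, bits=6, a=256, n=1_000_000_007):
    """Split `data` after every position where the hash of the last k bytes
    has its low `bits` bits all zero."""
    mask = (1 << bits) - 1
    top = pow(a, k - 1, n)                   # weight of the oldest byte
    chunks, start, h = [], 0, 0
    for i, c in enumerate(data):
        if i >= k:
            h = (h - data[i - k] * top) % n  # drop the byte leaving the window
        h = (h * a + c) % n                  # append the new byte
        if i >= k - 1 and (h & mask) == 0:   # full window and pattern matches
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing bytes form the last chunk
    return chunks


# Usage: prepend one byte and compare the two chunk lists.
rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
original = chunk(data)
shifted = chunk(b"\x00" + data)
unchanged = len(set(original) & set(shifted))
```

Because each boundary depends only on the k bytes before it, every chunk after the first one reappears unchanged in the shifted input, so only the leading chunk (or two) would need to be transferred again.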

Content-based slicing using moving sum

Several programs, including gzip (with the --rsyncable option) and rsyncrypto, do content-based slicing based on this specific (unweighted) moving sum:[6]

$$S(n) = \sum_{i=n-8195}^{n} c_i,$$

$$H(n) = S(n) \bmod 4096,$$

where

  • $S(n)$ is the sum of 8196 consecutive bytes ending with byte $n$ (requires 21 bits of storage),
  • $c_i$ is byte $i$ of the file,
  • $H(n)$ is a "hash value" consisting of the bottom 12 bits of $S(n)$.

Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.

For every $n$ where $H(n) = 0$, these programs cut the file between $n$ and $n+1$. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but no other chunk.
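
The moving sum can be sketched in Python as below, using the window of 8196 bytes and the bottom 12 bits (sum modulo 4096) described above. The function name is hypothetical, and skipping positions before the first full window is an assumption of this sketch; the actual programs may treat the start of the file differently.

```python
WINDOW = 8196    # window length in bytes
MODULUS = 4096   # 2^12: keep the bottom 12 bits of the sum


def cut_points(data, window=WINDOW, modulus=MODULUS):
    """Return every index n at which the sum of the `window` bytes ending at
    byte n is 0 modulo `modulus`; the cut falls between n and n + 1."""
    cuts, s = [], 0
    for n, c in enumerate(data):
        s += c                          # add the new byte
        if n >= window:
            s -= data[n - window]       # subtract the byte that left the window
        if n >= window - 1 and s % modulus == 0:
            cuts.append(n)
    return cuts
```

The window and modulus are exposed as parameters only so that the function can be tried on small inputs, for example cut_points(bytes(10), window=4, modulus=16).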

Gear Fingerprint and Content-based chunking algorithm FastCDC

The Content-Defined Chunking (CDC) algorithm needs to compute the hash value of a data stream byte by byte and split the data stream into chunks when the hash value meets a predefined value. However, processing the stream byte by byte introduces heavy computational overhead. FastCDC[7] proposes a new and efficient Content-Defined Chunking approach. It uses a fast rolling Gear hash algorithm,[8] skips the minimum chunk length, normalizes the chunk-size distribution, and, last but not least, rolls two bytes at a time to speed up the CDC algorithm, which can achieve about 10 times higher throughput than Rabin-based CDC approaches.[9]

Pseudocode for the basic version is as follows:

algorithm FastCDC
    input: data buffer src,
           data length n
    output: cut point i

    MinSize ← 2KB     // minimum chunk size is 2 KB
    MaxSize ← 64KB    // maximum chunk size is 64 KB
    fp ← 0
    i ← MinSize
    Mask ← 0x0000d93003530000LL

    // buffer is no larger than the minimum chunk size
    if n ≤ MinSize then
        return n
    if n ≥ MaxSize then
        n ← MaxSize

    while i < n do
        fp ← (fp << 1) + Gear[src[i]]
        if !(fp & Mask) then
            return i
        i ← i + 1    // advance to the next byte

    return i

Here Gear is a precomputed hashing array. FastCDC uses the Gear hashing algorithm, which computes rolling hash results quickly while keeping their distribution as uniform as that of Rabin hashing. Compared with the traditional Rabin hashing algorithm, it is much faster. Experiments suggest that it generates nearly the same chunk-size distribution in a much shorter time (about 1/10 of the time of Rabin-based chunking[9]) when segmenting the data stream.
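
The basic algorithm can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: the Gear table here is filled with seeded pseudo-random 64-bit values rather than the table used by the FastCDC authors, the fingerprint is truncated to 64 bits to mimic fixed-width integer arithmetic, and the split helper is hypothetical.

```python
import random

MIN_SIZE = 2 * 1024          # minimum chunk size, 2 KB
MAX_SIZE = 64 * 1024         # maximum chunk size, 64 KB
MASK = 0x0000D93003530000
U64 = (1 << 64) - 1          # emulate 64-bit wraparound

_rng = random.Random(2016)   # arbitrary fixed seed (assumption)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # stand-in Gear table


def cut_point(src):
    """Return the length of the first chunk of the bytes object `src`."""
    n = len(src)
    if n <= MIN_SIZE:        # buffer no larger than the minimum chunk size
        return n
    n = min(n, MAX_SIZE)
    fp = 0
    i = MIN_SIZE             # skip the first MIN_SIZE bytes entirely
    while i < n:
        fp = ((fp << 1) + GEAR[src[i]]) & U64
        if (fp & MASK) == 0:
            return i
        i += 1
    return i


def split(data):
    """Repeatedly apply cut_point to break `data` into chunks."""
    chunks = []
    while data:
        cut = cut_point(data)
        chunks.append(data[:cut])
        data = data[cut:]
    return chunks
```

The left shift pushes each byte's contribution out of the 64-bit fingerprint after 64 steps, so no explicit subtraction of the leaving byte is needed; this is what makes the Gear update so cheap.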

Computational complexity

All rolling hash functions are linear in the number of characters, but their complexity with respect to the length of the window ($k$) varies. The Rabin–Karp rolling hash requires the multiplication of two $k$-bit numbers; integer multiplication is in $O(k \log k \, 2^{O(\log^* k)})$.[10] Hashing n-grams by cyclic polynomials can be done in linear time.[1]

Footnotes

  1. Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698–710, 2010. arXiv:0705.4676.
  2. "References — restic 0.9.0 documentation". restic.readthedocs.io. Retrieved 2018-05-24.
  3. Jonathan D. Cohen, Recursive Hashing Functions for n-Grams, ACM Trans. Inf. Syst. 15 (3), 1997.
  4. Horvath, Adam (October 24, 2012). "Rabin Karp rolling hash - dynamic sized chunks based on hashed content".
  5. "Data structures and file formats — Borg - Deduplicating Archiver 1.1.5 documentation". borgbackup.readthedocs.io. Retrieved 2018-05-24.
  6. "Rsyncrypto Algorithm".
  7. Xia, Wen; Zhou, Yukun; Jiang, Hong; Feng, Dan; Hua, Yu; Hu, Yuchong; Liu, Qing; Zhang, Yucheng. "FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication". USENIX. Retrieved 2020-07-24.
  8. Xia, Wen; Jiang, Hong; Feng, Dan; Tian, Lei; Fu, Min; Zhou, Yukun (2014). "Ddelta: A deduplication-inspired fast delta compression approach". Performance Evaluation. 79: 258–272. doi:10.1016/j.peva.2014.07.016. ISSN 0166-5316.
  9. "The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems". IEEE Journals & Magazine. 2020-06-16. Retrieved 2020-07-22.
  10. M. Fürer, Faster integer multiplication, in: STOC ’07, 2007, pp. 57–66.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.