calculate checksum of every file that is being written to HDD


Is it possible to automatically calculate a checksum of every file that is being written to the HDD? My OS is Linux. I've read that Btrfs stores some kind of checksums for files. Would it be possible to dump those checksums? How about other file systems?

Wakan Tanka

Posted 2016-01-20T20:42:41.063

Reputation: 609

Answers


Btrfs, ZFS and, on Windows, ReFS are among the major file systems that offer built-in data integrity checking as a feature. This is accomplished by calculating a checksum during writes and storing that checksum along with the data. The checksum is usually stored in a different physical disk location, to avoid a local error corrupting both the data and the checksum, and to allow a failed or misaligned write to be detected (where the drive reports the write as successful, but it didn't "stick" or the data was written to the wrong physical location).

However, this feature doesn't work exactly like you think. In short, ZFS works at the block level, and other file systems are similarly designed. This avoids the overhead of having to rewrite (or recompute the checksum over the entirety of) a large file for a trivial change; rather, just the changed blocks need to have their integrity data recalculated. With large files where small, in-place changes are common, such as VM disk images, this boils down to a very noticeable difference. Fixed block sizes are basically becoming a thing of the past at this point; I don't know about the others, but ZFS uses a variable block size of anything from a sector (normally 512 or 4096 bytes) up to a few hundred kilobytes to a megabyte. With a file system based on block-level data integrity checking, these file chunks are the best you could hope to be able to extract checksums of. And let's not even get into the question of, for example, deduplicated data storage...

Your question is similar to Is it possible to access the ZFS checksums to compare files on Server Fault, and while your question covers more file systems than that specific one, I believe the answer by jlp applies anyway:

I don't believe it is possible to extract the block level checksums from a ZFS filesystem, but since the checksums are at the block level, not the file level, it probably wouldn't help you anyway.

That's not to say what you are looking for cannot be accomplished. In fact, with what's available on Linux, one could probably cobble together a solution using tools like inotify and your checksum calculation program of choice to calculate checksums of files whenever they are written to. Windows offers similar programming interfaces that can almost certainly be pressed into service. This should be equally doable on top of any file system, because you're basically just tapping into the ordinary I/O workflow, not altering the on-disk data by any special means. (You would have to exclude the file that you use to store the checksums from this, obviously.)
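For example, a minimal sketch of that idea on Linux (assuming the inotify-tools package is installed; /data and /var/tmp/checksums.txt are placeholder paths, not anything prescribed by the question) could look like this:

# Watch /data recursively and recompute a SHA-256 checksum every time a
# file is closed after being written to. Requires inotify-tools.
inotifywait -m -r -e close_write --format '%w%f' /data |
while read -r file; do
    # Keep the checksum log outside the watched tree (or explicitly skip it)
    # so our own writes don't re-trigger the loop.
    sha256sum "$file" >> /var/tmp/checksums.txt
done

Note that close_write only fires for writes made through the normal file APIs while the watcher is running; files changed while it isn't running go unnoticed.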

That by itself, however, only gets you halfway. The real killer feature about file systems that do data integrity checking isn't that they calculate the checksum on writes; it's that they do so because it enables them to automatically and forcibly verify the checksum on reads. That way, you can be certain that you either get valid data back, or an I/O error; anything short of perfection will make the computer loudly proclaim that there is a problem with your storage and/or use redundant data to fix it on its own. Since this is done at the file system level by the operating system, the only way to get around it is to deliberately read the disk directly, bypassing the file system layer entirely; almost no user-space software does this. (Defragmenters and file system integrity checkers come to mind as two major categories of software that have reason to. It's also worth noting here that at least for ZFS, I'm not aware of any commonly available data recovery software that can work with a ZFS pool that the ZFS tools themselves for whatever reason cannot import. The ZFS tools have some options geared toward attempting recovery of un-importable pools, but if those fail, you may very well be out of luck.)

A more practical solution to file integrity checking, if you don't want to go all-out with something like ZFS, Btrfs or ReFS, or if you really need whole-file checksums, or if you need to detect directory content changes, is a tool like hashdeep, which can be used to calculate and validate hashes on an entire directory tree. In the words of that project's official web site:

hashdeep is a program to compute, match, and audit hashsets. With traditional matching, programs report if an input file matched one in a set of knowns or if the input file did not match. It's hard to get a complete sense of the state of the input files compared to the set of knowns. It's possible to have matched files, missing files, files that have moved in the set, and to find new files not in the set. Hashdeep can report all of these conditions. It can even spot hash collisions, when an input file matches a known file in one hash algorithm but not in others. The results are displayed in an audit report.

As is pointed out in the snippet above, a tool like hashdeep also has the benefit of being able to detect files that, for example, have been deleted through normal means. This is something that file-system level data integrity checking simply cannot do, and which in some situations is highly useful as a feature.
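As a rough illustration of the hashdeep workflow (the directory and file names below are placeholders), you record a baseline once and then audit the tree against it later:

# Record a baseline of hashes (MD5 and SHA-256 by default) for everything
# under /data, using relative paths so the audit is location-independent.
hashdeep -r -l /data > /root/data-baseline.txt

# Later: audit the tree against the baseline, reporting matched, moved,
# new and missing files.
hashdeep -r -l -a -vv -k /root/data-baseline.txt /data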

a CVn

Posted 2016-01-20T20:42:41.063

Reputation: 26 553


With Btrfs, just a couple of days back I sent a patch to dump csums: http://www.spinics.net/lists/linux-btrfs/msg51256.html. You can download the patch and apply it. Let me know if you run into any issues.

Usage:

btrfs inspect-internal dump-csums /btrfs/50gbfile /dev/sda4
csum for /btrfs/50gbfile dumped to /btrfs/50gbfile.csumdump

See it in action here

Edit: The latest patch can be found here: https://patchwork.kernel.org/patch/9696379/ with a slight CLI change: it uses "btrfs inspect-internal dump-csum" instead of "dump-csums".

btrfs inspect-internal dump-csum /btrfs/filepath /dev/name

lakshmipathi

Posted 2016-01-20T20:42:41.063

Reputation: 299

dump-csums should be dump-csum – Jodka Lemon – 2017-06-10T12:39:48.943

You may want to follow up with your patch on: https://www.spinics.net/lists/linux-btrfs/msg41687.html – Grzegorz Wierzowiecki – 2017-09-04T16:45:50.390