
I'm looking for a way to efficiently manage and leverage file-level checksums for all files in a filesystem over time.

Goals:

  • Configurable, fast refresh: only re-checksum large files when other criteria indicate a likely change (file size, timestamp, first and last blocks, etc.). I say "configurable" because some use cases can't trust that timestamps haven't been tampered with.

  • Fast query for a specific checksum across the whole filesystem (in other words, answering the question "Do I already have this file?")

  • A way to compare the data across filesystems (either natively within the solution, or via a machine-readable export so that a comparison could be scripted)

  • Support for multiple hashes

  • Duplicate file reporting (I don't expect the solution to walk me through an interactive deduplication session; machine-readable report output would be fine)

  • Nice-to-have: a way to optionally (re-)generate traditional checksum files in each directory ("CHECKSUM", "MD5SUM", or similar) so that subdirectories exposed via FTP or the web can easily consume the checksums (see the sketch just after this list)
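To illustrate the nice-to-have, this is roughly the kind of per-directory output I mean. It is only a sketch: a real tool would pull digests from its cache rather than re-hashing everything, and the md5sum-compatible "MD5SUM" file name and format are just one example.

```python
#!/usr/bin/env python3
"""Sketch: (re)generate an md5sum-compatible MD5SUM file in one directory.

Illustrative only; a real solution would take the digests from its cache
instead of re-hashing every file the way this does.
"""
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_md5sum_file(directory: Path) -> None:
    """Write an MD5SUM file listing every regular file in the directory."""
    lines = []
    for entry in sorted(directory.iterdir()):
        if entry.is_file() and entry.name != "MD5SUM":
            # Two spaces between digest and name, as `md5sum -c` expects.
            lines.append(f"{md5_of(entry)}  {entry.name}")
    (directory / "MD5SUM").write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_md5sum_file(Path("."))
```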

The key idea is for the hashes to be cached in such a way that they can be both quickly updated and quickly queried.
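To make that concrete, here is a minimal sketch of the sort of cache I have in mind, not a proposed implementation. It assumes a SQLite database; the schema, the SHA-256 choice, and the size+mtime "looks changed" heuristic are all placeholders, and the heuristic is exactly the part that would need to be configurable (e.g. an option to always re-hash).

```python
#!/usr/bin/env python3
"""Sketch of a checksum cache: re-hash only on apparent change, query by digest.

Everything here (schema, column names, the size+mtime heuristic) is illustrative.
"""
import hashlib
import os
import sqlite3
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path   TEXT PRIMARY KEY,
    size   INTEGER,
    mtime  REAL,
    sha256 TEXT
);
CREATE INDEX IF NOT EXISTS idx_sha256 ON files (sha256);
"""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def refresh(db: sqlite3.Connection, root: Path, trust_mtime: bool = True) -> None:
    """Walk `root`; re-hash a file only if it looks changed, or always if
    trust_mtime is False (the 'configurable' part)."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            p = Path(dirpath) / name
            st = p.stat()
            row = db.execute(
                "SELECT size, mtime FROM files WHERE path = ?", (str(p),)
            ).fetchone()
            unchanged = row is not None and row == (st.st_size, st.st_mtime)
            if trust_mtime and unchanged:
                continue  # cheap criteria say the file is the same; skip re-hashing
            db.execute(
                "INSERT OR REPLACE INTO files (path, size, mtime, sha256) "
                "VALUES (?, ?, ?, ?)",
                (str(p), st.st_size, st.st_mtime, sha256_of(p)),
            )
    db.commit()


def have_digest(db: sqlite3.Connection, digest: str):
    """Answer 'do I already have this file?' by digest."""
    rows = db.execute("SELECT path FROM files WHERE sha256 = ?", (digest,))
    return [path for (path,) in rows]


def duplicates(db: sqlite3.Connection):
    """Machine-readable duplicate report: digest -> list of paths sharing it."""
    rows = db.execute(
        "SELECT sha256 FROM files GROUP BY sha256 HAVING COUNT(*) > 1"
    ).fetchall()
    return {digest: have_digest(db, digest) for (digest,) in rows}


if __name__ == "__main__":
    conn = sqlite3.connect("checksums.sqlite")
    conn.executescript(SCHEMA)
    refresh(conn, Path("."))
    for digest, paths in duplicates(conn).items():
        print(digest, paths)
```

The indexed digest column is what makes the "do I already have this file?" query and the duplicate report cheap; the per-file size/mtime check is what keeps the refresh fast. Cross-filesystem comparison could then just be a dump of (path, size, sha256) from each database.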

Royce Williams
  • And in my ideal world, the OpenZFS project would "simply" add "lazy" (best effort) file-level checksumming to ZFS, and make the checksum values globally queryable. ;) – Royce Williams Sep 01 '17 at 00:30
  • Have you looked into something like iRODS (https://irods.org/)? It provides a query interface for all the metadata stored in a database. Of course, the first step is to ingest all your files into the iRODS system and create the DB. The checksum value along with other metadata can be stored and queried. – Tux_DEV_NULL Sep 01 '17 at 08:07

0 Answers