
I have a large number of systems (100s) managed by a small group of people whose membership has changed over time. Each system is installed from a base image (whose version varies with the age of the installation) and is then customised (forked) over time in various ways according to the needs of the client.

I have a copy of each version of the install image. Over 90% of the install image is the same between versions. Customisations are usually less than 3%.

I need to find out which versions are installed and what customisations have been made since the install.

Due to bandwidth constraints, I can't do a network diff or an rsync --dry-run over the network*.

However, I envisage being able to run a script over each install image, and send that as a database to each system to compare with its own filesystem and report back - like a "fingerprint", if you will.

The "fingerprint" (filesystem tree + checksum for each file & folder) would be limited to the fileset that are modifiable (and not /proc, /sys, /tmp, pipes, sockets, etc.).

The "fingerprint" can't be an MD5 of the filesystem because one change would result in a different fingerprint and we can't be sure which files may have been customised.

I'm looking for a utility that will report 2 things (a rough sketch of what I mean follows this list):

  1. Suggest which version best matches the filesystem as it currently stands from a database of filesystem "fingerprints" (metadata of tree structure + file & folder checksum), and
  2. List which files/folders have changed (customised) from that version, including new files and deleted files.
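
A rough sketch of those two reports, assuming the tab-separated manifests from the sketch above (and that the system's manifest and each install image's manifest were generated with the same root, so paths line up); the "fewest differing entries wins" heuristic is my own naive assumption:

    #!/usr/bin/env python3
    """Sketch: pick the best-matching base-image fingerprint, then list customisations.

    Usage: compare.py <system-manifest> <version-manifest>...
    Manifest lines are "digest<TAB>kind<TAB>path", as in the sketch above.
    """
    import sys

    def load_manifest(path):
        entries = {}
        with open(path) as f:
            for line in f:
                digest, kind, name = line.rstrip("\n").split("\t", 2)
                entries[name] = (kind, digest)
        return entries

    def diff(system, base):
        changed = [p for p, v in system.items() if p in base and base[p] != v]
        added   = [p for p in system if p not in base]
        deleted = [p for p in base if p not in system]
        return changed, added, deleted

    if __name__ == "__main__":
        system_manifest, *version_manifests = sys.argv[1:]
        system = load_manifest(system_manifest)

        # 1. best match = version with the fewest differing entries
        scored = []
        for vm in version_manifests:
            c, a, d = diff(system, load_manifest(vm))
            scored.append((len(c) + len(a) + len(d), vm, (c, a, d)))
        score, best, (changed, added, deleted) = min(scored)
        print(f"best match: {best} ({score} differences)")

        # 2. customisations relative to that version
        for label, paths in (("changed", changed), ("new", added), ("deleted", deleted)):
            for p in sorted(paths):
                print(f"{label}\t{p}")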

Additionally, it would be good if I could create new databases from existing ones so that I can take information from customisations to make new versions (e.g. Version 2.0.3-withmodX).

I've considered:

  • Backup utilities - they presume that versions have a 1:1 linear progression per client
  • Image management systems - tend to presume that images go server->client with only known customisations (e.g. new files, specific config folders), whereas we want the information to flow the other way: client (referencing the database)->server.

I could, perhaps, use git in some way to generate a '.git' database of the filesystem for each version and then send multiple .git databases to each system to compare against (see the sketch after this list), so that:

  1. Least number of git status lines = version.
  2. git status output against version = customisations.
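
A sketch of that idea, assuming each version's '.git' database has been copied onto the target system and that "fewest git status lines" is an adequate heuristic for picking the version:

    #!/usr/bin/env python3
    """Sketch: score candidate .git databases against the live filesystem.

    Each candidate is a .git directory built from an install image and copied
    to the target; /proc, /sys, etc. would need to be excluded via .gitignore.
    """
    import subprocess
    import sys

    def status_lines(git_dir, work_tree="/"):
        out = subprocess.run(
            ["git", "--git-dir", git_dir, "--work-tree", work_tree,
             "status", "--porcelain"],
            capture_output=True, text=True, check=True).stdout
        return [line for line in out.splitlines() if line]

    if __name__ == "__main__":
        candidates = sys.argv[1:]   # e.g. /var/tmp/v2.0.1.git /var/tmp/v2.0.3.git
        scores = {gd: status_lines(gd) for gd in candidates}
        best = min(scores, key=lambda gd: len(scores[gd]))
        print(f"best match: {best} ({len(scores[best])} changed paths)")
        for line in scores[best]:
            print(line)             # the customisations relative to that version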

Is there such a "fingerprint"-ing utility for filesystems or is there some utility that will make this easier to build?

*although I'm wondering if rsync can output a database of meta-information which could be used to build such a tool easily.

1 Answer


You want to describe the ancestry of hundreds of disk images, identify arbitrary fuzzy changes, and are bandwidth limited? Tricky.

Previously on Server Fault, comparison of disk images brought up cmp and rsync. I'll add virt-diff, and VCS (probably git). You won't like any of them.

A checksum of the whole disk image (sha256sum, md5sum) you discounted, as you want a file-level diff. It is still a useful identifier for an image once you have identified exactly which one you want.

The UUID and any label on a file system are visible with lsblk --fs. Useful for identifying origin, but not any changes. However, I will wager neither was changed when the system was installed.

cmp on disk images is a byte comparison of the file system. You won't see file level diffs. Changes as minor as churn in /tmp will make every image different.

rsync on mounted file systems will show changed files. It will also do a stupid amount of I/O: a typical Linux root fs has hundreds of thousands of inodes. You don't have the IOPS to find the delta against hundreds of other file systems, not on systems in use.

virt-diff will find differences in files in disk images. You would reference a disk image or snapshot not in use, such as a full backup on a secondary server. This backup is bandwidth limited, not IOPS limited. However, you said you were bandwidth limited.

VCSes like git were not designed to preserve arbitrary system files, including permissions and special files; etckeeper has hacks to do so. VCSes are also less useful when the ancestry is not known, because their data structures follow how the user has branched.

You can do a deduplication report on arbitrary objects in git repos by looking at packfiles. The problems here are tooling and scale. verify-pack is a low-level plumbing command, not easy to use for this purpose. Doing this at a per-file level would mean analyzing millions of blobs, which does not scale. Even looking at how disk images as blobs are packed would get slow.
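
For illustration only, a sketch of the crude sort of report the plumbing makes possible, assuming verify-pack -v's documented per-object lines ("sha type size size-in-pack offset [depth base-sha]"); I would not run this at your scale:

    #!/usr/bin/env python3
    """Sketch: rough deduplication summary from `git verify-pack -v`.

    Objects stored as deltas (lines carrying a depth and base sha) are the
    "deduplicated" ones; everything else is stored whole.
    """
    import subprocess
    import sys

    def pack_summary(idx_path):
        out = subprocess.run(["git", "verify-pack", "-v", idx_path],
                             capture_output=True, text=True, check=True).stdout
        whole = deltas = 0
        for line in out.splitlines():
            fields = line.split()
            # object lines start with a 40-char sha; skip the summary lines
            if len(fields) >= 5 and len(fields[0]) == 40 and fields[3].isdigit():
                if len(fields) >= 7:
                    deltas += int(fields[3])   # stored as a delta against a base
                else:
                    whole += int(fields[3])    # stored whole
        return whole, deltas

    if __name__ == "__main__":
        whole, deltas = pack_summary(sys.argv[1])  # path to a pack .idx file
        print(f"stored whole: {whole} bytes, stored as deltas: {deltas} bytes")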


I suggest forgetting the automatic script and having a person do it.

Identify the useful images among the base and customized ones: the use cases that are worth keeping around as base images.

Set and document unique UUIDs and labels on these. Checksum and archive the images for future use.


Not directly related, but in the future try separating system package state and user data.

Consider a read-only root, with configuration and data on different file systems or overlays. Possibly /home on NFS or /tmp on tmpfs. The base image is trivial to identify as it is untouched. Changes to the image can be a defined process: mount r/w, make changes, snapshot.
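
A minimal sketch of that layout as illustrative /etc/fstab entries; the device names, mount points and overlay paths are assumptions, not a recommendation for your particular images:

    # read-only base image; everything mutable lives elsewhere
    /dev/sda2        /      ext4     ro,defaults        0 1
    /dev/sda3        /data  ext4     rw,defaults        0 2
    tmpfs            /tmp   tmpfs    rw,nosuid,nodev    0 0
    nfsserver:/home  /home  nfs      rw,vers=4          0 0
    # or an overlay, so local changes land in a separate upper layer that is easy to diff
    overlay  /etc  overlay  lowerdir=/etc,upperdir=/data/etc-upper,workdir=/data/etc-work  0 0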

John Mahowald
  • Heh. Yes, we have a read-only root and, in most cases, separate software, config, and data. The problem is that these always needed to be switched off to make changes so, while they protected us from unwanted changes, they didn't protect us from on-the-fly developers. These days I'd use a CoW filesystem, but these installations predate CoW. – tudor -Reinstate Monica- Oct 17 '19 at 04:19
  • Then this is more of a process question than a technical one. Have a way for users to submit disk images, or ideas for how to change them that someone can script and snapshot. – John Mahowald Oct 17 '19 at 12:52
  • Yes, but to do that one still needs a technical "fingerprinting" solution. And that only solves new cases, not the current ones. – tudor -Reinstate Monica- Oct 20 '19 at 00:12