Git Metadata Integrity Checks?


I was wondering today how well git ensures the integrity of its metadata and I am a bit surprised by what I encountered. I used the following simple setup for testing:

  • Two working repositories, called x and y
  • A bare repository, called xy.git

So, initially x and y are pushing to and pulling from xy.git, and everything works just fine. Now, let's say one of the metadata objects (the object files under objects/ in the bare repository) in xy.git becomes corrupted for whatever reason (choose your favorite random incident).
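
Roughly, a setup like this should reproduce it (repository names as above; which loose object you damage in the last step is arbitrary):

    # create the bare repository and two working clones
    git init --bare xy.git
    git clone xy.git x
    git clone xy.git y

    # create some history in x and push it to the bare repository
    cd x
    echo "hello" > file.txt
    git add file.txt
    git commit -m "initial commit"
    git push origin master
    cd ..

    # simulate random corruption: overwrite a few bytes in one loose object
    obj=$(find xy.git/objects -type f | head -n 1)
    chmod u+w "$obj"
    printf 'garbage' | dd of="$obj" bs=1 seek=10 conv=notrunc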

I had actually assumed that something would break at the next push or pull, but to my surprise, everything appeared to work fine. More commits, more pushing and pulling, no problems. The first time corruption was reported was when I tried to clone another working repository from the bare repository, which left my clone in an unusable state.

Now, I thought, it is not that bad: thanks to Git's architecture I can simply dump the bare repository in the worst case and recreate it, with all history, from one of my working copies. But no. Without any notice, the corrupted file had made its way through pulls into the working repositories, making it impossible to clone a new bare repository from them as well.

This happens not only when I start with a corrupted file in the bare repository; it is also possible to introduce a corrupted file from a working repository into the bare one this way.

Sure, one might be able to fix this by other means, but I'm still surprised (and a bit concerned) by how easy it appears to be to mess up the repository for everyone working with it, especially since the error can remain unnoticed until the next time someone tries to clone. Shouldn't there be checks against this, somewhere, somehow?

Is anyone here willing to check whether this is reproducible? I experimented with Git version 2.7.4.

Any advice on how to check against such corruption is highly welcome.

Wanderer

Posted 2018-09-28T13:11:50.093

Reputation: 33

Answers


I had actually assumed that something would break at the next push or pull, but to my surprise, everything appeared to work fine. More commits, more pushing and pulling, no problems.

Each object – file, commit, etc. – is named after the SHA1 hash of its contents (plus a small header). Whenever an individual object is read into memory for use, the data is hashed and compared with the object's name; any mismatch will cause an error to be shown.
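
You can check this naming scheme yourself; for example (the string here is arbitrary, and sha1sum is the usual coreutils tool):

    # Git's name for a blob containing "hello\n" ...
    echo "hello" | git hash-object --stdin

    # ... is simply the SHA1 of the header "blob 6\0" followed by the contents;
    # both commands print the same hash
    printf 'blob 6\0hello\n' | sha1sum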

However, most operations don't need to read the whole repository into memory. In general, commands read only the bare minimum they need – of course, you would notice the problem if you tried to check out a broken commit or diff against it, but operations such as creating a commit don't care about previous objects at all. Even pushing needs just a small selection of objects (as delta bases for 'thin' packs), because both peers know what the other side already has.

(This optimization is a direct result of the snapshot-based layout. For example, git add doesn't need to delta against the old files, because it simply builds up a new snapshot as it goes. Then git commit turns this snapshot into commit/tree objects without knowing anything about the previous commit except its ID.)
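
To see this laziness in action with the corrupted bare repository from the question (a sketch; replace the placeholder with the ID of whichever object was damaged – it is the two-character directory name plus the file name under objects/):

    # day-to-day work in x never has to read the damaged object in xy.git
    cd x
    echo "more" >> file.txt
    git commit -am "another commit"   # succeeds
    git push                          # usually succeeds as well
    cd ..

    # but asking the bare repository to actually read that object fails
    git --git-dir=xy.git cat-file -p <damaged-object-id>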

This happens not only when I start with a corrupted file in the bare repository; it is also possible to introduce a corrupted file from a working repository into the bare one this way.

First, keep in mind that a same-computer, same-filesystem clone doesn't pack and transfer objects – it simply hardlinks the files, in order to save both space and time. You have to explicitly opt out of this by cloning from a file:// URL instead of a simple path.

However, a clone over SSH or HTTPS (or the aforementioned file:// URLs) actually reads and writes the object data in order to build up the transfer pack, so any corrupted object that was supposed to be part of the transfer will abort the process.
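
So whether a fresh clone even looks at the damaged object depends on how the source is spelled; something like this (the paths are hypothetical):

    # plain local path: objects are hardlinked/copied as-is,
    # so the damaged object can be carried over silently
    git clone /srv/git/xy.git broken-copy

    # file:// URL (likewise ssh:// or https://): objects are read and repacked
    # for the transfer, so a damaged object that is needed aborts the clone
    git clone file:///srv/git/xy.git checked-copy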

If you somehow manage to push a corrupted object to a remote server – with it slipping through both the local packing and the remote unpacking – that's a bit unusual (especially after the 2013 git.kde.org story) and I'd raise that concern on the Git mailing list.

(Don't worry that the documentation describes transfer.fsckObjects as disabled by default – leaving it disabled only skips validation of object structure and syntax, not the hash verification.)
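
If you also want the structural checks, the corresponding settings can be enabled explicitly (standard Git configuration keys):

    # verify objects received by fetch/clone and by push, respectively
    git config --global fetch.fsckObjects true
    git config --global receive.fsckObjects true

    # or cover both with the umbrella setting mentioned above
    git config --global transfer.fsckObjects true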

Shouldn't there be checks against this, somewhere, somehow?

A full check can be done manually using the git fsck command. It's a good idea to run it from a cron job on your 'central' repositories. The full check is not automated because re-checking the complete repository on every commit/push/pull would take an unreasonable amount of time for all but the smallest Git repositories.
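
For example (the repository path is a placeholder; cron mails any output to the crontab's owner, assuming mail delivery is set up):

    # full integrity check of a bare repository, run by hand
    git --git-dir=/srv/git/xy.git fsck --full

    # or as a nightly crontab entry (crontab -e)
    0 3 * * * git --git-dir=/srv/git/xy.git fsck --full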

A partial check happens implicitly whenever Git decides to run the git gc --auto background maintenance. This maintenance reads all recently created 'loose' objects and archives them into a .pack file, so verification of those objects comes for free. (However, instead of running on a preset schedule, it runs whenever you have more loose objects than the configured limit.)
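
If you want to see or tune when that happens, the relevant knob is gc.auto (shown here with its documented default):

    # threshold of loose objects that triggers automatic packing; unset means
    # the default of 6700
    git config gc.auto

    # make the implicit verification kick in earlier
    git config gc.auto 2000

    # or run the maintenance by hand with a plain git gc
    git gc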

user1686

Posted 2018-09-28T13:11:50.093

Reputation: 283 655