Git Metadata Integrity Checks?


I was wondering today how well git ensures the integrity of its metadata and I am a bit surprised by what I encountered. I used the following simple setup for testing:

  • Two working repositories, called x and y
  • A bare repository, called xy.git

So, initially x and y are pushing to and pulling from xy.git, and everything works just fine. Now, let's say one of the metadata objects (the object files under objects/ in the bare repository) in xy.git becomes corrupted for whatever reason (choose your favorite random incident).
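
Roughly, a setup like this should reproduce it (repository names as above; which loose object you damage in the last step is arbitrary):

    # create the bare repository and two working clones
    git init --bare xy.git
    git clone xy.git x
    git clone xy.git y

    # create some history in x and push it to the bare repository
    cd x
    echo "hello" > file.txt
    git add file.txt
    git commit -m "initial commit"
    git push origin master
    cd ..

    # simulate random corruption: overwrite a few bytes in one loose object
    obj=$(find xy.git/objects -type f | head -n 1)
    chmod u+w "$obj"
    printf 'garbage' | dd of="$obj" bs=1 seek=10 conv=notrunc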

I had actually assumed that something would break at the next push or pull, but to my surprise, everything appeared to work fine. More commits, more pushing and pulling, no problems. The first time corruption was reported was when I tried to clone another working repository from the bare repository, which left my clone in an unusable state.

Now, I thought, it is not that bad: thanks to Git's architecture I can simply dump the bare repository in the worst case and recreate it, with all history, from one of my working copies. But no. Without any notice, the corrupted file had made its way through pulls into the working repositories, making it impossible to clone a new bare repository from them as well.

This happens not only when I start with a corrupted file in the bare repository; it is also possible to introduce a corrupted file from a working repository into the bare one this way.

Sure, one might be able to fix this by other means, but I'm still surprised (and a bit concerned) by how easy it appears to be to mess up the repository for everyone working with it, especially since the error can remain unnoticed until the next time someone tries to clone. Shouldn't there be checks against this, somewhere, somehow?

Is anyone here willing to check whether this is reproducible? I experimented with Git version 2.7.4.

Any advice on how to check against such corruption is highly welcome.

Wanderer

Posted 2018-09-28T13:11:50.093

Reputation: 33

Answers


I had actually assumed that something would break at the next push or pull, but to my surprise, everything appeared to work fine. More commits, more pushing and pulling, no problems.

Each object – file, commit, etc. – is named after the SHA1 hash of its contents (plus a small header). Whenever an individual object is read into memory for use, the data is hashed and compared with the object's name; any mismatch will cause an error to be shown.
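
You can check this naming scheme yourself; for example (the string here is arbitrary, and sha1sum is the usual coreutils tool):

    # Git's name for a blob containing "hello\n" ...
    echo "hello" | git hash-object --stdin

    # ... is simply the SHA1 of the header "blob 6\0" followed by the contents;
    # both commands print the same hash
    printf 'blob 6\0hello\n' | sha1sum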

However, most operations don't need to read the whole repository into memory. In general, commands read only the bare minimum they need – of course, you would notice the problem if you tried to check out a broken commit or diff against it, but operations such as creating a commit don't care about previous objects at all. Even pushing needs just a small selection of objects (as delta bases for 'thin' packs), because both peers know what the other side already has.

(This optimization is a direct result of the snapshot-based layout. For example, git add doesn't need to delta against the old files, because it simply builds up a new snapshot as it goes. Then git commit turns this snapshot into commit/tree objects without knowing anything about the previous commit except its ID.)
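
To see this laziness in action with the corrupted bare repository from the question (a sketch; replace the placeholder with the ID of whichever object was damaged – it is the two-character directory name plus the file name under objects/):

    # day-to-day work in x never has to read the damaged object in xy.git
    cd x
    echo "more" >> file.txt
    git commit -am "another commit"   # succeeds
    git push                          # usually succeeds as well
    cd ..

    # but asking the bare repository to actually read that object fails
    git --git-dir=xy.git cat-file -p <damaged-object-id>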

This happens not only when I start with a corrupted file in the bare repository; it is also possible to introduce a corrupted file from a working repository into the bare one this way.

First, keep in mind that a same-computer, same-filesystem clone doesn't pack and transfer objects – it simply hardlinks the files, in order to save both space and time. You have to explicitly opt out of this by cloning from a file:// URL instead of a simple path.

However, a clone over SSH or HTTPS (or the aforementioned file:// URLs) actually reads and writes the object data in order to build up the transfer pack, so any corrupted object that was supposed to be part of the transfer will abort the process.
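
So whether a fresh clone even looks at the damaged object depends on how the source is spelled; something like this (the paths are hypothetical):

    # plain local path: objects are hardlinked/copied as-is,
    # so the damaged object can be carried over silently
    git clone /srv/git/xy.git broken-copy

    # file:// URL (likewise ssh:// or https://): objects are read and repacked
    # for the transfer, so a damaged object that is needed aborts the clone
    git clone file:///srv/git/xy.git checked-copy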

If you somehow manage to push a corrupted object to a remote server – with it slipping through both the local packing and the remote unpacking – that's a bit unusual (especially after the 2013 git.kde.org story) and I'd raise that concern on the Git mailing list.

(Don't worry that the documentation describes transfer.fsckObjects as disabled by default – leaving it disabled only skips validation of object structure and syntax, not the hash verification.)
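
If you also want the structural checks, the corresponding settings can be enabled explicitly (standard Git configuration keys):

    # verify objects received by fetch/clone and by push, respectively
    git config --global fetch.fsckObjects true
    git config --global receive.fsckObjects true

    # or cover both with the umbrella setting mentioned above
    git config --global transfer.fsckObjects true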

Shouldn't there be checks against this, somewhere, somehow?

A full check can be done manually using the git fsck command. It's a good idea to run it from a cron job on your 'central' repositories. The full check is not automated because re-checking the complete repository on every commit/push/pull would take an unreasonable amount of time for all but the smallest Git repositories.
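
For example (the repository path is a placeholder; cron mails any output to the crontab's owner, assuming mail delivery is set up):

    # full integrity check of a bare repository, run by hand
    git --git-dir=/srv/git/xy.git fsck --full

    # or as a nightly crontab entry (crontab -e)
    0 3 * * * git --git-dir=/srv/git/xy.git fsck --full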

A partial check happens implicitly whenever Git decides to run the git gc --auto background maintenance. This maintenance reads all recently created 'loose' objects and archives them into a .pack file, so verification of those objects comes for free. (However, instead of running on a preset schedule, it runs whenever you have more loose objects than the configured limit.)
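
If you want to see or tune when that happens, the relevant knob is gc.auto (shown here with its documented default):

    # threshold of loose objects that triggers automatic packing; unset means
    # the default of 6700
    git config gc.auto

    # make the implicit verification kick in earlier
    git config gc.auto 2000

    # or run the maintenance by hand with a plain git gc
    git gc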

user1686

Posted 2018-09-28T13:11:50.093

Reputation: 283 655