
I'm looking for a compression format that supports being tailed, meaning you don't have to read the entire file to get the last X uncompressed bytes. Is this possible with any of the formats like bzip2, xz, lzma, etc.?

I once coded something using gzip that could do this. At a really high level, it concatenated multiple gzip blocks together, and I had a util that could seek backwards from the end of the file until it found where the last block started. These files were fully readable by the standard gzip utilities, but I'm hoping there's something a little more standardized available.
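
Roughly, it looked something like the sketch below (simplified Python, not the code I actually had; the file name is made up, and scanning backwards for the gzip magic bytes is a heuristic that can false-positive inside compressed data):

    import gzip
    import zlib

    LOG = "app.log.gz"            # hypothetical log file
    GZIP_MAGIC = b"\x1f\x8b\x08"  # start of a gzip member (deflate method)

    def append_chunk(data: bytes):
        """Append one chunk of log data as its own gzip member.
        Standard gzip/zcat still read the concatenated members fine."""
        with gzip.open(LOG, "ab") as f:
            f.write(data)

    def tail_last_member() -> bytes:
        """Seek backwards for the start of the last gzip member and
        decompress only that member."""
        with open(LOG, "rb") as f:
            raw = f.read()        # in practice, read backwards in blocks
        pos = raw.rfind(GZIP_MAGIC)
        while pos > 0:
            try:
                return gzip.decompress(raw[pos:])
            except (gzip.BadGzipFile, EOFError, zlib.error):
                # false positive or truncated member: keep scanning backwards
                pos = raw.rfind(GZIP_MAGIC, 0, pos)
        return gzip.decompress(raw)   # fall back to the whole file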

The ultimate purpose is log files that I can write out compressed and then tail (even when they haven't been fully written, i.e. streaming) without having to wait for the whole thing to be read from disk or over the network.

phemmer

2 Answers


gzip has a --rsyncable option which does essentially the same thing. The non-standard part would be the gzip-block-aware "ztail" utility, but it seems like you've dealt with that already.

the-wabbit
    Of course that option effectively limits you to something like -0.5 compression level, even if you specify -9. – psusi Sep 19 '11 at 13:59
  • Where are you getting the version of gzip that supports this? If this was an option at one point, it appears to have been removed. – phemmer Sep 19 '11 at 18:13
  • The Debian-based distros do have it - this is from Ubuntu 10.04: gzip -V reports "gzip 1.3.12", and gzip -h | egrep rsync shows "--rsyncable Make rsync-friendly archive". – the-wabbit Sep 20 '11 at 12:51
  • Apparently, some other distro maintainers (e.g. Fedora) seem to have included the patches as well. And there is a patch for an oldish gzip here: http://www.samba.org/netfilter/diary/gzip.rsync.patch which might apply to a more recent version with slight modifications, if you really need to self-compile. – the-wabbit Sep 20 '11 at 13:02
  • BTW, discussions suggest that the impact on compression is rather negligible (within 2-3%), although the mileage for a specific dataset might vary. If you need "tailable" compression with adaptive algorithms like deflate, there is hardly any way around resetting the algorithm every now and then - of course this will induce a compression efficiency hit (see the sketch after these comments). – the-wabbit Sep 20 '11 at 13:11
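
To make that "reset every now and then" idea concrete, here is a rough sketch (plain zlib in Python; the names are illustrative, and it writes a raw deflate stream rather than a real .gz file, so standard gzip tools won't read it). A full flush resets the compressor's dictionary and emits a byte-aligned empty stored block, so a reader can resynchronize at the last such point instead of decompressing from the beginning:

    import zlib

    SYNC = b"\x00\x00\xff\xff"   # trailer of the empty stored block a full flush emits

    def compress_with_resets(chunks, path, every=1000):
        """Write a raw deflate stream from an iterable of bytes chunks,
        doing a full flush (dictionary reset) every `every` chunks so a
        tail reader can resynchronize."""
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # -15 = raw deflate, no header
        with open(path, "wb") as out:
            for i, chunk in enumerate(chunks, 1):
                out.write(comp.compress(chunk))
                if i % every == 0:
                    out.write(comp.flush(zlib.Z_FULL_FLUSH))
            out.write(comp.flush())                      # finish the stream

    def tail_from_last_reset(path) -> bytes:
        """Decompress only the data written after the last full-flush point.
        Scanning for the 00 00 ff ff marker is a heuristic: the pattern can,
        rarely, also occur inside compressed data."""
        data = open(path, "rb").read()
        pos = data.rfind(SYNC)
        start = pos + len(SYNC) if pos != -1 else 0
        return zlib.decompressobj(-15).decompress(data[start:])

The compression hit mentioned above comes from those resets: everything after a flush point is compressed without reference to earlier data.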

FWIW: I've developed a command line tool built upon zlib's zran.c source code which creates indexes for gzip files: https://github.com/circulosmeos/gztool

It can make a continuous tail of a gzip file with the -T option, or just produce a tail of the last contents and stop, with -t (many other options are available).

Note that for any of these actions, gztool will create an index file interleaved with the action itself.

Index creation can be interrupted at any time, and the index reused and/or completed later. And since gztool can simply be told to extract data from any point in the file, creating the index interleaved with that action, there's never time lost when using it.
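
For example, driving it from a script might look something like this (a rough sketch; the file name here is made up, and only the -t and -T options mentioned above are used):

    import subprocess

    LOG = "app.log.gz"   # hypothetical compressed log file

    # Print the current tail of the compressed log and exit;
    # gztool builds/updates its index as a side effect.
    subprocess.run(["gztool", "-t", LOG], check=True)

    # Or follow the file continuously (like `tail -f`) as new
    # compressed data is appended.
    subprocess.run(["gztool", "-T", LOG], check=True)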

circulosmeos