Recover a bzip2 file

5

The popular .bz2 compression format is used to store all sorts of things. One of the more interesting features is that it consists of several independently decompressible blocks, allowing recovery of partial segments if the file is damaged. The bzip2recover program, included with bzip2, allows recovery of blocks from a bzip2 archive. In this challenge, you will implement a simple clone of bzip2recover.

The basic idea is to scan a bitstream for a particular magic number, and output blocks delimited by this magic number to separate files. Read on for gory details.


Technically, bzip2 files are a bitstream, with every group of 8 bits being stuffed into a byte. The first bit is placed in the most significant bit of a byte, with the eighth bit going in the least significant bit of a byte. Files begin with the header "BZh", and a digit "N" from 1 to 9 indicating the compression level (9 is the most common). Then the bitstream begins.

The bitstream is split up into blocks. Each block encodes N*100KB of uncompressed text (e.g. 900KB of text for compression level 9). Blocks can begin basically anywhere in the bitstream, and do not necessarily begin on 8-bit boundaries. Blocks begin with a magic number, which is set to the 48-bit sequence 0x314159265359 (in ASCII, "1AY&SY"). For example, the sequence of bytes "73 14 15 92 65 35 90" contains the magic number, offset by 4 bits within the byte.

Your goal will be to take a bzip2 file on stdin, and output each detected BZip2 block to a separate decompressible file. Specifically, your program does the following:

  • Read and skip the bzip2 header ("BZh" + N)
  • Scan the bitstream for the magic sequence
  • Copy the bitstream between two consecutive magic sequences to a new file, with a new BZip2 header and block header. Note that this may require realigning the bitstream to start at a byte boundary, and padding the last byte.

You don't have to care about parsing the blocks, and you can assume that a magic sequence always starts a block (i.e. that it isn't in the middle of a block).

How you name the output files is up to you. The only requirement is that they be clearly sequential somehow (so, e.g. "a b c ... z aa ab" or "1 2 3 4 5 .. 9 10 11 12" are both acceptable).


For a test case, you may use this file from the IPSC 2014 problem-solving contest. The file is a bzip2 file with four blocks. The file as a whole actually decompresses to another bzip2 file with 162 complete blocks. Therefore, you may use these block counts to validate that your solution is getting all of the blocks. (Note that the file actually decompresses even further into more bzip2 files, but you probably don't want to decompress it all the way).


This is code golf, with a twist: your score is the number of bytes in the bz2-compressed version of your source code. (I'm aware this may make submissions larger, but these are the rules!). Since bzip2 compressors may differ in performance, provide a base64-encoded version of the compressed file to prove it's size. (You may therefore be able to obtain a better score by optimizing your program for bzip2, though I don't necessarily expect anyone to do this). Standard loopholes apply.

nneonneo

Posted 2014-07-16T05:58:12.933

Reputation: 11 445

Answers

3

Python 3 -- 375(534 chars)

It can process the example file d2.in in about 10 seconds, using Pypy3. With CPython 3, it will cost 60 seconds.

It's interesting and weird that some old golf tricks would make the score worse! For example, replacing <space>+<space> by + would increase the score from 375 to 384; replacing all range by R will also increase the score to 384.

Golfed

import sys
from collections import deque
def T(B):
 for i in range(0,len(B),8):
  x=0
  for j in range(min(8,len(B)-i)): x |= B[i + j] << (7-j)
  yield x
V=sys.argv
C=F=0
B=bytearray()
I=open(V[1],'rb')
H=I.read(4)
P=deque([(x>>(7-j)) & 1 for x in b'1AY&SY' for j in range(8)])
Q=deque([0]*48)
for x in I.read():
 for j in range(8):
  b=(x>>(7-j)) & 1
  B.append(b)
  Q.popleft()
  Q.append(b)
  if Q!=P: continue
  C+=1
  M,B=B[:-48],B[-48:]
  if F:F.write(bytes(T(M)))
  F=open(V[2] + str(C),'wb')
  F.write(H)
F.write(bytes(T(B)))

Base64

QlpoOTFBWSZTWSoMWQ0AAMLfgAAQYf901zlibSo/t//kMAFaCIaCQANTTTKMnlGg8oeRpPaoNDDAAAAAaAAAAABKaSGmmpplD0jQaAABo00ZNqLfKsdHBd3QgkQ2wY3RfYpMMjPpsyit5Ee6+k6E6ObzlFT2vmgbSKPkni7ZgDkcwTRiLFBmNCuEX/HtOC8KJ+Rg51SkMCiAGXZpXbvaK+DzMLvaEhAM8reCB8HFAsNYfRghS9+i5/IpLrDzGZVhsbVOpQzHIIVZ7zCCgDkcElOMvTuvfw2bruKz2WQuvEg99gDa8T3Kw2GdyMNjEVFVVQj4sU1MaOTUcwQROwt86LfUZAOJE6HqnB60CFO38NJmpifx+trcQ8x5alqRiEGoKC2wJpPkCXpT7fbtFDbLCLQaplK23qMTC5gqGwCghjN3zEtwyKhCz3kixqBr7gya6yPG03rMOGY9hlmV3tS0W/SdC2taUpCoNSEaEnUXckU4UJAqDFkN

Ray

Posted 2014-07-16T05:58:12.933

Reputation: 1 946

Obviously, I'm not sure these will help, but here's some standard golf tricks to try out: Use from collections import*. Use semicolon separation instead of newline within a for loop in any line that does not begin with a statement (if, for, etc.). You only use P once, so don't assign it to a variable. Remove spaces around all of your operators - particularly in the first for j in... line, but also around the & in the definition of P and in the definition of b. – isaacg – 2014-07-28T03:02:16.127