Thwart LZMA2 compression

11

Goal

Create a program or pair of programs that collectively disrupt and fix files with the intent of preventing LZMA2 from working effectively. The disrupt and fix routines must be reciprocal, so you can recover the original file exactly.

Targets

Compression Methods

  • Ubuntu/related: xz -kz5 <infile>
  • Windows: 7z.exe a -txz -mx5 <outfile> <infile>
  • Other: Use an LZMA2 compressor with compression level 5 that compresses the works of Shakespeare to 1570550 ± 100 bytes.

Scoring: the sum of the following (everything is measured in bytes; ls -l or dir it):

  • Program(s) size (whatever it takes collectively to reversibly "break"/fix the file)
  • Difference in size (absolute) between:
    • Raw collected works of Shakespeare and your modified (uncompressed) copy.
    • Raw photo and your modified (uncompressed) copy.
  • Difference in size or 0, whichever is greater, between:
    • Raw collected works of Shakespeare minus your modified, LZMA2 compressed copy.
    • Raw photo minus your modified, LZMA2 compressed copy.

Example

Poorly scoring, lazily-golfed, but compliant Python 2.x example:

import sys
x = 7919 if sys.argv[1] == 'b' else -7919
i = bytearray(open(sys.argv[2], 'rb').read())
for n in range(len(i)):
    i[n] = (i[n] + x*n) % 256
o = open(sys.argv[2]+'~', 'wb').write(i)

Running...

$ python break.py b pg100.txt 
$ python break.py f pg100.txt~ 
$ diff -s pg100.txt pg100.txt~~
Files pg100.txt and pg100.txt~~ are identical
$ python break.py b Glühwendel_brennt_durch.jpg 
$ python break.py f Glühwendel_brennt_durch.jpg~
$ diff -s Glühwendel_brennt_durch.jpg Glühwendel_brennt_durch.jpg~~
Files Glühwendel_brennt_durch.jpg and Glühwendel_brennt_durch.jpg~~ are identical
$ xz -kz5 pg100.txt~
$ xz -kz5 Glühwendel_brennt_durch.jpg~
$ ls -ln
-rw-rw-r-- 1 2092 2092     194 May 23 17:37 break.py
-rw-rw-r-- 1 2092 2092 1659874 May 23 16:20 Glühwendel_brennt_durch.jpg
-rw-rw-r-- 1 2092 2092 1659874 May 23 17:39 Glühwendel_brennt_durch.jpg~
-rw-rw-r-- 1 2092 2092 1659874 May 23 17:39 Glühwendel_brennt_durch.jpg~~
-rw-rw-r-- 1 2092 2092 1646556 May 23 17:39 Glühwendel_brennt_durch.jpg~.xz
-rw-rw-r-- 1 2092 2092 5589891 May 23 17:24 pg100.txt
-rw-rw-r-- 1 2092 2092 5589891 May 23 17:39 pg100.txt~
-rw-rw-r-- 1 2092 2092 5589891 May 23 17:39 pg100.txt~~
-rw-rw-r-- 1 2092 2092 3014136 May 23 17:39 pg100.txt~.xz

Score

  • = 194 + abs(5589891 − 5589891) + max(5589891 − 3014136, 0) + abs(1659874 − 1659874) + max(1659874 − 1646556, 0)
  • = 194 + 0 + 2575755 + 0 + 13318
  • 2,589,267 bytes. Bad, but doing nothing to the files yields a score of 4,635,153 bytes.
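
Equivalently, a minimal Python sketch of the same arithmetic (illustration only; the numbers are the ones from the listing above):

# Score = program size + |raw - modified| + max(raw - compressed modified, 0),
# summed over both test files.
def score(prog, raw_txt, mod_txt, xz_txt, raw_jpg, mod_jpg, xz_jpg):
    return (prog
            + abs(raw_txt - mod_txt) + max(raw_txt - xz_txt, 0)
            + abs(raw_jpg - mod_jpg) + max(raw_jpg - xz_jpg, 0))

print(score(194, 5589891, 5589891, 3014136, 1659874, 1659874, 1646556))  # 2589267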

Clarification

This is golf, so you are trying to minimize your score. I'm not sure if the comments are pointing out a legitimate hole in my scoring or if they are because I made it too complicated. In any case, you want the SMALLEST:

  • source code
  • difference between the uncompressed modified file and the original file (e.g. if you modify it by appending a trillion 0s to the end, your score just went up by a trillion bytes)
  • difference between the compressed modified file and the original file (i.e. the better your modified files still compress, the higher your score). A perfectly incompressible file that grows slightly, or not at all, when compressed will score 0.

Nick T

Posted 2014-05-23T22:53:03.150

Reputation: 3 197

2The trolling answer: Step 1 - work out how much free disk space you have, then divide that by the size of the file to get N. Step 2 - append the file to itself N times and append the number N. Step 3 - realize there's no space left to compress the file, but end up with an absolute difference in file sizes of several terabytes (or more).... [To reverse, read N from the end of the file and shrink the file to 1/Nth the size.] – MT0 – 2014-05-23T23:32:22.923

@MT0: Ah, I think the solution is that the differences should not be absolute. If your modified file is larger, that should subtract points. – Claudiu – 2014-05-23T23:35:03.260

@MT0 if you modify the file to make it a terabyte large, then your score will be 1 terabyte...pretty bad when you're trying to golf. – Nick T – 2014-05-24T00:55:45.923

@MT0 I added a clarification to the post, does that help? – Nick T – 2014-05-24T01:05:03.757

Oh that makes sense now. To be honest for some reason it is a strange scoring system. But anyways I have a good idea on how to do it, will post in a few hours – Claudiu – 2014-05-24T01:12:44.307

2One quibble: the compressor might make a larger file if it is especially incompressible. In this case you should be rewarded, not punished, no? – Claudiu – 2014-05-24T01:14:06.063

@Claudiu yes, I'm not sure how to word it to avoid abuse, where you could simply inflate the size of your modified file to get deductions. If you compare just the delta of compressed-modified to uncompressed-modified, where an incompressible file grows by let's say 1%, then if you inflate it to 1 TB you get a -10 GB score. – Nick T – 2014-05-24T02:52:41.570

@NickT Instead of giving a higher score, a compressed size larger than the original should give 0; and if someone tries to exploit that by inflating the size before compression, your scoring system will penalize that already, wouldn't it? – Sylwester – 2014-05-25T01:34:40.277

@Sylwester true, I'll just cut it off at 0. Edited accordingly – Nick T – 2014-05-25T01:46:13.987

Answers

8

Python, score = 120

import sys,hashlib
i=0
for c in sys.stdin.read():sys.stdout.write(chr(ord(c)^ord(hashlib.md5(str(i)).digest()[0])));i+=1

Makes a one-time pad using MD5 in counter mode and XORs the file with it. This has the advantage that the original and disrupted files are the same size, and that the disruptor and fixer are the same program.

The compressed disrupted files are larger than the originals.
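
For readers on Python 3, roughly the same idea looks like this (an illustrative sketch, not the scored submission above):

import sys, hashlib

data = sys.stdin.buffer.read()
# One keystream byte per position: the first digest byte of md5(counter),
# as in the Python 2 entry. XOR is its own inverse, so the same program
# both disrupts and fixes.
out = bytes(b ^ hashlib.md5(str(i).encode()).digest()[0]
            for i, b in enumerate(data))
sys.stdout.buffer.write(out)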

Keith Randall

Posted 2014-05-23T22:53:03.150

Reputation: 19 865

I adjusted scoring so if the zipped files are larger than their original counterparts you are not penalized and they just score 0. Not sure which way the difference was for your files but you may wish to update the score – Nick T – 2014-05-25T01:47:34.323

@NickT: updated. – Keith Randall – 2014-05-25T16:28:05.873

8

C, 51=51+0+0+0+0

main(c){for(;c=~getchar();putchar(~c^rand()>>23));}

Golf tricks aside, this program loops over each byte of standard input and exclusive-ors it with an infinite pad from rand(). I tested this with the rand() in the libc of OpenBSD 5.5.

Usage:

./scramble <orig >scrambled
./scramble <scrambled >orig.copy

To test my program, I wrote a 57-line shell script, test.sh, to compile it and calculate my score.

$ sh test.sh
[1/4] Compiling scramble...
/tmp//ccbcB43x.o(.text+0x6): In function `main':
: warning: rand() isn't random; consider using arc4random()
[2/4] Scrambling files...
[3/4] Compressing scrambled files...
[4/4] Checking descrambler...
SCORE: 51=51+0+0+0+0
You may wish to rm -rf tmp.9Tjw89dgCs
$ ls -l tmp.9Tjw89dgCs/
total 43032
-rw-r--r--  1 kernigh  kernigh  1659874 May 28 17:23 filament.jpg.cp
-rw-r--r--  1 kernigh  kernigh  1659874 May 28 17:23 filament.jpg.sc
-rw-r--r--  1 kernigh  kernigh  1660016 May 28 17:23 filament.jpg.sc.xz
-rw-r--r--  1 kernigh  kernigh  5589891 May 28 17:23 pg100.txt.cp
-rw-r--r--  1 kernigh  kernigh  5589891 May 28 17:23 pg100.txt.sc
-rw-r--r--  1 kernigh  kernigh  5590232 May 28 17:23 pg100.txt.sc.xz
-rwxr-xr-x  1 kernigh  kernigh     8564 May 28 17:23 scramble

Notes about rand() and the right shift

No compression algorithm can compress random data. I can disguise pg100.txt and filament.jpg as random data by scrambling them with a stream cipher.

My first idea was to exclusive-or the plaintext with a pad to make the ciphertext, then store both the ciphertext and the pad in the scrambled file. This would increase the size of the file, and increase my score. The obvious choice is to use the same pad for every file and store only the ciphertext in the scrambled file. If I just call rand(), it uses the default seed of 1 and makes the same pad every time.

OpenBSD 5.5 defines rand() in stdlib.h and rand.c:

/* from stdlib.h */
#define RAND_MAX    0x7fffffff

/* from rand.c */
static u_int next = 1;

int
rand_r(u_int *seed)
{
    *seed = *seed * 1103515245 + 12345;
    return (*seed % ((u_int)RAND_MAX + 1));
}

int
rand(void)
{
    return (rand_r(&next));
}

This is a linear congruential generator. The big flaw is that low bits have short periods. The 1st bit has a period of 2: if you flip a coin with rand()&1, it would go heads, tails, heads, tails, and so on. The nth bit has a period of 2^n. There are 31 bits, so the whole sequence has a period of 2^31.

LZMA2 can find patterns in short periods and compress them. The shortest code ~c^rand() takes the low 8 bits and does not prevent compression. The right shift in ~c^rand()>>9 helps, but not enough. I use ~c^rand()>>23.

  • ~c SCORE: 4227957=40+0+0+4019391+208526
  • ~c^rand() SCORE: 2474616=47+0+0+2463735+10834
  • ~c^rand()>>9 SCORE: 350717=50+0+0+350667+0
  • ~c^rand()>>23 SCORE: 51=51+0+0+0+0
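
To see the short low-bit periods concretely, here is a small Python sketch of the same LCG (illustration only, not part of the submission):

# OpenBSD's rand(): a linear congruential generator, reduced mod 2^31.
state = 1                          # default seed
def rand():
    global state
    state = (state * 1103515245 + 12345) & 0xffffffff
    return state & 0x7fffffff      # RAND_MAX = 0x7fffffff

print([rand() & 1 for _ in range(8)])    # lowest bit alternates: 0, 1, 0, 1, ...
print([rand() >> 23 for _ in range(4)])  # bits 23..30 look far less regular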

kernigh

Posted 2014-05-23T22:53:03.150

Reputation: 2 615

5

BrainFuck: 129 (129+0+0+0+0)*

random.bf (linefeeds added for readability)

,+[->>>>++[<++++++++[<[<++>-]>>[>>]+>>+[-[->>+<<<[<[<<]<
+>]>[>[>>]]]<[>>[-]]>[>[-<<]>[<+<]]+<<]<[>+<-]>>-]<[-<<+
>>]<<.,+[->]>>>]]

To create unrandom.bf you need to change the last + in the second line.

Most of the code is based on Daniel B Cristofani's Rule 30 based random number generator, adapted to add the generated number to each input byte and to terminate when there is no more input.

*I've tested the 212992 bytes it has processed so far (after 12 hours), and both files turn into a 213064-byte compressed file. I guess it might be done by the end of the week so I'd know for sure, but I don't want to wait with posting. I'll update the score if it's wrong, but I'll keep the solution, since Rule 30 rocks!

Trivia: Rule 30 was discovered by Stephen Wolfram in 1983 and according to Wikipedia it's used to produce random integers in Mathematica.
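
For reference, the Rule 30 update is new = left XOR (center OR right). A rough Python sketch of a Rule 30 byte stream built from the center column (illustration only, not a port of the BrainFuck program above):

def rule30_bytes(width=257):
    # Cyclic tape seeded with a single 1; each output byte packs eight
    # successive center-column bits.
    cells = [0] * width
    cells[width // 2] = 1
    while True:
        byte = 0
        for _ in range(8):
            byte = byte << 1 | cells[width // 2]
            cells = [cells[i - 1] ^ (cells[i] | cells[(i + 1) % width])
                     for i in range(width)]
        yield byte

stream = rule30_bytes()
print([next(stream) for _ in range(8)])  # first few pseudo-random bytes

Adding such a byte to each input byte (and subtracting it on the way back) gives a reversible scrambler along the lines of this answer.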

Compilation and running:

It uses exponential time and space (it iterates over 32 more cells per character processed), so it requires a BrainFuck runtime that has at least 178,876,517 cells to encode the Shakespeare file, does not treat non-ASCII input as Unicode, has cells wider than 8 bits, and uses -1 as EOF (to distinguish between 255 and -1). I usually use other people's interpreters, but this time I need to plug and promote my own:

jitbf --eof -1 -b 16 -c 200000000 random.bf < pg100.txt > pg100.txt.ran
jitbf --eof -1 -b 16 -c 200000000 random.bf < Glühwendel_brennt_durch.jpg > Glühwendel_brennt_durch.jpg.ran

jitbf compiles BrainFuck to optimized C and abuses Perl's Inline::C to run it. It's bundled with my Extended BrainFuck compiler. With the cell width and count given in the arguments, it will allocate about 400 MB.

Sylwester

Posted 2014-05-23T22:53:03.150

Reputation: 3 678

3

CJam, 22 bytes

G,~q{5$H$+255%_@^o}/];

This uses a lagged Fibonacci generator with recurrence relation s(n) = (s(n-5) + s(n-16)) % 255 (which I selected by mistake, but it works nevertheless) and a trivial seed to generate a pseudo-random stream of bytes, which it then XORs with the input.

I've tested my code with CJam 0.6, which was published on May 1, 2014.

How it works

G,~                    e# Dump 0, 1, ... and 15 on the stack.
   q                   e# Read from STDIN.
    {             }/   e# For each character in the input.
     5$H$              e# Copy the sixth and 19th element from the stack.
         +255%         e# Push their sum modulo 255.
              _@       e# Duplicate and rotate the character on top.
                ^o     e# XOR and print.
                    ]; e# Clear the stack.
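
A rough Python rendering of the same idea, using the recurrence as stated rather than a byte-for-byte port of the CJam (illustration only):

import sys

def keystream():
    s = list(range(16))              # trivial seed: 0 .. 15
    while True:
        v = (s[-5] + s[-16]) % 255   # s(n) = (s(n-5) + s(n-16)) % 255
        s = s[1:] + [v]              # keep the 16 most recent values
        yield v

data = sys.stdin.buffer.read()
ks = keystream()
# XOR with a fixed keystream is an involution, so one script disrupts and fixes.
sys.stdout.buffer.write(bytes(b ^ next(ks) for b in data))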

Score

$ LANG=en_US
$ alias cjam='java -jar /usr/local/share/cjam/cjam-0.6.jar'
$ cjam thwart.cjam < pg100.txt > pg100.txt~
$ cjam thwart.cjam < pg100.txt~ > pg100.txt~~
$ diff -s pg100.txt pg100.txt~~
Files pg100.txt and pg100.txt~~ are identical
$ cjam thwart.cjam < Gluehwendel_brennt_durch.jpg > Gluehwendel_brennt_durch.jpg~
$ cjam thwart.cjam < Gluehwendel_brennt_durch.jpg~ > Gluehwendel_brennt_durch.jpg~~
$ diff -s Gluehwendel_brennt_durch.jpg Gluehwendel_brennt_durch.jpg~~
Files Gluehwendel_brennt_durch.jpg and Gluehwendel_brennt_durch.jpg~~ are identical
$ xz -kz5 pg100.txt~ Gluehwendel_brennt_durch.jpg~
$ wc -c thwart.cjam pg100.txt* Gluehwendel_brennt_durch.jpg*
      22 thwart.cjam
 5589889 pg100.txt
 5589889 pg100.txt~
 5589889 pg100.txt~~
 5590232 pg100.txt~.xz
 1659874 Gluehwendel_brennt_durch.jpg
 1659874 Gluehwendel_brennt_durch.jpg~
 1659874 Gluehwendel_brennt_durch.jpg~~
 1660016 Gluehwendel_brennt_durch.jpg~.xz
28999559 total

Dennis

Posted 2014-05-23T22:53:03.150

Reputation: 196 637

3

PHP, 117 + 0 + 0 + 0 + 0 = 117

Because would you really entrust the task of mangling your data beyond recognition to any other language?

<?=substr(gmp_export(gmp_invert(2*gmp_import($s=stream_get_contents(STDIN))+1,$m=2*gmp_pow(256,strlen($s)))/2+$m),1);

While all the other solutions are based on “secure” constructions like “random number generators” or “military-grade cryptography”, this one simply interprets strings as representing odd numbers modulo 2⋅256^length, and computes their modular inverse.
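
A rough Python 3.8+ rendering of the same trick (illustration only, not the PHP submission): treat the input as an odd number modulo 2⋅256^length, invert it, and write the inverse back out. Since the inverse of the inverse is the original number, the same script both disrupts and fixes.

import sys

data = sys.stdin.buffer.read()
L = len(data)
m = 2 * 256 ** L                         # modulus 2 * 256^length
n = 2 * int.from_bytes(data, 'big') + 1  # force an odd (hence invertible) value
inv = pow(n, -1, m)                      # modular inverse (Python 3.8+)
sys.stdout.buffer.write((inv // 2).to_bytes(L, 'big'))  # drop the forced low bit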

Demo:

$ php thwart.php < 100.txt.utf-8 > 100.txt.utf-8~
$ php thwart.php < 100.txt.utf-8~ > 100.txt.utf-8~~
$ diff -s 100.txt.utf-8 100.txt.utf-8~~
Files 100.txt.utf-8 and 100.txt.utf-8~~ are identical
$ php thwart.php < Glühwendel_brennt_durch.jpg > Glühwendel_brennt_durch.jpg~
$ php thwart.php < Glühwendel_brennt_durch.jpg~ > Glühwendel_brennt_durch.jpg~~
$ diff -s Glühwendel_brennt_durch.jpg Glühwendel_brennt_durch.jpg~~
Files Glühwendel_brennt_durch.jpg and Glühwendel_brennt_durch.jpg~~ are identical
$ xz -kz5 100.txt.utf-8~ Glühwendel_brennt_durch.jpg~
$ wc -c *
 5589889 100.txt.utf-8
 5589889 100.txt.utf-8~
 5590232 100.txt.utf-8~.xz
 5589889 100.txt.utf-8~~
 1659874 Glühwendel_brennt_durch.jpg
 1659874 Glühwendel_brennt_durch.jpg~
 1660016 Glühwendel_brennt_durch.jpg~.xz
 1659874 Glühwendel_brennt_durch.jpg~~
     117 thwart.php
28999654 total

Anders Kaseorg

Posted 2014-05-23T22:53:03.150

Reputation: 29 242

2

shell script, 203

id|gpg --batch --passphrase-fd 0 --personal-compress-preferences Uncompressed $1 $2

Running it:

% sh break.sh -c pg100.txt                       
% sh break.sh -d pg100.txt.gpg > pg100.txt-original
gpg: CAST5 encrypted data
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected
% diff -s pg100.txt pg100.txt-original
Files pg100.txt and pg100.txt-original are identical
% sh break.sh -c Glühwendel_brennt_durch.jpg
% sh break.sh -d Glühwendel_brennt_durch.jpg.gpg > Glühwendel_brennt_durch.jpg-original
gpg: CAST5 encrypted data
gpg: encrypted with 1 passphrase
gpg: WARNING: message was not integrity protected
% diff -s Glühwendel_brennt_durch.jpg Glühwendel_brennt_durch.jpg-original
Files Glühwendel_brennt_durch.jpg and Glühwendel_brennt_durch.jpg-original are identical
% xz -kz5 Glühwendel_brennt_durch.jpg.gpg 
% xz -kz5 pg100.txt.gpg 
% ls -ln
total 28340
-rw-r--r-- 1 1000 1000      84 May 24 04:33 break.sh
-rw-r--r-- 1 1000 1000 1659874 Jan 19 17:22 Glühwendel_brennt_durch.jpg
-rw-r--r-- 1 1000 1000 1659943 May 24 04:46 Glühwendel_brennt_durch.jpg.gpg
-rw-r--r-- 1 1000 1000 1660084 May 24 04:46 Glühwendel_brennt_durch.jpg.gpg.xz
-rw-r--r-- 1 1000 1000 1659874 May 24 04:46 Glühwendel_brennt_durch.jpg-original
-rw-r--r-- 1 1000 1000 5589891 May 24 03:55 pg100.txt
-rw-r--r-- 1 1000 1000 5589941 May 24 04:43 pg100.txt.gpg
-rw-r--r-- 1 1000 1000 5590284 May 24 04:43 pg100.txt.gpg.xz
-rw-r--r-- 1 1000 1000 5589891 May 24 04:43 pg100.txt-original

Not very portable, but it could be made so at the cost of a couple of bytes. Requires GnuPG (an implementation with OpenSSL would be possible, too). The ~50-byte difference between the encoded files and the originals can probably be saved.

Scoring:

84 + abs(1659874 - 1659943) + max(1659874 - 1660084, 0) + abs(5589891 - 5589941) + max(5589891 - 5590284, 0) = 203

copy

Posted 2014-05-23T22:53:03.150

Reputation: 6 466

1

Python, score=183+7+6+0+0=196

The scoring penalizes you for making the file completely incompressible, since the compressed file then ends up larger than the original due to compression overhead. Thus my program makes the files slightly less than totally incompressible:

import sys
from random import randint as J,seed
x=sys.stdin.read()
seed(ord(x[1]))
n=int(2362*J(1,2)**2.359)
sys.stdout.write(x[:n]+''.join(chr(ord(c)^J(0,255))for c in x[n:]))

Result:

Laxori@Laxori-PC /cygdrive/f/Programming/lzkill
$ cat photo.jpg | python break.py > photo.jpg~; cat photo.jpg~ | python break.py > photo.jpg~~; diff photo.jpg photo.jpg~~; xz -kz5 photo.jpg~

Laxori@Laxori-PC /cygdrive/f/Programming/lzkill
$ cat pg100.txt | python break.py > pg100.txt~; cat pg100.txt~ | python break.py > pg100.txt~~; diff pg100.txt pg100.txt~~; xz -kz5 pg100.txt~

Laxori@Laxori-PC /cygdrive/f/Programming/lzkill
$ ls -l
total 28337
----------+ 1 Laxori mkpasswd     183 2014-05-24 13:43 break.py
----------+ 1 Laxori mkpasswd 5589891 2014-05-23 19:19 pg100.txt
-rw-r--r--+ 1 Laxori mkpasswd 5589891 2014-05-24 13:45 pg100.txt~
-rw-r--r--+ 1 Laxori mkpasswd 5589884 2014-05-24 13:45 pg100.txt~.xz
-rw-r--r--+ 1 Laxori mkpasswd 5589891 2014-05-24 13:45 pg100.txt~~
----------+ 1 Laxori mkpasswd 1659874 2014-05-23 19:19 photo.jpg
-rw-r--r--+ 1 Laxori mkpasswd 1659874 2014-05-24 13:44 photo.jpg~
-rw-r--r--+ 1 Laxori mkpasswd 1659880 2014-05-24 13:44 photo.jpg~.xz
-rw-r--r--+ 1 Laxori mkpasswd 1659874 2014-05-24 13:44 photo.jpg~~

Laxori@Laxori-PC /cygdrive/f/Programming/lzkill
$ python
Python 2.5.2 (r252:60911, Dec  2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 183 + abs(5589891-5589884) + abs(1659874-1659880)
196

Claudiu

Posted 2014-05-23T22:53:03.150

Reputation: 3 870

With the current rules, there is no penalty for a compressed file being larger. Did the rules change? If so, that change was unfair to this program. – kernigh – 2014-05-28T18:17:05.273

@kernigh: Yeah they changed after I posted it. Truly they should have been the way they are now from the start. – Claudiu – 2014-05-28T19:39:26.043