Diff compression

20

3

For this challenge, you need to compress a diff. A diff is some data that represents the difference between two strings. For this challenge, you need to provide one or more programs that can:

  1. Input A and B, and output a diff, C
  2. Input A and C, and output B
  3. Input B and C, and output A

The goal is to make the diff, C, as small as possible. The diff can be anything: a string, a number, a blob of data. We just care about the size (number of bytes).

I have 50 test cases that can found on Github. Each test case consists of two space-separated URLs which point to the 2 files you need to diff. (These test cases originated from PPCG members' Github profiles. Thanks all!)

All three tasks above should take under a minute to run on a reasonably powered computer (for each test case).

Your score is equal to the total size (in bytes) of all 50 diffs, lower is better. Hardcoding diffs in your program is not allowed (I reserve the right to change the test cases to prevent hardcoding). Builtins that produce a diff (like diffutils) are not allowed.

Nathan Merrill

Posted 2016-08-02T16:53:54.283

Reputation: 13 591

Question was closed 2017-01-17T16:20:22.723

4What precisely is a diff? – Conor O'Brien – 2016-08-02T16:54:58.730

Anything you want it to be, really. Informally, its a string that represents the differences between A and B – Nathan Merrill – 2016-08-02T16:56:14.753

For tasks 2 and 3, do we have to dynamically figure out which which file is which, i.e., do we get the contents or the name of the file as input? Also, is built-in compression allowed? – Dennis – 2016-08-02T18:12:19.450

@Dennis You can make 3 different functions/programs if you'd like, and assume that they each get the correct input (as a string or filename). Built-in compression is allowed. – Nathan Merrill – 2016-08-02T18:14:59.983

In other words, you can assume that 1,2,3 are completely independent tasks. – Nathan Merrill – 2016-08-02T18:15:46.403

You explain what is a diff, but what is a string? Any sequence of bytes? Just printables? Can it contains null bytes? – edc65 – 2016-08-02T18:50:40.510

@edc65 a sequence of characters. You should be able to handle unprintables. However, I don't care if you assume a certain encoding.

– Nathan Merrill – 2016-08-02T19:12:50.517

Can we assume all input is ASCII? – Socratic Phoenix – 2016-08-02T20:30:21.583

@SocraticPhoenix yes. – Nathan Merrill – 2016-08-02T21:14:28.563

Hmm.. can we also compress the result of our algorithm with gzip or similar? (Edit: am I asking the same thing as Dennis?) – Socratic Phoenix – 2016-08-02T21:25:35.270

@SocraticPhoenix yes, that's the same question as Dennis, and its still allowed :) – Nathan Merrill – 2016-08-02T21:48:43.233

So we are allowed to depend on the fact that A is first and B is second? Basically, can we use the order of the input to our advantage? – Socratic Phoenix – 2016-08-06T02:07:03.077

@SocraticPhoenix absolutely. – Nathan Merrill – 2016-08-06T02:07:29.723

umm.. file not found Edit: 13th line, first entry

– Socratic Phoenix – 2016-08-06T02:49:26.853

Bah, link rot already? There are now 49 test cases. – Nathan Merrill – 2016-08-06T02:52:29.283

1More link rot: numbering test case pairs by 1-base line index; both pairs of test cases 3, 13, 14, 15, 16, 17, 18, 19, 20, 21 are all 404. Outside of these, I managed to retrieve every other case. – H Walters – 2017-01-16T03:10:38.110

3I'm closing this question because it is largely unanswered and many of the old links I was using as test cases no longer work. Feel free to update the question and reopen if you wish. – Nathan Merrill – 2017-01-17T16:10:06.857

I think it is still an interesting question but maybe you could update the links to have a static .tar file on github? – Seth – 2017-01-17T17:41:49.140

@Seth I agree, but getting all of the links was quite a bit of work, and I'm not really motivated to do it again. If you are up for the task, be my guest :) – Nathan Merrill – 2017-01-17T17:49:54.107

1

Done. The GIST is https://gist.github.com/sethhillbrand/64066935e3f8c0fac75d75edd43c9ef8

The second file is a uuencoded archive of the 40 remaining pairs of test cases.

– Seth – 2017-01-17T18:44:37.423

@Dennis: Question doesn't seem unclear. Could you add a comment to specify what your concern is? – Seth – 2017-01-17T22:44:43.040

@Seth sorry for the late reply. The question was closed by my request. I've tried opening up your commits.uu file, and I'm struggling to get it decoded, which is why I haven't reopened the question. – Nathan Merrill – 2017-01-17T22:47:40.783

Got it. To simplify, I've updated the gist with the files base64 encoded as this seems to be more standard than uudecode these days. To extract the files, save the gist as commits.tar.xz.64 and execute base64 -d commits.tar.xz.64|tar -xvJ – Seth – 2017-01-17T23:22:41.183

Answers

0

Is my answer valid?

set f [open commits.txt]
while {![eof $f]} {scan [gets $f] %s\ %s a b; puts [string compare $a $b]}
close $f

testable on: http://www.tutorialspoint.com/execute_tcl_online.php?PID=0Bw_CjBb95KQMNmd4QkxvQUFsTnM

sergiol

Posted 2016-08-02T16:53:54.283

Reputation: 3 055

1You need to provide multiple programs (both a diff equivalent and a patch equivalent). If string compare diffs strings, it violates the "no builtins" rule. If it only compares strings (like the name suggests), it doesn't leave enough information to recreate a patch. – None – 2017-01-16T22:50:55.430

@ais523: builtins I understood it as command line commands. I know that string compare does not generate info to create a page, but there is no place in the question asking for it. – sergiol – 2017-01-16T22:54:13.323

From the question, "2. Input A and C, and output B". This is something that your submitted program can't do, and that in fact no program could do (as it doesn't have enough information). – None – 2017-01-16T22:56:06.253

@ais523: OK I misunderstood. – sergiol – 2017-01-16T23:03:10.280

@ais523: I don't think your statement is correct "in fact no program could do". If C is the diff between A and B, then given C and A, B is calculable. Maybe I missed your exact point – Seth – 2017-01-17T17:43:55.690

@Seth: Sergiol's definition of a "diff" destroys data, so it isn't reversible (and isn't really a diff). That was the point I was trying to get at. The "write a program to reverse the diffing operation" part of the question is a nice objective way to ensure that the diff actually is a diff. – None – 2017-01-17T20:10:21.840

@ais523: That makes sense. Agreed. – Seth – 2017-01-17T20:53:36.773