20
3
For this challenge, you need to compress a diff. A diff is some data that represents the difference between two strings. For this challenge, you need to provide one or more programs that can:
- Input
A
andB
, and output a diff,C
- Input
A
andC
, and outputB
- Input
B
andC
, and outputA
The goal is to make the diff, C
, as small as possible. The diff can be anything: a string, a number, a blob of data. We just care about the size (number of bytes).
I have 50 test cases that can found on Github. Each test case consists of two space-separated URLs which point to the 2 files you need to diff. (These test cases originated from PPCG members' Github profiles. Thanks all!)
All three tasks above should take under a minute to run on a reasonably powered computer (for each test case).
Your score is equal to the total size (in bytes) of all 50 diffs, lower is better. Hardcoding diffs in your program is not allowed (I reserve the right to change the test cases to prevent hardcoding). Builtins that produce a diff (like diffutils
) are not allowed.
4What precisely is a diff? – Conor O'Brien – 2016-08-02T16:54:58.730
Anything you want it to be, really. Informally, its a string that represents the differences between
A
andB
– Nathan Merrill – 2016-08-02T16:56:14.753For tasks 2 and 3, do we have to dynamically figure out which which file is which, i.e., do we get the contents or the name of the file as input? Also, is built-in compression allowed? – Dennis – 2016-08-02T18:12:19.450
@Dennis You can make 3 different functions/programs if you'd like, and assume that they each get the correct input (as a string or filename). Built-in compression is allowed. – Nathan Merrill – 2016-08-02T18:14:59.983
In other words, you can assume that 1,2,3 are completely independent tasks. – Nathan Merrill – 2016-08-02T18:15:46.403
You explain what is a diff, but what is a string? Any sequence of bytes? Just printables? Can it contains null bytes? – edc65 – 2016-08-02T18:50:40.510
@edc65 a sequence of characters. You should be able to handle unprintables. However, I don't care if you assume a certain encoding.
– Nathan Merrill – 2016-08-02T19:12:50.517Can we assume all input is ASCII? – Socratic Phoenix – 2016-08-02T20:30:21.583
@SocraticPhoenix yes. – Nathan Merrill – 2016-08-02T21:14:28.563
Hmm.. can we also compress the result of our algorithm with gzip or similar? (Edit: am I asking the same thing as Dennis?) – Socratic Phoenix – 2016-08-02T21:25:35.270
@SocraticPhoenix yes, that's the same question as Dennis, and its still allowed :) – Nathan Merrill – 2016-08-02T21:48:43.233
So we are allowed to depend on the fact that A is first and B is second? Basically, can we use the order of the input to our advantage? – Socratic Phoenix – 2016-08-06T02:07:03.077
@SocraticPhoenix absolutely. – Nathan Merrill – 2016-08-06T02:07:29.723
umm.. file not found Edit: 13th line, first entry
– Socratic Phoenix – 2016-08-06T02:49:26.853Bah, link rot already? There are now 49 test cases. – Nathan Merrill – 2016-08-06T02:52:29.283
1More link rot: numbering test case pairs by 1-base line index; both pairs of test cases 3, 13, 14, 15, 16, 17, 18, 19, 20, 21 are all 404. Outside of these, I managed to retrieve every other case. – H Walters – 2017-01-16T03:10:38.110
3I'm closing this question because it is largely unanswered and many of the old links I was using as test cases no longer work. Feel free to update the question and reopen if you wish. – Nathan Merrill – 2017-01-17T16:10:06.857
I think it is still an interesting question but maybe you could update the links to have a static .tar file on github? – Seth – 2017-01-17T17:41:49.140
@Seth I agree, but getting all of the links was quite a bit of work, and I'm not really motivated to do it again. If you are up for the task, be my guest :) – Nathan Merrill – 2017-01-17T17:49:54.107
1
Done. The GIST is https://gist.github.com/sethhillbrand/64066935e3f8c0fac75d75edd43c9ef8
The second file is a uuencoded archive of the 40 remaining pairs of test cases.
@Dennis: Question doesn't seem unclear. Could you add a comment to specify what your concern is? – Seth – 2017-01-17T22:44:43.040
@Seth sorry for the late reply. The question was closed by my request. I've tried opening up your
commits.uu
file, and I'm struggling to get it decoded, which is why I haven't reopened the question. – Nathan Merrill – 2017-01-17T22:47:40.783Got it. To simplify, I've updated the gist with the files base64 encoded as this seems to be more standard than uudecode these days. To extract the files, save the gist as commits.tar.xz.64 and execute
base64 -d commits.tar.xz.64|tar -xvJ
– Seth – 2017-01-17T23:22:41.183