This is a very crude text encoding algorithm I came up with the other day. It won't work if the text has more than 256 different characters in it – but it does support Unicode, at least in my Python implementation.
The algorithm:
- Find all the unique characters in the text. Let's call this list unique-chars.
- Break the text into a sequence of tokens, which can be either words (continuous runs of alphanumeric characters and apostrophes) or symbols (any single character that isn't part of a word). If two symbols are adjacent (say, a comma followed by a space), treat each as its own token.
- Sort the tokens by frequency, most common first, and take the top (256 − length of unique-chars) of them. Concatenate the two lists – unique-chars and most-common-tokens – into one. Let's call this list substitution-table.
- Iterate through each token. If it's present in substitution-table, take the index at which the token is found and convert it into a single character (byte). Otherwise, iterate through each character of the token and perform the same conversion character by character.
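For concreteness, here is a minimal Python 3 sketch of the steps above. The function name, the tokenising regex, and the tie-breaking order of equally common tokens are my own reading of the spec, not necessarily identical to the author's implementation:

```python
import re
from collections import Counter

def encode(text):
    """Sketch of the substitution scheme described above."""
    # Every distinct character needs a table slot, so at most 256 are allowed.
    unique_chars = sorted(set(text))
    assert len(unique_chars) <= 256, "scheme only supports up to 256 distinct characters"

    # Tokenise: words are runs of alphanumerics/apostrophes; symbols are single characters.
    tokens = re.findall(r"[\w']+|[^\w']", text)

    # The most common tokens fill the remaining table slots.
    counts = Counter(tokens)
    top = [t for t, _ in counts.most_common(256 - len(unique_chars))]
    table = unique_chars + top
    index = {t: i for i, t in enumerate(table)}

    # Emit one byte per table hit, or one byte per character otherwise.
    out = bytearray()
    for tok in tokens:
        if tok in index:
            out.append(index[tok])
        else:
            out.extend(index[ch] for ch in tok)
    return bytes(out), table
```

With this sketch, `encode("hello hello world")` produces five bytes, since each word and each space collapses to a single table index.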
This way, the most common words become one byte long, and those which are less common stay the same length. I know this is not very efficient, but it works.
For reference, use the King James Bible, from Project Gutenberg; 4 452 069 bytes, MD5: 3b83dd48fa7db8df0819108f6d50f64f. Download here.
The processed version is 3 422 006 bytes, MD5: 17c52a22abe48dd4135a577e2a4aff8d. Download here.
No restrictions apply on the language used, external libraries/programs, dirty tricks, or operating system.
My Python 3 implementation runs in about 3.3–3.5 seconds on average. Speed takes precedence over everything else. Compiled-language implementations, especially in C, would be appreciated. A solution in APL or Perl would be most welcome; I'm curious what they'd look like.
As for how to time the solution: time 30 executions (possibly clearing any caches between executions), and report both the lowest time (indicative of how long the machine takes to run the program in the best conditions) and the rest of the times, for completeness' sake. Obviously this is not perfectly accurate, but it'll do as an indicative value.
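The timing protocol above can be sketched as a small Python harness. The function name and run count are placeholders; substitute the command that runs your solution:

```python
import subprocess
import time

def time_runs(cmd, runs=30):
    """Time `cmd` as suggested above: report the minimum and all samples."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # run the solution as a subprocess
        samples.append(time.perf_counter() - start)
    return min(samples), samples
```

Note this measures wall-clock time including process start-up; it does not clear OS-level caches, which would need platform-specific steps.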
That said, happy coding~!
It already exists and it's called 'LZMA' for short. Wikipedia link: http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm – Ismael Miguel – 2014-02-18T21:13:53.890
This seems flawed as a compression algorithm, because the decompressor doesn't have the substitution table. – Peter Taylor – 2014-02-18T21:15:00.407
@PeterTaylor the substitution table can be exported as JSON in a separate file, but that doesn't really matter. What I want to do here is the compression only. – Jetlef – 2014-02-18T21:21:11.477
@IsmaelMiguel thanks for the link! But, from what I gather, LZMA is a generic compression tool. Mine is tailored specifically for text (I wasn't clear enough in the question?). Reading the "Dictionary coder" article on Wikipedia, I realise how my technique is similar (everything has already been invented…) – but I am not aiming to re-implement LZMA. I'd like to implement a simple (although less efficient) scheme. – Jetlef – 2014-02-18T21:27:08.440
It works better for text, so I guess it is more focused on text. And according to your description, an LZMA encoder is "fine". – Ismael Miguel – 2014-02-18T21:30:35.103
@Jetlef Is there a chance that you can provide a smaller example rather than the Gutenberg Bible? Hamlet's soliloquy perhaps? It's a lot to go through... – WallyWest – 2014-02-18T22:39:25.870
@WallyWest Sure thing! Here's Poe's "The Raven"; it's about 8 KB. – Jetlef – 2014-02-18T23:01:36.560