Analysing the most common words in a text, and replacing them with a single byte


This is a very crude text encoding algorithm I came up with the other day. It won't work if the text has more than 256 different characters in it – but it does support Unicode, at least in my Python implementation.

The algorithm:

  1. Find all the unique characters in the text. Let's call this list unique-chars.
  2. Break the text into a sequence of tokens, which can be either words (continuous runs of alphanumeric characters and apostrophes) or symbols (any single character that isn't part of a word). If two symbols appear in a row (say, a comma followed by a space), each one is its own token.
  3. Count the tokens and sort them by frequency, most commonly occurring first. Then take the top (256 - length of unique-chars) of them. Put the two lists – unique-chars and most-common-tokens – together into one. Let's call this list substitution-table.
  4. Iterate through each token. If it's present in substitution-table, take the index at which it is found and convert that index into a character. Otherwise, perform the same conversion on each character of the token individually (see the sketch below).
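
A rough, unoptimised Python 3 sketch of the above (not the implementation timed below – the function and variable names are just illustrative):

import re
from collections import Counter

def encode(text):
    # 1. All unique characters, in order of first appearance.
    unique_chars = list(dict.fromkeys(text))

    # 2. Tokenize: words (runs of alphanumerics/apostrophes) or single symbols.
    tokens = re.findall(r"[\w']+|[^\w']", text)

    # 3. Fill the remaining table slots with the most common multi-character tokens.
    counts = Counter(t for t in tokens if len(t) > 1)
    table = unique_chars + [t for t, _ in counts.most_common(256 - len(unique_chars))]
    index = {tok: i for i, tok in enumerate(table)}

    # 4. One byte per token if it's in the table, otherwise one byte per character.
    out = bytearray()
    for tok in tokens:
        if tok in index:
            out.append(index[tok])
        else:
            out.extend(index[c] for c in tok)
    return bytes(out), table

Decoding needs the same table, which would have to be stored or transmitted separately (see the comments below).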

This way, the most common words become one byte long, while less common words stay the same length (one byte per character). I know this is not very efficient, but it works.

For reference, use the King James Bible from Project Gutenberg; 4 452 069 bytes, MD5: 3b83dd48fa7db8df0819108f6d50f64f. Download here.

The processed version is 3 422 006 bytes, MD5: 17c52a22abe48dd4135a577e2a4aff8d. Download here.

There are no restrictions on the language used, external libraries/programs, dirty tricks, or operating system.

My Python 3 implementation runs in about 3.3–3.5 seconds on average. Speed takes precedence over everything else. Compiled-language implementations, especially in C, would be appreciated. A solution in APL or Perl would also be most welcome; I'm curious what they'd look like.

As for how to time the solution, I'd say: time 30 executions (possibly clearing the caches after each execution, in case caching occurs), and report both the lowest time (which indicates how long the machine takes to run the program under the best conditions) and the rest of the times, for completeness' sake. Obviously, this is not perfectly accurate, but it'll do for an indicative value.
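
For example, a simple harness along these lines would do (encoder.py and the input filename are just placeholders for whatever solution is being timed):

import subprocess, time

times = []
for _ in range(30):
    start = time.perf_counter()
    # Replace the command with however your solution is invoked.
    subprocess.check_call(["python3", "encoder.py", "king-james-bible.txt"])
    times.append(time.perf_counter() - start)

print("min: %.3f s" % min(times))
print("all:", ", ".join("%.3f" % t for t in times))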

That said, happy coding~!

Jetlef

Posted 2014-02-18T21:09:47.877

Reputation: 121

It already exists and it's called 'LZMA' for short. wikipedia link: http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm

– Ismael Miguel – 2014-02-18T21:13:53.890

This seems flawed as a compression algorithm, because the decompressor doesn't have the substitution table. – Peter Taylor – 2014-02-18T21:15:00.407

@PeterTaylor the substitution table can be exported as JSON in a separate file, but that doesn't really matter. What I want to do here is the compression only. – Jetlef – 2014-02-18T21:21:11.477

@IsmaelMiguel thanks for the link! But, from what I gather, LZMA is a generic compression tool, while mine is tailored specifically for text (I wasn't clear enough in the question?). Reading the "Dictionary coder" article on Wikipedia, I realise how similar my technique is (everything has already been invented…) – but I am not aiming to re-implement LZMA. I'd like to implement a simple (although less efficient) scheme. – Jetlef – 2014-02-18T21:27:08.440

It works better for text, so I guess it is more focused on text. And according to your description, an LZMA encoder is "fine". – Ismael Miguel – 2014-02-18T21:30:35.103

@Jetlef Is there a chance that you can provide a smaller example rather than the Gutenberg Bible? Hamlet's soliloquy perhaps? It's a lot to go through... – WallyWest – 2014-02-18T22:39:25.870

@WallyWest Sure thing! Here's Poe's "The Raven"; it's about 8 KB.

– Jetlef – 2014-02-18T23:01:36.560

Answers


Perl - 154 bytes (KJB in ~1.94s)

while(<>){push@w,/[\w\']+|\W/g}
\@c{length>1?$w{$_}++?():/./g:$_}for@w;
@t{sort{$w{$b}-$w{$a}}keys%w}=grep!exists$c{$_},map{chr}0..255;
print$t{$_}//$_ for@w

This is substantially golfed, but I can say with confidence that it hasn't affected the performance in any noticeable way.

Sample Usage:

$ perl encoder.pl < king-james-bible.txt > kjb.cmp

The resulting file size is 3,253,677 bytes, and the runtime is approximately 1.94 seconds. If you want to test correctness, you can export a translation table by appending the following three lines of code:

;use Storable;
%u=reverse%t;
store\%u,'trans.tbl'

This will create, and possibly overwrite, a file named trans.tbl in the current working directory. Then, decode with the following script:

#!perl -n0
use Storable;
%t=%{retrieve 'trans.tbl'};
print$t{$_}//$_ for/./sg

Sample usage:

$ perl decoder.pl < kjb.cmp > kjb.txt

The output is exactly 4,452,069 bytes, as expected.


Unrolled and commented

while(<>){      # for each line
  push@w,/[\w\']+|\W/g # tokenize, add tokens to array
}

\@c{            # mark the following chars
  length>1?     # token is 2 or more chars
    $w{$_}++    # increment occurrence in word hash
    ?           # if this word has already been seen
      ()        # mark nothing
    :           # else
      /./g      # mark each char individually
  :             # else
    $_          # mark the single char
}for@w;         # iterate over all tokens

@t{             # create the translation table
  sort{$w{$b}-$w{$a}}keys%w # most frequent words
}=grep!exists$c{$_},map{chr}0..255; # assign to an unmarked char

print           # output
  $t{$_}        # the value in the trans hash
    //          # or, if previous value is undefined
  $_            # output the untranslated string
for@w           # iterate over the tokens previously found

Time analysis

Min: 1.843s
Max: 1.984s
Avg: 1.939s

1.953, 1.953, 1.953, 1.843, 1.921, 1.937, 1.921, 1.953, 1.953, 1.921, 1.937, 1.968, 1.906, 1.984, 1.937, 1.968, 1.953, 1.921, 1.921, 1.937, 1.968, 1.937, 1.890, 1.968, 1.937, 1.953, 1.937, 1.937, 1.937, 1.953

According to profiling, the time usage for each statement is approximately the following:

1.09s:
while(<>){push@w,/[\w\']+|\W/g}

31.3ms:
\@c{length>1?$w{$_}++?():/./g:$_}for@w;

15.6ms:
@t{sort{$w{$b}-$w{$a}}keys%w}=grep!exists$c{$_},map{chr}0..255;

750ms:
print$t{$_}//$_for@w

Note that for the purpose of the time usage analysis, the result was actually written to a file. Excluding the final output, or piping to /dev/null instead, produces times 600–750 ms faster.


Major Performance Gains

  • The individual characters of a word are marked only the first time the word is seen, instead of every time. As noted by @VadimR, this alone saves around 2s, over 25% of the previous execution time.
  • The most frequent words are translated with unmarked characters, and marked characters left untranslated. This saves a great deal of time during output, because words not in the translation table can be output directly, without having to translate each individual character.
  • Tokenizing the file one line at a time - instead of all at once - is quite a bit faster.

primo

Posted 2014-02-18T21:09:47.877

Reputation: 30 891

+1, it's beautifully golfed, but because of the 'fastest code' tag I did some benchmarking and found that a ~40% boost can be gained by caching 'word' translations and using substr to mark characters. I can add to your answer if you don't mind :-) – user2846289 – 2014-02-20T21:51:50.347

@VadimR I think I see what you mean. If a word has already been found, then it isn't necessary to mark its constituent characters again (and again, and again). I can confirm that using /./sg is notably faster than split, but I never thought to try substr. – primo – 2014-02-21T05:44:00.293

Hmm, but words not in the translation table should be translated letter by letter, shouldn't they? Looking at what you've changed – I meant something slightly different. – user2846289 – 2014-02-21T12:48:35.760

I already nearly wrote an answer (or I'd prefer adding to yours), when I found a little flaw in my tests. Now I'm waiting till I get back to my 'normal' PC with 64-bit linux OS for proper testing - so I'll elaborate on details later. – user2846289 – 2014-02-21T12:53:26.907

@VadimR There's no reason that a single character cannot be represented by the character itself. In fact, it seems like a quite logical thing to do. – primo – 2014-02-21T13:00:23.827

Oh, I see. Yes, it's reasonable. – user2846289 – 2014-02-21T13:13:18.310

Now that you mark the characters of unseen words only, my ways to improve performance are no longer relevant. But, if you are interested, what I meant was: (1) re-use the %w hash (to hold translations, not counts) - no longer needed as you don't translate most words - but it gave a 10-15% boost; (2) use substr to access every single character instead of a regexp - another 20-25%. Even now, when you mark chars in unseen words only, accessing every character with substr gives the same speed. – user2846289 – 2014-02-21T22:53:01.553

Consider first line of revision 5 code. Replacing it with $c{substr$s,$_,1}=1 for 0..length($s)-1; @w=$s=~/[\w\']+|./gs; $w{$_}++ for grep {length>1} @w; gives aforementioned boost (input in $s, next line has [0..255-keys%c] as hash slice range). – user2846289 – 2014-02-21T22:53:47.590


C# - 1.4s

I use a different metric for word sorting – occurrences × (word.Length - 1) – since word.Length - 1 is how many bytes are saved each time the word is compressed. For example, the word 'a' appears a lot but saves nothing to compress (it's already a single byte), while "the" saves twice as much per occurrence as "at".

//Initial Size 4452066
//Final Size 3186918
//Time 1408 ms


using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

namespace SimpleCommonWordCompressor
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true)
            {
                var data = File.ReadAllText("king-james-bible.txt");
                var sw = Stopwatch.StartNew();

                var uniqueChars = data.Distinct().Select(c => c.ToString()).ToArray();

                var tokens = new List<string>();
                var currentWord = new List<char>();
                Action tryAddWord = () =>
                {
                    if (currentWord.Count > 0)
                    {
                        tokens.Add(string.Join("", currentWord));
                        currentWord.Clear();
                    }
                };
                var charToStringLU = Enumerable.Range(0, 256).Select(i => ((char) i).ToString()).ToArray();
                foreach (var datum in data)
                    if (char.IsLetterOrDigit(datum)) currentWord.Add(datum);
                    else
                    {
                        tryAddWord();
                        tokens.Add(charToStringLU[datum]);
                    }
                tryAddWord();

                var grpTokens = tokens.GroupBy(x => x).ToArray();

                var wordTable = uniqueChars.Concat(grpTokens
                    .OrderByDescending(g => g.Count() * (g.Key.Length - 1))
                    .Select(grp => grp.Key)
                    .Take(256 - uniqueChars.Length)).ToList();

                var wordDict = wordTable.ToDictionary(w => w, w => new[] { (byte)wordTable.IndexOf(w) });
                var charTable = new byte[256];

                for (byte i = 0; i < uniqueChars.Length; i++)
                    charTable[uniqueChars[i][0]] = i;

                var tokenToBinary = grpTokens.ToDictionary
                    (grp => grp.Key, grp =>{
                        byte[] index;
                        if (wordDict.TryGetValue(grp.Key, out index)) return index;
                        var newWordBinary = grp.Key.Select(c => charTable[c]).ToArray();
                        return newWordBinary;
                    });


                var binary = new List<byte>();

                foreach (var token in tokens)
                    binary.AddRange(tokenToBinary[token]);

                sw.Stop();
                Console.WriteLine("Initial Size " + data.Length);
                Console.WriteLine("Final Size " + binary.Count);
                Console.WriteLine("Time " + sw.ElapsedMilliseconds + " ms\n");
                Console.WriteLine("Press enter to run again");
                Console.ReadLine();
            }
        }
    }
}

NPSF3000

Posted 2014-02-18T21:09:47.877

Reputation: 374