Golf an LZW encoder

6

0

Given an alphabet and a string, your job is to create the Lempel–Ziv–Welch compression of the string. Your implementation can either be a function with two parameters and a return value, or a full program that uses stdin and stdout.

Input

  • The alphabet, in the form of a string, from which you will have to create the initial dictionary: Each character should be mapped to its index in the string, where the first index is 0. You will be expanding this dictionary as you iterate through the algorithm.

  • The string to be compressed.

Output

  • The compressed string in the form of dictionary keys. This must be an array of integers if you're creating a function, or printed integers separated by either whitespace or newline (and nothing else) if you're making a full program.

Algorithm

After you have initialized the dictionary, the algorithm goes something like this:

  • Find the longest string W in the dictionary that matches the current input.
  • Emit the dictionary index for W to output and remove W from the input.
  • Add W followed by the next symbol in the input to the dictionary.
  • Repeat

Examples

"ABC", "AABAACABBA"  ==>  [0,0,1,3,2,4,5]
"be torn", "to be or not to be"  ==> [3,4,2,0,1,2,4,5,2,6,4,3,2,7,9,1]

Rules

  • No use of external resources
  • No predefined built-in functions or libraries of any kind that solves the problem for you.
  • Fore!

daniero

Posted 2014-02-05T13:08:50.827

Reputation: 17 193

Answers

4

GolfScript (52 51 chars)

{[\1/\{\.{.,3$<=},-1=.2$?@@,:L)[3$<]+@L>.}do;;]}:C;

Online demo

Dissection

{  }:C;

is standard function boilerplate.

# Wrap the values we compute in an array
[
    # Stack: alphabet uncompressed-string
    # Split the alphabet from a string into an array of 1-char strings
    \1/\
    # Stack: dictionary uncompressed-string
    {
        # Stack: ... dictionary uncompressed-string-suffix
        # Find the dictionary elements which are prefixes of uncompressed-string-suffix
        \.{.,3$<=},
        # The last of them must be the longest by construction
        -1=
        # Stack: ... uncompressed-string-suffix dictionary prefix
        # Find the index of prefix in dictionary
        .2$?
        # Stack: ... uncompressed-string-suffix dictionary prefix index
        # Push index down the stack
        @@
        # Stack: ... uncompressed-string-suffix index dictionary prefix
        # Assign len(prefix) to L and append a 1-char-longer prefix to dictionary
        ,:L)[3$<]+
        # Fetch up the uncompressed-string-suffix and chop off the first L chars
        @L>
        # Stack: ... index dictionary' uncompressed-string-suffix'
        # Duplicate uncompressed-string-suffix' for the do-loop condition test
        .
    }do
    # Stack: index0 index1 ... indexN dictionary' ""
    # Pop the last two to leave just the indexes
    ;;
]

Peter Taylor

Posted 2014-02-05T13:08:50.827

Reputation: 41 901

2I would probably upvote this if you provided an explanation of how it works :) – Timwi – 2014-02-07T23:11:54.377

3

Python (141 Characters)

def a(b,c,r=[]):
    if c:l,i,d=max((len(d),i,d) for i,d in enumerate(b) if c.find(d)==0)
    return a(list(b)+[c[:l+1]],c[l:],r+[i]) if c else r

Not golfed very small, but no chance of beating any of the golfscript solutions anyway.

schizodactyl

Posted 2014-02-05T13:08:50.827

Reputation: 161

2

Racket/R5RS Scheme: 359 bytes

(define(l d i)(let*((h(make-hash))(s string->list)(r(λ(x)(hash-ref h x #f)))(i(s i))(s(let l((s 0)(d(s d)))(if(null? d)s
(and(hash-set! h(car d)s)(l(+ s 1)(cdr d)))))))(let l((i(cdr i))(c(r(car i)))(o'())(s s))(if(null? i)(reverse(cons c o))
(let*((q(cons(car i)c))(n(r q)))(if n(l(cdr i) n o s)(and(hash-set! h q s)(l(cdr i)(r(car i))(cons c o)(+ s 1)))))))))

Usage:

(l "ABC" "AABAACABBA")             ; ==> (0 0 1 3 2 4 5)
(l "be torn" "to be or not to be") ; ==> (3 4 2 0 1 2 4 5 2 6 4 3 2 7 9 1)

Sylwester

Posted 2014-02-05T13:08:50.827

Reputation: 3 678

2

PHP 220

implemented as a function

<? function l($d,$s){@$n=strlen;@$r=substr;$d=str_split($d);$o=[];while($s){$l=0;foreach($d as$k=>$v){$b=$n($v);if($b<=$n($s)&&!strncmp($s,$v,$b)&&$b>$l){$l=$b;$i=$k;}}$o[]=$i;$d[]=$r($s,0,$l+1);$s=$r($s,$l);}return $o;}

Einacio

Posted 2014-02-05T13:08:50.827

Reputation: 436

The spec says that if you implement a function you should return the array: printing is for implementing a program. Correcting this actually saves you 3 chars. You can also save by removing the space after <? (TBH I'm not sure if the <? is actually required - maybe ask on meta?), and replacing if(foo)if(bar)if(baz) with if(foo&&bar&&baz). – Peter Taylor – 2014-02-05T20:03:00.423

@PeterTaylor i read wrong the bit about return. thanks for the correction and tips – Einacio – 2014-02-05T20:06:20.050

2

Perl, 188 177 156 145 137 112

sub l{($_,$s,@o)=@_;@d=split'';while($s){$s=~s/^$d[$_](.?)/push@o,$_;push@d,$&;$1/e&&last for reverse 0..$#d}@o}

i.e.

sub l {
    ($_,$s,@o)=@_;
    @d=split'';
    while($s){
        $s=~s/^$d[$_](.?)/push@o,$_;push@d,$&;$1/e
            && last for reverse 0..$#d
    }
    @o
}

Can be 109 if global @o is expected undefined or empty when entering sub. And yes, global variables are modified. Below is properly un-golfed version with lexicals.

use strict;
use warnings;

sub lzw {
    my ($alphabet, $string, @out) = @_;
    my @dict = split '', $alphabet;
    LOOP: while ($string) {
        for my $k (reverse 0..$#dict) {
            if ($string =~ s/^$dict[$k](.?)/$1/) {
                push @out, $k;
                push @dict, $&;
                next LOOP
            } 
        }
        die "we shouldn't be here!\n"
    }
    return @out
}

print qq/@{[lzw("ABC", "AABAACABBA")]}\n/;
print qq/@{[lzw("be torn", "to be or not to be")]}\n/;

.

perl lzw.pl
0 0 1 3 2 4 5
3 4 2 0 1 2 4 5 2 6 4 3 2 7 9 1

user2846289

Posted 2014-02-05T13:08:50.827

Reputation: 1 541

2

PHP: 176

Just making it return a function would require one byte less. It utilizes PHP arrays ($d)as Trie trees:

<? function l($d,$s){@$S=str_split;$o=[];$d=array_flip($S($d));foreach($S($s)AS$v){if(null===$t=@$d[$c.$v]){$d[$c.$v]=count($d);$o[]=$c;$c=$d[$v];}else$c=$t;}$o[]=$c;return$o;}

Usage:

print_r(l("ABC","AABAACABBA")) ; ==> array(0, 0, 1, 3, 2, 4, 5)

Sylwester

Posted 2014-02-05T13:08:50.827

Reputation: 3 678

2

SWI-Prolog, 313 244

The list of alphabet is no longer reversed. And it uses nth0 to brute force index + prefix pair.

b([],[]).
b([H|T],[[H]|R]):-b(T,R).
l(A,W,O):-b(A,B),r(B,W,O).
r(A,W,[I|T]):-m(A,W,I,E,0),append(E,R,W),(R=[N|_],!,append(E,[N],C),append(A,[C],B),r(B,R,T);T=[]).
m(A,W,I,E,N):-nth0(J,A,F),prefix(F,W),length(F,L),L>N,!,(m(A,W,I,E,L),!;I=J,E=F).

Usage:

l("be torn", "to be or not to be",O).
O = [3,4,2,0,1,2,4,5,2,6,4,3,2,7,9,1].

Old version

313 chars

Working on a reversed list of alphabet, which is convenient when I need to add new alphabet, but it is more roundabout to get the index.

This should do less work than the 244 chars version, but more code.

b([],O,O).
b([H|T],O,A):-b(T,O,[[H]|A]).
l(A,W,O):-b(A,B,[]),r(B,W,O).
r(A,W,[I|T]):-m(A,W,I,E,_),I>=0,append(E,R,W),(R=[N|_],!,append(E,[N],C),r([C|A],R,T);T=[]).
m([],_,-1,_,0).
m([H|T],W,I,E,M):-prefix(H,W),!,length(H,L),m(T,W,J,F,N),(N>L,!,M=N,I=J,E=F;M=L,length(T,D),I=D,E=H).
m([_|T],W,I,E,M):-m(T,W,I,E,M).

n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳

Posted 2014-02-05T13:08:50.827

Reputation: 5 683

1

GolfScript, 53 characters

n%~\1/\{.,,{)1$<2$?}%{)},-1=.p@.@=,:^)2$[<]+\^>.}do;;

The input must be given as two lines, first alphabet, second string to compress. The second example can be tested online.

Howard

Posted 2014-02-05T13:08:50.827

Reputation: 23 109

I'm currently working on a different approach which should be shorter - unfortunately not yet finished and I'm running out of time... – Howard – 2014-02-05T14:18:37.600

Our answers look quite similar apart from the identification of the longest known prefix. – Peter Taylor – 2014-02-05T15:40:48.897

1

Postscript, 388 267

It's my Perl answer re-written, just for fun:

/lzw {
    10 dict begin 
    /str exch def
    /alphabet exch def
    /concat
        {dup length 2 index length add string dup dup 4 index length 
        5 -1 roll putinterval 0 4 -1 roll putinterval}
    def
    /push
        {1 index load dup length /_n exch def 
        [exch aload pop _n 2 add -1 roll]def}
    def

    /dict [-1 alphabet length {
        1 add dup alphabet exch 1 getinterval exch
    } repeat pop] def
    [
    {
        str length 0 eq {exit} if
        dict length dup {
            1 sub dup
            dict exch get
            str exch anchorsearch
            {
                1 index
                dup length 0 ne {0 1 getinterval} if concat 
                /dict exch push
                /str exch def
                exit 
            } {pop} ifelse
        } repeat
    } loop
    ]

    end
} def

(ABC)(AABAACABBA) lzw
{10 string cvs print( )print} forall
(\n)print
(be torn)(to be or not to be) lzw
{10 string cvs print( )print} forall

and

gs -dBATCH -q lzw.ps
0 0 1 3 2 4 5
3 4 2 0 1 2 4 5 2 6 4 3 2 7 9 1

Procedure expects 2 strings on stack and returns an array. Not sure about quality of my Postscript, but with little help of 'binarization' (binary tokens), compression (LZW 4 bytes more efficient than Deflate) and encoding, it can be golfed to something like that:

/l<~J.+mR+f\'O00,h`aW[%a_[;,+b\+;Ei?VkM,`<7VO9Ho+MT&&IW4:>B/Jt:Fkp+9b5W"\IAS74n7,WEV`lsSZ&A6g1)4'7^,a$SAD@f9&15SkO))[/:$Z^e$4N0:?`F7fk8MV#<6.'Dr'AOkP"!5De'2kXm`>,I(@^ERI/e\Tneg#0M3_7tOQWJ$Y0n%F-#bWHFEC6hWX'hBaBu?;D~>/LZWDecode filter 500 string readstring pop cvx def

And that's 267 bytes procedure of pure ASCII code -- still not the longest of the answers :-)

user2846289

Posted 2014-02-05T13:08:50.827

Reputation: 1 541

1

Extended BrainFuck: 490 bytes

I know I won't win here, but I love compression algorithms. It was fun writing, not too difficult and shorter than I imagined :)

{l[>>]}{k[<<]}{j&k +&l }{i]>>[-<<+>>]}{h,10-[22-[-}{g&k>>[->>]}{f>]3<[-3>+3<]}{e>+3>&l+<<&k}{d&k>[-3<+4>&l}{c 48+[-<+<+>>]<[.[-]<]10+.[-]}{a 10+<[->-[>+>>]>[+[-<+>]>+>>]5<]>[-]}{b 5<&k&f>>&g}192>&h 4<&j>>]<[-3<&k>+>&l>>+<]3<&g>>+[-<+>],10-]<[-<+>]<[->+>+<<]3>,32-[-5<&j 3>]5<&d 3>+&b 4>&h 6<&j 4>]<5<&d 4>+<&b>[-<+3>[->&e 3<&i 4<]<[->+<]4>+[-&e<<]>-[-<+>]3>&l<[-3>+4<&k<+3>&l<]3>[-3<+3>]4<&k+<[<<[-&i>-<]>[-<<[-4<+4>]4>-4<&a 4>+<&c<<[->+5>&l<+<&k 4<]>+[-<+&f 6>]>>&l<<[-<<]<<[-],10-]&a 3>&c

Usage:

$ bf ebf.bf < lzw.ebf > lzw.bf
$ bf lzw.bf <<eof
> ABC
> AABAACABBA
> eof
00
00
01
03
02
04
05

To support codes above 99 I need to add a stack structure. Doable, but since both test cases don't even use the second digit I though I'd stop here. Here's the resulting BrainFuck code (1096):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>,----------[----------------------[-<<<<[<<]+[>>
]>>]<[-<<<[<<]>+>[>>]>>+<]<<<[<<]>>[->>]>>+[-<+>],----------]<[-<+>]<[->+>+<<]>>
>,--------------------------------[-<<<<<[<<]+[>>]>>>]<<<<<[<<]>[-<<<+>>>>[>>]>>
>+<<<<<[<<]>]<<<[->>>+<<<]>>[<<]>>[->>]>>>>,----------[----------------------[-<
<<<<<[<<]+[>>]>>>>]<<<<<<[<<]>[-<<<+>>>>[>>]>>>>+<<<<<<[<<]>]<<<[->>>+<<<]>>[<<]
>>[->>]>[-<+>>>[->>+>>>[>>]+<<[<<]<<<]>>[-<<+>>]<<<<]<[->+<]>>>>+[->+>>>[>>]+<<[
<<]<<]>-[-<+>]>>>[>>]<[->>>+<<<<[<<]<+>>>[>>]<]>>>[-<<<+>>>]<<<<[<<]+<[<<[-]>>[-
<<+>>]>-<]>[-<<[-<<<<+>>>>]>>>>-<<<<++++++++++<[->-[>+>>]>[+[-<+>]>+>>]<<<<<]>[-
]>>>>+<++++++++++++++++++++++++++++++++++++++++++++++++[-<+<+>>]<[.[-]<]++++++++
++.[-]<<[->+>>>>>[>>]<+<[<<]<<<<]>+[-<+>]<<<[->>>+<<<]>>>>>>]>>[>>]<<[-<<]<<[-],
----------]++++++++++<[->-[>+>>]>[+[-<+>]>+>>]<<<<<]>[-]>>>+++++++++++++++++++++
+++++++++++++++++++++++++++[-<+<+>>]<[.[-]<]++++++++++.[-]

Keep in mind that I have aimed for short EBF code, not BF code. Here's the ungolfed EBF source code source code (3,5kB):

;; The data type for the lookup array
:lc the crumble for lookup array
:lv the value for this byte
:lz always zero

;; macros for lookup
;; move to open area
{to_lookup $lc(@lz)}

;; move back
{from_lookup $lz(@lc)}

{open_lookup (- &to_lookup+
                @lz &from_lookup)
}

{close_lookup &to_lookup
              $lz(-@lc)
}

;; makes a move to a higher 
;; crumble and increments it
{lookup_backup
   $lc@lz
   $lc+$lz@lc
}

;; restore backup
{lookup_restore
    $lc@lz
    $lc(-$lz@lc^0+$lc@lz)
    $lz@lc
}

;; opens the lookup with the 
;; index of the current register
;; and replaces it with the
;; code representing that 
{lookup_value
    &open_lookup
    &to_lookup
    $lv(- &lookup_backup
          &from_lookup
          ^0+
          &to_lookup)
    &lookup_restore
    &close_lookup
}

;; We have a possible alphabet of ASCII 32-126=94 and 
;; we need an empty as well *  2 slots/element = 190
190> @lz

;; The variables we use
:lec  lookup element count
:ndv  next dictionary value
:prev previos match
:cur  current index
:ax   general purpose register a

;; The trie tree data structure
:tz  the empty element
:tv  the value of this node
:tc  the crumble of this node

;; macros for the trie structure
{to_trie $tc(@tz)}

{from_trie $tz(@tc)}

;; makes a move to a higher 
;; crumble and increments it
{trie_backup
   $tc@tz
   $tc+$tz@tc
}

;; restore backup
{trie_restore
    $tc@tz
    $tc(-$tz@tc^0+$tc@tz)
    $tz@tc
}

{open_trie
  ;; we take prev times lec first
  ;; we use %lz as temporary for lec
  ;; and 
  $lec(-
        $lz+              ; backup lec
        $prev(-
               $ax+       ; backup prev 
               &to_trie+
               &from_trie
             )
        $ax(-$prev+)      ; restore backup
      )
  $lz(-$lec+)             ; restore lec
  $cur+(-
          $ax+ 
          &to_trie+
          &from_trie)
  $ax-(-$cur+)
}

{trie_close &to_trie $tz(-@tc)}


;; divmod divides ^0 with ^1
;; leaving remainder in ^2
;; and quotient in ^3
;; Needs up to ^5 for working area
{divmod
  (-^1-
      [^2+^4]
      ^5[*-3+[-^1+^2]^3+^5])
}

;; First we need to fill our lookup
;; we'll use ndv
$ndv,10-
( ; while not lf
  22-
  &open_lookup
  $lec(- &to_lookup $lv+
         &from_lookup $ndv+)
  &close_lookup
  $ndv+(-$lec+)
  %ndv , 10-
)

;; Now that we have all symbols we make a copy 
$lec(-$lz+)
$lz(-$lec+$ndv+)
#

$prev, 32- &lookup_value
$cur, 10-
(
  22-
  #&lookup_value
  #
  &open_trie

  ;; somehow if tv is we copy it to ax and reset prev
  ;; if prev is set we copy it to tv and 
  &to_trie
  $tv(-
        &trie_backup
        &from_trie
        $ax+
        &to_trie
     )
  &trie_restore
  &from_trie
  $tz+
  $ax( $prev(-)$ax(-$prev+)$tz-)
  $tz(-
      $cur(-$lz+) ; make copy of cur
      $tc-
      $cur 10+ $prev &divmod
      $cur(-)
      $tc+
      $tv 48+(-$tz+$ax+)
      $tz[.(-)<]@cur
      10+.(-)
      $ndv(- $prev+
             &to_trie $tv+
             &from_trie
          )
      $prev+(-$ndv+)
      $lz(-$prev+)           
   )
   &trie_close
   $cur(-), 10-
)
#$cur 10+ $prev &divmod
$cur(-)
$tv 48+(-$tz+$ax+)
$tz[.(-)<]@cur
10+.(-)

Sylwester

Posted 2014-02-05T13:08:50.827

Reputation: 3 678