Given an alphabet and a string, your job is to create the Lempel–Ziv–Welch compression of the string. Your implementation can either be a function with two parameters and a return value, or a full program that uses stdin and stdout.

Input

The alphabet, in the form of a string, from which you will have to create the initial dictionary: Each character should be mapped to its index in the string, where the first index is 0. You will be expanding this dictionary as you iterate through the algorithm.
The string to be compressed.

Output

The compressed string in the form of dictionary keys. This must be an array of integers if you're creating a function, or printed integers separated by either whitespace or newline (and nothing else) if you're making a full program.

Algorithm

After you have initialized the dictionary, the algorithm goes something like this:

Find the longest string W in the dictionary that matches the current input.
Emit the dictionary index for W to output and remove W from the input.
Add W followed by the next symbol in the input to the dictionary.
Repeat

Examples

"ABC", "AABAACABBA"  ==>  [0,0,1,3,2,4,5]
"be torn", "to be or not to be"  ==> [3,4,2,0,1,2,4,5,2,6,4,3,2,7,9,1]

Rules

No use of external resources
No predefined built-in functions or libraries of any kind that solves the problem for you.
Fore!

daniero

Posted 2014-02-05T13:08:50.827

Reputation: 17 193

Answers

GolfScript (52 51 chars)

{[\1/\{\.{.,3$<=},-1=.2$?@@,:L)[3$<]+@L>.}do;;]}:C;

Online demo

Dissection

{  }:C;

is standard function boilerplate.

# Wrap the values we compute in an array
[
    # Stack: alphabet uncompressed-string
    # Split the alphabet from a string into an array of 1-char strings
    \1/\
    # Stack: dictionary uncompressed-string
    {
        # Stack: ... dictionary uncompressed-string-suffix
        # Find the dictionary elements which are prefixes of uncompressed-string-suffix
        \.{.,3$<=},
        # The last of them must be the longest by construction
        -1=
        # Stack: ... uncompressed-string-suffix dictionary prefix
        # Find the index of prefix in dictionary
        .2$?
        # Stack: ... uncompressed-string-suffix dictionary prefix index
        # Push index down the stack
        @@
        # Stack: ... uncompressed-string-suffix index dictionary prefix
        # Assign len(prefix) to L and append a 1-char-longer prefix to dictionary
        ,:L)[3$<]+
        # Fetch up the uncompressed-string-suffix and chop off the first L chars
        @L>
        # Stack: ... index dictionary' uncompressed-string-suffix'
        # Duplicate uncompressed-string-suffix' for the do-loop condition test
        .
    }do
    # Stack: index0 index1 ... indexN dictionary' ""
    # Pop the last two to leave just the indexes
    ;;
]

Peter Taylor

Posted 2014-02-05T13:08:50.827

Reputation: 41 901

2I would probably upvote this if you provided an explanation of how it works :) – Timwi – 2014-02-07T23:11:54.377

Python (141 Characters)

def a(b,c,r=[]):
    if c:l,i,d=max((len(d),i,d) for i,d in enumerate(b) if c.find(d)==0)
    return a(list(b)+[c[:l+1]],c[l:],r+[i]) if c else r

Not golfed very small, but no chance of beating any of the golfscript solutions anyway.

schizodactyl

Posted 2014-02-05T13:08:50.827

Reputation: 161

Racket/R5RS Scheme: 359 bytes

(define(l d i)(let*((h(make-hash))(s string->list)(r(λ(x)(hash-ref h x #f)))(i(s i))(s(let l((s 0)(d(s d)))(if(null? d)s
(and(hash-set! h(car d)s)(l(+ s 1)(cdr d)))))))(let l((i(cdr i))(c(r(car i)))(o'())(s s))(if(null? i)(reverse(cons c o))
(let*((q(cons(car i)c))(n(r q)))(if n(l(cdr i) n o s)(and(hash-set! h q s)(l(cdr i)(r(car i))(cons c o)(+ s 1)))))))))

Usage:

(l "ABC" "AABAACABBA")             ; ==> (0 0 1 3 2 4 5)
(l "be torn" "to be or not to be") ; ==> (3 4 2 0 1 2 4 5 2 6 4 3 2 7 9 1)

Sylwester

Posted 2014-02-05T13:08:50.827

Reputation: 3 678

PHP 220

implemented as a function

<? function l($d,$s){@$n=strlen;@$r=substr;$d=str_split($d);$o=[];while($s){$l=0;foreach($d as$k=>$v){$b=$n($v);if($b<=$n($s)&&!strncmp($s,$v,$b)&&$b>$l){$l=$b;$i=$k;}}$o[]=$i;$d[]=$r($s,0,$l+1);$s=$r($s,$l);}return $o;}

Einacio

Posted 2014-02-05T13:08:50.827

Reputation: 436

The spec says that if you implement a function you should return the array: printing is for implementing a program. Correcting this actually saves you 3 chars. You can also save by removing the space after <? (TBH I'm not sure if the <? is actually required - maybe ask on meta?), and replacing if(foo)if(bar)if(baz) with if(foo&&bar&&baz). – Peter Taylor – 2014-02-05T20:03:00.423

@PeterTaylor i read wrong the bit about return. thanks for the correction and tips – Einacio – 2014-02-05T20:06:20.050

Perl, 188 177 156 145 137 112

sub l{($_,$s,@o)=@_;@d=split'';while($s){$s=~s/^$d[$_](.?)/push@o,$_;push@d,$&;$1/e&&last for reverse 0..$#d}@o}

i.e.

sub l {
    ($_,$s,@o)=@_;
    @d=split'';
    while($s){
        $s=~s/^$d[$_](.?)/push@o,$_;push@d,$&;$1/e
            && last for reverse 0..$#d
    }
    @o
}

Can be 109 if global @o is expected undefined or empty when entering sub. And yes, global variables are modified. Below is properly un-golfed version with lexicals.

use strict;
use warnings;

sub lzw {
    my ($alphabet, $string, @out) = @_;
    my @dict = split '', $alphabet;
    LOOP: while ($string) {
        for my $k (reverse 0..$#dict) {
            if ($string =~ s/^$dict[$k](.?)/$1/) {
                push @out, $k;
                push @dict, $&;
                next LOOP
            } 
        }
        die "we shouldn't be here!\n"
    }
    return @out
}

print qq/@{[lzw("ABC", "AABAACABBA")]}\n/;
print qq/@{[lzw("be torn", "to be or not to be")]}\n/;

perl lzw.pl
0 0 1 3 2 4 5
3 4 2 0 1 2 4 5 2 6 4 3 2 7 9 1

user2846289

Posted 2014-02-05T13:08:50.827

Reputation: 1 541

PHP: 176

Just making it return a function would require one byte less. It utilizes PHP arrays ($d)as Trie trees:

<? function l($d,$s){@$S=str_split;$o=[];$d=array_flip($S($d));foreach($S($s)AS$v){if(null===$t=@$d[$c.$v]){$d[$c.$v]=count($d);$o[]=$c;$c=$d[$v];}else$c=$t;}$o[]=$c;return$o;}

Usage:

print_r(l("ABC","AABAACABBA")) ; ==> array(0, 0, 1, 3, 2, 4, 5)

Sylwester

Posted 2014-02-05T13:08:50.827

Reputation: 3 678

SWI-Prolog, 313 244

The list of alphabet is no longer reversed. And it uses nth0 to brute force index + prefix pair.

b([],[]).
b([H|T],[[H]|R]):-b(T,R).
l(A,W,O):-b(A,B),r(B,W,O).
r(A,W,[I|T]):-m(A,W,I,E,0),append(E,R,W),(R=[N|_],!,append(E,[N],C),append(A,[C],B),r(B,R,T);T=[]).
m(A,W,I,E,N):-nth0(J,A,F),prefix(F,W),length(F,L),L>N,!,(m(A,W,I,E,L),!;I=J,E=F).

Usage:

l("be torn", "to be or not to be",O).
O = [3,4,2,0,1,2,4,5,2,6,4,3,2,7,9,1].

Old version

313 chars

Working on a reversed list of alphabet, which is convenient when I need to add new alphabet, but it is more roundabout to get the index.

This should do less work than the 244 chars version, but more code.

b([],O,O).
b([H|T],O,A):-b(T,O,[[H]|A]).
l(A,W,O):-b(A,B,[]),r(B,W,O).
r(A,W,[I|T]):-m(A,W,I,E,_),I>=0,append(E,R,W),(R=[N|_],!,append(E,[N],C),r([C|A],R,T);T=[]).
m([],_,-1,_,0).
m([H|T],W,I,E,M):-prefix(H,W),!,length(H,L),m(T,W,J,F,N),(N>L,!,M=N,I=J,E=F;M=L,length(T,D),I=D,E=H).
m([_|T],W,I,E,M):-m(T,W,I,E,M).

n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳

Posted 2014-02-05T13:08:50.827

Reputation: 5 683

GolfScript, 53 characters

n%~\1/\{.,,{)1$<2$?}%{)},-1=.p@.@=,:^)2$[<]+\^>.}do;;

The input must be given as two lines, first alphabet, second string to compress. The second example can be tested online.

Howard

Posted 2014-02-05T13:08:50.827

Reputation: 23 109

I'm currently working on a different approach which should be shorter - unfortunately not yet finished and I'm running out of time... – Howard – 2014-02-05T14:18:37.600

Our answers look quite similar apart from the identification of the longest known prefix. – Peter Taylor – 2014-02-05T15:40:48.897

Postscript, 388 267

It's my Perl answer re-written, just for fun:

/lzw {
    10 dict begin 
    /str exch def
    /alphabet exch def
    /concat
        {dup length 2 index length add string dup dup 4 index length 
        5 -1 roll putinterval 0 4 -1 roll putinterval}
    def
    /push
        {1 index load dup length /_n exch def 
        [exch aload pop _n 2 add -1 roll]def}
    def

    /dict [-1 alphabet length {
        1 add dup alphabet exch 1 getinterval exch
    } repeat pop] def
    [
    {
        str length 0 eq {exit} if
        dict length dup {
            1 sub dup
            dict exch get
            str exch anchorsearch
            {
                1 index
                dup length 0 ne {0 1 getinterval} if concat 
                /dict exch push
                /str exch def
                exit 
            } {pop} ifelse
        } repeat
    } loop
    ]

    end
} def

(ABC)(AABAACABBA) lzw
{10 string cvs print( )print} forall
(\n)print
(be torn)(to be or not to be) lzw
{10 string cvs print( )print} forall

and

gs -dBATCH -q lzw.ps
0 0 1 3 2 4 5
3 4 2 0 1 2 4 5 2 6 4 3 2 7 9 1

Procedure expects 2 strings on stack and returns an array. Not sure about quality of my Postscript, but with little help of 'binarization' (binary tokens), compression (LZW 4 bytes more efficient than Deflate) and encoding, it can be golfed to something like that:

/l<~J.+mR+f\'O00,h`aW[%a_[;,+b\+;Ei?VkM,`<7VO9Ho+MT&&IW4:>B/Jt:Fkp+9b5W"\IAS74n7,WEV`lsSZ&A6g1)4'7^,a$SAD@f9&15SkO))[/:$Z^e$4N0:?`F7fk8MV#<6.'Dr'AOkP"!5De'2kXm`>,I(@^ERI/e\Tneg#0M3_7tOQWJ$Y0n%F-#bWHFEC6hWX'hBaBu?;D~>/LZWDecode filter 500 string readstring pop cvx def

And that's 267 bytes procedure of pure ASCII code -- still not the longest of the answers :-)

user2846289

Posted 2014-02-05T13:08:50.827

Reputation: 1 541

Extended BrainFuck: 490 bytes

I know I won't win here, but I love compression algorithms. It was fun writing, not too difficult and shorter than I imagined :)

{l[>>]}{k[<<]}{j&k +&l }{i]>>[-<<+>>]}{h,10-[22-[-}{g&k>>[->>]}{f>]3<[-3>+3<]}{e>+3>&l+<<&k}{d&k>[-3<+4>&l}{c 48+[-<+<+>>]<[.[-]<]10+.[-]}{a 10+<[->-[>+>>]>[+[-<+>]>+>>]5<]>[-]}{b 5<&k&f>>&g}192>&h 4<&j>>]<[-3<&k>+>&l>>+<]3<&g>>+[-<+>],10-]<[-<+>]<[->+>+<<]3>,32-[-5<&j 3>]5<&d 3>+&b 4>&h 6<&j 4>]<5<&d 4>+<&b>[-<+3>[->&e 3<&i 4<]<[->+<]4>+[-&e<<]>-[-<+>]3>&l<[-3>+4<&k<+3>&l<]3>[-3<+3>]4<&k+<[<<[-&i>-<]>[-<<[-4<+4>]4>-4<&a 4>+<&c<<[->+5>&l<+<&k 4<]>+[-<+&f 6>]>>&l<<[-<<]<<[-],10-]&a 3>&c

Usage:

$ bf ebf.bf < lzw.ebf > lzw.bf
$ bf lzw.bf <<eof
> ABC
> AABAACABBA
> eof
00
00
01
03
02
04
05

To support codes above 99 I need to add a stack structure. Doable, but since both test cases don't even use the second digit I though I'd stop here. Here's the resulting BrainFuck code (1096):

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>,----------[----------------------[-<<<<[<<]+[>>
]>>]<[-<<<[<<]>+>[>>]>>+<]<<<[<<]>>[->>]>>+[-<+>],----------]<[-<+>]<[->+>+<<]>>
>,--------------------------------[-<<<<<[<<]+[>>]>>>]<<<<<[<<]>[-<<<+>>>>[>>]>>
>+<<<<<[<<]>]<<<[->>>+<<<]>>[<<]>>[->>]>>>>,----------[----------------------[-<
<<<<<[<<]+[>>]>>>>]<<<<<<[<<]>[-<<<+>>>>[>>]>>>>+<<<<<<[<<]>]<<<[->>>+<<<]>>[<<]
>>[->>]>[-<+>>>[->>+>>>[>>]+<<[<<]<<<]>>[-<<+>>]<<<<]<[->+<]>>>>+[->+>>>[>>]+<<[
<<]<<]>-[-<+>]>>>[>>]<[->>>+<<<<[<<]<+>>>[>>]<]>>>[-<<<+>>>]<<<<[<<]+<[<<[-]>>[-
<<+>>]>-<]>[-<<[-<<<<+>>>>]>>>>-<<<<++++++++++<[->-[>+>>]>[+[-<+>]>+>>]<<<<<]>[-
]>>>>+<++++++++++++++++++++++++++++++++++++++++++++++++[-<+<+>>]<[.[-]<]++++++++
++.[-]<<[->+>>>>>[>>]<+<[<<]<<<<]>+[-<+>]<<<[->>>+<<<]>>>>>>]>>[>>]<<[-<<]<<[-],
----------]++++++++++<[->-[>+>>]>[+[-<+>]>+>>]<<<<<]>[-]>>>+++++++++++++++++++++
+++++++++++++++++++++++++++[-<+<+>>]<[.[-]<]++++++++++.[-]

Keep in mind that I have aimed for short EBF code, not BF code. Here's the ungolfed EBF source code source code (3,5kB):

;; The data type for the lookup array
:lc the crumble for lookup array
:lv the value for this byte
:lz always zero

;; macros for lookup
;; move to open area
{to_lookup $lc(@lz)}

;; move back
{from_lookup $lz(@lc)}

{open_lookup (- &to_lookup+
                @lz &from_lookup)
}

{close_lookup &to_lookup
              $lz(-@lc)
}

;; makes a move to a higher 
;; crumble and increments it
{lookup_backup
   $lc@lz
   $lc+$lz@lc
}

;; restore backup
{lookup_restore
    $lc@lz
    $lc(-$lz@lc^0+$lc@lz)
    $lz@lc
}

;; opens the lookup with the 
;; index of the current register
;; and replaces it with the
;; code representing that 
{lookup_value
    &open_lookup
    &to_lookup
    $lv(- &lookup_backup
          &from_lookup
          ^0+
          &to_lookup)
    &lookup_restore
    &close_lookup
}

;; We have a possible alphabet of ASCII 32-126=94 and 
;; we need an empty as well *  2 slots/element = 190
190> @lz

;; The variables we use
:lec  lookup element count
:ndv  next dictionary value
:prev previos match
:cur  current index
:ax   general purpose register a

;; The trie tree data structure
:tz  the empty element
:tv  the value of this node
:tc  the crumble of this node

;; macros for the trie structure
{to_trie $tc(@tz)}

{from_trie $tz(@tc)}

;; makes a move to a higher 
;; crumble and increments it
{trie_backup
   $tc@tz
   $tc+$tz@tc
}

;; restore backup
{trie_restore
    $tc@tz
    $tc(-$tz@tc^0+$tc@tz)
    $tz@tc
}

{open_trie
  ;; we take prev times lec first
  ;; we use %lz as temporary for lec
  ;; and 
  $lec(-
        $lz+              ; backup lec
        $prev(-
               $ax+       ; backup prev 
               &to_trie+
               &from_trie
             )
        $ax(-$prev+)      ; restore backup
      )
  $lz(-$lec+)             ; restore lec
  $cur+(-
          $ax+ 
          &to_trie+
          &from_trie)
  $ax-(-$cur+)
}

{trie_close &to_trie $tz(-@tc)}


;; divmod divides ^0 with ^1
;; leaving remainder in ^2
;; and quotient in ^3
;; Needs up to ^5 for working area
{divmod
  (-^1-
      [^2+^4]
      ^5[*-3+[-^1+^2]^3+^5])
}

;; First we need to fill our lookup
;; we'll use ndv
$ndv,10-
( ; while not lf
  22-
  &open_lookup
  $lec(- &to_lookup $lv+
         &from_lookup $ndv+)
  &close_lookup
  $ndv+(-$lec+)
  %ndv , 10-
)

;; Now that we have all symbols we make a copy 
$lec(-$lz+)
$lz(-$lec+$ndv+)
#

$prev, 32- &lookup_value
$cur, 10-
(
  22-
  #&lookup_value
  #
  &open_trie

  ;; somehow if tv is we copy it to ax and reset prev
  ;; if prev is set we copy it to tv and 
  &to_trie
  $tv(-
        &trie_backup
        &from_trie
        $ax+
        &to_trie
     )
  &trie_restore
  &from_trie
  $tz+
  $ax( $prev(-)$ax(-$prev+)$tz-)
  $tz(-
      $cur(-$lz+) ; make copy of cur
      $tc-
      $cur 10+ $prev &divmod
      $cur(-)
      $tc+
      $tv 48+(-$tz+$ax+)
      $tz[.(-)<]@cur
      10+.(-)
      $ndv(- $prev+
             &to_trie $tv+
             &from_trie
          )
      $prev+(-$ndv+)
      $lz(-$prev+)           
   )
   &trie_close
   $cur(-), 10-
)
#$cur 10+ $prev &divmod
$cur(-)
$tv 48+(-$tz+$ax+)
$tz[.(-)<]@cur
10+.(-)

Sylwester

Posted 2014-02-05T13:08:50.827

Reputation: 3 678

Golf an LZW encoder

Input

Output

Algorithm

Examples

Rules

Answers

GolfScript (52 51 chars)

Dissection

Python (141 Characters)

Racket/R5RS Scheme: 359 bytes

PHP 220

Perl, 188 177 156 145 137 112

PHP: 176

SWI-Prolog, 313 244

Old version

GolfScript, 53 characters

Postscript, 388 267

Extended BrainFuck: 490 bytes