Code the Huffman!

13

Or else he will huff and puff and blow your house down!

That was completely irrelevant. This challenge is actually about Huffman coding. The gist of it is the frequency of characters in a given text is utilized to make its representation shorter. In other words, let's say that our alphabet is a through z and space. That's 27 characters. Each of them can be uniquely encoded in just 5 bits because 5 bits have enough room for 32 characters. However, in many situations (like English, or languages in general), some characters are more frequent than others. We can use fewer bits for the more frequent characters and (perhaps) more bits for the less frequent characters. Done right, there is an overall savings in the number of bits and the original text can still be uniquely reconstructed.

Let's take "this question is about huffman coding" as an example. This text is 37 characters long, which would be 37*8 = 296 bits normally, though only 37*5 = 185 bits if we only use 5 bits for each character. Keep that in mind.

Here is a (sorta) table of each character and their frequencies in the text, sorted from most to least frequent (where _ stands in for a space):

_ 5
i 4
n 3
o 3
s 3
t 3
u 3
a 2
f 2
h 2
b 1
c 1
d 1
e 1
g 1
m 1
q 1

An associated optimal coding could be:

_ 101
i 011
n 1100
o 1101
s 1110
t 1111
u 001
a 10011
f 0001
h 0101
b 00000
c 00001
d 01000
e 01001
g 10000
m 10001
q 10010

It should be immediately clear that this will be a better encoding than just using 5 bits for every character. Let's find out just how much better, though!

145 bits, compared with 185! That's a savings of 40 bits, or just over 20%! (This is, of course, assuming that information about the structure is available for decoding.) This coding is optimal because no more bits can be dropped by changing any character's representation.

The task

  • Write a program or function with one parameter that...
  • Takes input from STDIN (or equivalent) or as a single argument.
  • Output an optimal Huffman coding as above with the characters sorted by frequency (order within a frequency class doesn't matter).
  • You may assume that the characters in the input are restricted to the ASCII range 32..126 plus a newline.
  • You may assume the input is no longer than 10,000 characters (ideally, in theory, input should be unbounded).
  • Your code should finish reasonably fast. The given example above should take no more than a minute or so at worst. (This is intended to rule out brute force.)
  • Scoring is in bytes.

Examples

x
---
x 0

xxxxxxxxx
---
x 0

xxxxxxxxy
---
x 0
y 1 (these may be swapped)

xxxxxyyyz
---
x 0
y 10
z 11

uuvvwwxxyyzz
---   (or) 
u 000      000
v 001      001
w 100      010
x 101      011
y 01       10
z 11       11

this question is about huffman coding
---
  101
i 011
n 1100
o 1101
s 1110
t 1111
u 001
a 10011
f 0001
h 0101
b 00000
c 00001
d 01000
e 01001
g 10000
m 10001
q 10010

Happy coding!


Note that this similar question is closely related, even to the point that this one is a duplicate. However, the consensus so far on Meta is that the older one should be considered a duplicate of this one.

El'endia Starman

Posted 2015-10-27T05:58:59.537

Reputation: 14 504

1I disagree with your note: it's the same question which for the existing answers requires a simple transformation of output format, and moreover any answer to this question is automatically an answer to the previous question. – Peter Taylor – 2015-10-27T07:00:16.993

@PeterTaylor: I'd like to petition again that you reopen this question. The specification in this one is better (as said by Martin) and I want to see newer, better answers, including Pyth and CJam answers. I think there's no harm in leaving both questions open because they are sufficiently different. Only two of five users who posted on that question have been on this site recently. – El'endia Starman – 2015-10-28T04:04:25.443

@PeterTaylor: Also, going by this standard, I'd like to say that I don't think answers could be copied between questions and remain competitive. Finally, the other question is four years old. It'd be good to have a fresh version.

– El'endia Starman – 2015-10-28T04:08:51.997

In your example of this question is about huffman coding, I counted the number of bits to be 145, not 136. – TFeld – 2015-10-29T12:23:38.097

1

I was really trying to complete this challenge in Spoon, but after 2 hours of brainfuckery I decided it would be best to give up...

– Bassdrop Cumberwubwubwub – 2015-10-29T15:08:12.490

I also got 145 bits, not 139 bits. – isaacg – 2015-10-29T18:07:17.810

@TFeld: You're right. I've fixed it. – El'endia Starman – 2015-10-29T19:41:54.723

@isaacg: You're right, my mistake. – El'endia Starman – 2015-10-29T19:42:19.573

Employing a suitable strategy for transmitting the dictionary along with the compressed data is an important and often over-looked aspect of compression. A poor strategy ruins many implementations.....

Also, how on earth can something be deemed a duplicate of something created many years later? – enhzflep – 2015-10-30T21:46:30.143

@enhzflep: Read the linked Meta thread. – El'endia Starman – 2015-10-30T21:58:03.337

@El'endiaStarman - Ah, gotcha. It's the only way to use the limited tools of the SO network to achieve the desired result. I agree that this is the better formulated challenge. The anal part of me simply has a problem with closing "as a duplicate" along with the text "already has an answer here". I guess the chosen option (dup) is the one that fits least badly. – enhzflep – 2015-10-31T03:59:54.513

Answers

2

Pyth, 53 bytes

jo_/zhNee.WtHa>JohNZ2+shKC<J2]s.b+RYNeKU2m,/zd]+d\ {z

Demonstration

Here's a version that shows the internal state, so you can see the encoding being built:

jo_/zhNee.WtHvp+`a>JohNZ2+shKC<J2]s.b+RYNeKU2bm,/zd]+d\ {z

Demonstration

Copy the output to an environment with wider lines for a clearer picture.

isaacg

Posted 2015-10-27T05:58:59.537

Reputation: 39 268

4

Python 2, 299 bytes

Here's my attempt at an answer.

The Huffman codes are different from the examples given, but should still be optimal.

i=raw_input();m=n=[(c,i.count(c))for c in set(i)]
while n[1:]:n.sort(key=lambda x:(x[1]));(a,b),(c,d)=n[:2];n=[((a,c),b+d)]+n[2:]
n=n[0][0]
r=[]
def a(b,s):
 if b[1:]:a(b[0],s+'0');a(b[1],s+'1')
 else:r.append(b+(s if s[1:]else s+'0'))
a(n,' ')
for y in sorted(r,key=lambda x:-dict(m)[x[0]]):print y

TFeld

Posted 2015-10-27T05:58:59.537

Reputation: 19 246

2

Matlab, 116 bytes

tabulate makes a frequency table. huffmandict takes a list of symbols and probabilities for each symbol, and calculates the code.

t=tabulate(input('')');
d=huffmandict(t(:,1),cell2mat(t(:,3))/100);
for i=1:size(d,1);disp([d{i,1},' ',d{i,2}+48]);end

flawr

Posted 2015-10-27T05:58:59.537

Reputation: 40 560

2

Ruby, 189 180 bytes

Work in progress.

->s{m=s.chars.uniq.map{|c|[c,s.count(c)]}
while m[1]
(a,x),(b,y),*m=m.sort_by &:last
m<<[[a,b],x+y]
end
h={}
f=->q="",c{Array===c&&f[q+?0,c[0]]&&f[q+?1,c[1]]||h[c]=q}
f[m[0][0]]
h}

It's an anonymous function; assign it to something, for example f, and call it with

f["some test string"]`

which returns a hash like this:

{"t"=>"00", "g"=>"0100", "o"=>"0101", " "=>"011", "e"=>"100", "n"=>"1010", "i"=>"1011", "m"=>"1100", "r"=>"1101", "s"=>"111"}

daniero

Posted 2015-10-27T05:58:59.537

Reputation: 17 193

1

K (ngn/k), 78 bytes

{h::0#'x;(#1_){{h[x],:!2;y,,,/x}.0 2_x@<#'x}/.=x;(?,/'x,'" ",'|'$h)(?x)?>#'=x}

Try it online!

returns a list of strings for printing

h::0#'x creates an empty list for each character (technically, it reshapes each character to length 0). we will store the reversed huffman codes there. we use :: instead of : for assignment to make h global so it's visible in sub-functions.

.=x is a list of lists - the indices of the string grouped by character value

(#1_) is a function that returns truthy iff the length of the argument is >1 (technically "length of 1 drop of ...")

(#1_){...}/ means: while the argument has length >1, keep applying the curly-brace function

x@<#'x sort the argument by length

0 2_ cut it into a 2-element head and a tail

{h[x],:!2;y,,,/x} update h by appending 0 and 1 to the indices contained in the head; return the tail with the head as a single element

(?,/'x,'" ",'|'$h)(?x)?>#'=x reverse each of h, sort, unique, prepend corresponding characters, and format nicely

ngn

Posted 2015-10-27T05:58:59.537

Reputation: 11 449

1

Haskell, 227 bytes

import Data.List
s=sortOn.(length.)
f x|[c]<-nub x=[(c,"0")]|1<2=g[(a,[(a!!0,"")])|a<-group$sort x]
g=h.s fst
h[x]=snd x
h((a,b):(c,d):e)=g$(a++c,map('0'#)b++map('1'#)d):e
n#(a,b)=(a,n:b)
p=unlines.map(\(a,b)->a:" "++b).s snd.f

Usage example:

*Main> putStr $ p "this question is about huffman coding"
u 000
i 011
  101
a 0010
f 0011
h 1000
s 1100
t 1101
n 1110
o 1111
d 01000
e 01001
b 01010
c 01011
q 10010
g 100110
m 100111

How it works:

p calls f which builds the Huffman table in the form of a list of (character,encoding)-pairs, e.g. [ ('a',"0"), ('b',"1") ], sorts the table by length of encodings, formats each pair for output and joins with newlines in-between.

f first checks the single letter case and returns the corresponding table. Otherwise it sorts the input string and groups sequences of equal characters (e.g. "ababa" -> ["aaa","bb"]) and maps them to pairs (sequence , [(char, "")]), (-> [ ("aaa", [('a',"")]), ("bb", [('b', "")])]. The first element is used to keep track of the frequency, the second element is a list of pairs of a character and it's encoding (which is initially empty). These are all single element Huffman tables as expected by p and are combined by g and h.

g sorts the list of pairs on the length of the first element, i.e. the frequency and calls h. h combines the Huffman tables of the first two elements, by concatenating the frequencies and putting a 0 (1) in front of every element of the first (second) table. Concatenate both tables. Call g again, stop when there's a single element left, throw away the frequency part and return the full Huffman table.

nimi

Posted 2015-10-27T05:58:59.537

Reputation: 34 639

0

JavaScript (ES6) 279

Essentially, the basic algorithm from Wikipedia. I probably can do better.

f=s=>{for(n=[...new Set(s)].map(c=>({c:c,f:[...s].filter(x=>x==c).length}));n[1];n.push({l:a=n.pop(),r:b=n.pop(),f:a.f+b.f,c:a.c+b.c}))n.sort((a,b)=>b.f-a.f);t=(n,b)=>(n.b=b,n.r)?(t(n.l,b+0),t(n.r,b+1)):o.push(n);t(n[0],'',o=[]);return o.sort((a,b)=>b.f-a.f).map(x=>x.c+' '+x.b)}

More readable inside the snippet below

f=s=>{
  for(n=[...new Set(s)].map(c=>({c:c,f:[...s].filter(x=>x==c).length}));
      n[1];
      n.push({l:a=n.pop(),r:b=n.pop(),f:a.f+b.f,c:a.c+b.c}))
    n.sort((a,b)=>b.f-a.f);
  t=(n,b)=>(n.b=b,n.r)?(t(n.l,b+0),t(n.r,b+1)):o.push(n);
  t(n[0],'',o=[]);
  return o.sort((a,b)=>b.f-a.f).map(x=>x.c+' '+x.b)
}

//TEST
console.log=x=>O.innerHTML+=x+'\n'

test=['xxxxxxxxy','uuvvwwxxyyzz','this question is about huffman coding']
.forEach(t=>console.log(t+'\n'+f(t).join`\n`+'\n'))
<pre id=O></pre>

edc65

Posted 2015-10-27T05:58:59.537

Reputation: 31 086