13
Or else he will huff and puff and blow your house down!
That was completely irrelevant. This challenge is actually about Huffman coding. The gist of it is the frequency of characters in a given text is utilized to make its representation shorter. In other words, let's say that our alphabet is a
through z
and space. That's 27 characters. Each of them can be uniquely encoded in just 5 bits because 5 bits have enough room for 32 characters. However, in many situations (like English, or languages in general), some characters are more frequent than others. We can use fewer bits for the more frequent characters and (perhaps) more bits for the less frequent characters. Done right, there is an overall savings in the number of bits and the original text can still be uniquely reconstructed.
Let's take "this question is about huffman coding" as an example. This text is 37 characters long, which would be 37*8 = 296 bits normally, though only 37*5 = 185 bits if we only use 5 bits for each character. Keep that in mind.
Here is a (sorta) table of each character and their frequencies in the text, sorted from most to least frequent (where _ stands in for a space):
_ 5
i 4
n 3
o 3
s 3
t 3
u 3
a 2
f 2
h 2
b 1
c 1
d 1
e 1
g 1
m 1
q 1
An associated optimal coding could be:
_ 101
i 011
n 1100
o 1101
s 1110
t 1111
u 001
a 10011
f 0001
h 0101
b 00000
c 00001
d 01000
e 01001
g 10000
m 10001
q 10010
It should be immediately clear that this will be a better encoding than just using 5 bits for every character. Let's find out just how much better, though!
145 bits, compared with 185! That's a savings of 40 bits, or just over 20%! (This is, of course, assuming that information about the structure is available for decoding.) This coding is optimal because no more bits can be dropped by changing any character's representation.
The task
- Write a program or function with one parameter that...
- Takes input from STDIN (or equivalent) or as a single argument.
- Output an optimal Huffman coding as above with the characters sorted by frequency (order within a frequency class doesn't matter).
- You may assume that the characters in the input are restricted to the ASCII range
32..126
plus a newline. - You may assume the input is no longer than 10,000 characters (ideally, in theory, input should be unbounded).
- Your code should finish reasonably fast. The given example above should take no more than a minute or so at worst. (This is intended to rule out brute force.)
- Scoring is in bytes.
Examples
x
---
x 0
xxxxxxxxx
---
x 0
xxxxxxxxy
---
x 0
y 1 (these may be swapped)
xxxxxyyyz
---
x 0
y 10
z 11
uuvvwwxxyyzz
--- (or)
u 000 000
v 001 001
w 100 010
x 101 011
y 01 10
z 11 11
this question is about huffman coding
---
101
i 011
n 1100
o 1101
s 1110
t 1111
u 001
a 10011
f 0001
h 0101
b 00000
c 00001
d 01000
e 01001
g 10000
m 10001
q 10010
Happy coding!
Note that this similar question is closely related, even to the point that this one is a duplicate. However, the consensus so far on Meta is that the older one should be considered a duplicate of this one.
1I disagree with your note: it's the same question which for the existing answers requires a simple transformation of output format, and moreover any answer to this question is automatically an answer to the previous question. – Peter Taylor – 2015-10-27T07:00:16.993
@PeterTaylor: I'd like to petition again that you reopen this question. The specification in this one is better (as said by Martin) and I want to see newer, better answers, including Pyth and CJam answers. I think there's no harm in leaving both questions open because they are sufficiently different. Only two of five users who posted on that question have been on this site recently. – El'endia Starman – 2015-10-28T04:04:25.443
@PeterTaylor: Also, going by this standard, I'd like to say that I don't think answers could be copied between questions and remain competitive. Finally, the other question is four years old. It'd be good to have a fresh version.
– El'endia Starman – 2015-10-28T04:08:51.997In your example of
this question is about huffman coding
, I counted the number of bits to be 145, not 136. – TFeld – 2015-10-29T12:23:38.0971
I was really trying to complete this challenge in Spoon, but after 2 hours of brainfuckery I decided it would be best to give up...
– Bassdrop Cumberwubwubwub – 2015-10-29T15:08:12.490I also got 145 bits, not 139 bits. – isaacg – 2015-10-29T18:07:17.810
@TFeld: You're right. I've fixed it. – El'endia Starman – 2015-10-29T19:41:54.723
@isaacg: You're right, my mistake. – El'endia Starman – 2015-10-29T19:42:19.573
Employing a suitable strategy for transmitting the dictionary along with the compressed data is an important and often over-looked aspect of compression. A poor strategy ruins many implementations.....
Also, how on earth can something be deemed a duplicate of something created many years later? – enhzflep – 2015-10-30T21:46:30.143
@enhzflep: Read the linked Meta thread. – El'endia Starman – 2015-10-30T21:58:03.337
@El'endiaStarman - Ah, gotcha. It's the only way to use the limited tools of the SO network to achieve the desired result. I agree that this is the better formulated challenge. The anal part of me simply has a problem with closing "as a duplicate" along with the text "already has an answer here". I guess the chosen option (dup) is the one that fits least badly. – enhzflep – 2015-10-31T03:59:54.513