Golfing Strings

22

1

I have always failed to give an answer for challenges which require string compression, the main reason being that I don't know to use string compression tools as effectively as I should.

For this reason, I have posted this question. Unlike my other tips questions, this is not language specific meaning that if you can think of any tips in your own language, then you may post it (providing that you specify the language). General tips are also appreciated.

So, how can I use string compression tools to their maximum effectiveness?

Beta Decay

Posted 2015-10-01T06:08:20.100

Reputation: 21 478

Answers

9

Base conversion (CJam)

An easy way to encode ASCII strings that do not start with a null byte is to convert from base 128 to integer, then to base 256:

128b256b:c              e# Prints encoded string.
128b256b:c`"256b128b:c" e# Prints encoded string with decoder.

This uses 7 bits to encode each ASCII character.

If the original string consists only of, e.g., lowercase letters, and does no start with an a, we can start by mapping "a...z" to [0 ... 25], then proceed as above:

'afm26b256b:c               e# Prints encoded string.
'afm26b256b:c`"256b26b'af+" e# Prints encoded string with decoder.

Finally, if the original string has only a few unique characters (common in ASCII art), it is usually better to specify the alphabet explicitly.

For example:

" +-/\|"f#6b256b:c                       e# Prints encoded string.
" +-/\|"f#6b256b:c`"256b6b"" +-/\|"`"f=" e# Prints encoded string with decoder.

As a rule of thumb, you want the first character of the original string to be the second character of the alphabet, the next distinct character of the original string to be the first character of the alphabet, the next distinct character of the original string to be the third character of the alphabet, the next distinct character of the original string to be the fourth character of the alphabet, etc.

The encoder of last example works as follows:

" +-/\|"f# e# Replace each character by its index in that string.
6b256b     e# Convert from base 6 (length of the alphabet) to base 256.
:c         e# Cast each digit to character.

The decoder of last example works as follows:

256b6b     e# Convert from base 256 to base 6.
" +-/\|"f= e# Replace each digit by the corresponding character of the alphabet.

Dennis

Posted 2015-10-01T06:08:20.100

Reputation: 196 637

2I would be more specific: as a rule of thumb you want the first character of the original string to be the second character of the alphabet, the next distinct character of the original string to be the first character of the alphabet, ... – Peter Taylor – 2015-10-01T07:24:48.140

@PeterTaylor Added. Thanks! – Dennis – 2015-10-01T07:30:08.780

9

Larger Kolmogorov complexity questions with some structure but no simple formula (e.g. song lyrics) will typically benefit from a grammar-based approach. In essence, you extract repeated substrings and encode them somehow. This is what Lempel-Ziv does, using a fairly restricted class of grammars; if you use more general grammars then you have to figure out how to encode the rules. E.g. one approach here is "offset encoding", where you offset each source byte by the number of rules (n), assign bytes 1 to n to the rules, use the 0 byte to separate rules, and repeatedly replace byte i with the evaluated rule i. Finally you undo the offset by subtracting n from each byte.

I have actually written a Java program which implements various approaches:

Most of the approaches follow a two-phase process. In the first phase the string is converted into a grammar which generates it; in the second phase, the grammar is converted into a GolfScript program. The first phase implementations are largely based on Charikar, Lehman, Liu, Panigrahy, Prabhakaran, Sahai, & Shelat (2005) The smallest grammar problem, Information Theory, IEEE Transactions on, 51(7), 2554-2576.

It also includes a Lempel-Ziv approach, a base encoding approach, and a runlength-encoding approach, and identifies the one which gives the shortest program.

Peter Taylor

Posted 2015-10-01T06:08:20.100

Reputation: 41 901

0

Stax

In the Stax code golfing language, there's a helpful little tool called the string literal compressor. I don't know how it works, exactly, but there is another where I do know how it works. It converts strings into numbers, then into Base 256. It's CP437, with 0x00 and 0xFF converted for copying. It's PackedStax. You could convert your strings with the string literal compressor then Pack it, for some good compression.

Using this process the string "This string is thirty-two bytes" can be converted to v*"A]-|W4]}3"% (the compressed string is usually surrounded by backticks to tell the difference between a normal string in Stax) and finally to üvìë![┴╩qJu←▓α for a compression / reduction of 18 bytes, more than half.

Ethan Slota

Posted 2015-10-01T06:08:20.100

Reputation: 85