Palindrome Compression

15

1

Challenge

Write a program that compresses and decompresses ASCII text losslessly. It should be specialized to work well with palindromes, including case-insensitive and punctuation-insensitive palindromes. The best compression with the smallest source wins.

Scoring

total_bytes_saved / sqrt(program_size) - Highest score wins

total_bytes_saved is how many bytes smaller the compressed strings are than the originals, total across the test cases below. program_size is the size in bytes of the source code of both the compression and decompression programs. Code shared between the two need only be counted once.

For instance, if there are 10 test cases and a 100 byte program saved 5 bytes on 7 test cases, 10 each on 2 of them, but the last test case was 2 bytes longer, the solution would score 5.3. ((7 * 5 + 10 * 2 - 2) / sqrt(100) = 5.3)

Test Cases

  • tacocat
  • toohottohoot
  • todderasesareddot
  • amanaplanacanalpanama
  • wasitacaroracatisaw?
  • Bob
  • IManAmRegalAGermanAmI
  • DogeeseseeGod
  • A Santa at NASA
  • Go hang a salami! I'm a lasagna hog.

Rules

  1. Standard loopholes apply.
  2. The compression must work on all printable ASCII (bytes 32-126, inclusive) text strings, not just palindromes. It doesn't actually have to save space for any inputs, however.
  3. Output can be any sequence of bytes or characters, regardless of its implementation or internal representation (strings, lists, and arrays are all fair game, for instance). If encoding to UTF-8, count bytes, not characters. Wide strings (e.g. UTF-16 or UTF-32) are not allowed unless the only codepoints possibly used are between 0 and 255.
  4. Compression/Decompression builtins are not allowed.

For the sake of our own enjoyment, post the compressed strings with your source code.

UPDATE 1: Scoring changed from total_bytes_saved / program_size to total_bytes_saved / sqrt(program_size) in order to give more weight to better compression and less weight to aggressive golfing. Adjust your scores accordingly.

UPDATE 2: fixed wasitacaroraratisaw? to be wasitacaroracatisaw?

Beefster

Posted 2018-01-29T23:15:47.580

Reputation: 6 651

2If case, punctuation and spacing is removed from the the input, is it guaranteed that inputs will be strict palindromes? Edit: nevermind - I see the wasitacaroraratisaw? is a counterexample to that – Digital Trauma – 2018-01-29T23:57:57.370

2Which range of ASCII characters are we supposed to support in the input? Is it [32-126]? – Arnauld – 2018-01-30T07:14:11.297

1Yeah I don't think the 1000 * part is really needed, and no I don't think it will make the score feel more "satisfying" ;) – Erik the Outgolfer – 2018-01-30T11:31:37.093

1Can we use compression/decompression built-ins? – Lynn – 2018-01-30T13:37:54.140

@Arnauld: I want to say 0-127, but just the printables sounds fine. Gives you an extra 32 values you can use for compression control codes. So yeah, [32-126] – Beefster – 2018-01-30T19:06:55.420

@Lynn I'm going to say no on that. – Beefster – 2018-01-30T19:09:25.660

@DLosc: I didn't realize I didn't have an even-length palindrome in there. Good catch. (if only someone had noticed in the sandbox) – Beefster – 2018-01-31T00:14:54.940

4With so few inputs, there is not much scope for doing anything clever. It would be nice to have at least a few times more. – user1502040 – 2018-01-31T08:16:13.120

1Here are a few more test cases I think would be a good addition–two longer all-letter palindromes, a long mixed palindrome, a very long, very mixed palindrome, and a short non-palindrome (to ensure entries can work on non-palindromes). – ETHproductions – 2018-01-31T15:31:12.813

I'm thinking I need to change the scoring to give more weight to the bytes saved. Maybe I should sqrt the program length... – Beefster – 2018-01-31T18:11:51.410

I like this scoring system much better. (Also, I only noticed now that wasitacaroraratisaw? is not a palindrome... sneaky sneaky!) – ETHproductions – 2018-01-31T19:20:52.153

@ETHproductions Oh. You're right. I actually missed that. It actually should be wasitacaroracatisaw? I remembered the palindrome wrong. – Beefster – 2018-01-31T20:08:20.290

Incidentally, DigitalTrauma noticed that one and I didn't realize it wasn't actually a palindrome even after he pointed it out. – Beefster – 2018-01-31T20:12:14.413

@Beefster Oh :P Compressed 6 more bytes for free, thanks! – ETHproductions – 2018-01-31T23:31:56.443

1A man, a plan, a God's 'nam. Tables, nitrate, tar, tinsel, Batman's dog: Anal Panama -- 1632 – SIGSTACKFAULT – 2018-02-01T22:02:22.897

@Blacksilver my own solution saved 22 bytes on that one. – Beefster – 2018-02-01T23:38:37.390

Answers

16

JavaScript (ES6), 3.143 (81 bytes saved, 664 byte program)

R='replace',S=String.fromCharCode,T=c=>c.charCodeAt(),U='toUpperCase',V='0000000',W=(a,b,c=2)=>a.toString(c).slice(b),X=x=>'0b'+x,Y=a=>[...a].reverse().join``,Z=/[^]/g
C=s=>S(...((Y(q=s[U]()[R](/[^A-Z]/g,m=''))==q?(q=q.slice(0,p=-~q.length/2),p%1&&10):11)+q[R](Z,x=>W(T(x),2))+111+s[R](Z,c=>/[a-z]/.test(c)?W("00",m,m=1):m+(/[A-Z]/.test(c,m='')?"01":W(c<'!'?2:T(c)+384)))+V).match(/(?!0+$).{8}/g).map(X))
D=s=>{s=s[R](Z,c=>W(256+T(c),1))+V;M=r=>(s=s[R](p=s.match(`^${r}|`)[0],''),p);for([,a]=M`1.|0`,t=u=i='';!M`111`;)t+=W(X(M`.{5}`)-~8,0,36);for(t+=W(Y(t),a?a/0:1);p;)u+=M`0(?=00)|00?1`?(c=t[i++])?+p[1]?c[U]():c:'':M`10`?' ':M`11`&&S(X(M`.{7}`));return u+W(t,i)}

Now that I'm fairly satisfied with this program (and the scoring system), I'll write a bit of an explanation.

The basic idea is to compress the input into a string of bits, then compress each set of 8 bits into a byte. For the purposes of explanation, I'll just manipulate the bit string.

The bit string can be separated into several sections:

input  -> Taco Cat.
output -> 0101000000100011011111110100001100100011101011100000000

0      | 10100 00001 00011 01111 111 | 01 00001 10 01 0001 110101110
header | letter data                 | styling data

The header is a very simple mapping:

0  -> odd-length palindrome
10 -> even-length palindrome
11 -> non-palindrome

Letter data is also fairly simple. First, all non-letters are extracted from the string, and all letters are converted to uppercase. If the resulting string is a palindrome, the reversed half is stripped. Then this mapping is applied:

A -> 00001
B -> 00010
C -> 00011
D -> 00100
...
Z -> 11010

This section is terminated with 111. After that comes the styling data, which stores upper/lower-case data and non-letters. This works like so:

01 -> next letter as uppercase
0...01 (n 0s) -> next (n-1) letters as lowercase
10 -> space
11xxxxxxx -> character with code point 0bxxxxxxx

So going through the example shown above, we have

header: 0 -> palindrome
letter data: 10100 00001 00011 01111 111 -> taco
styling data:
  01        -> T
  00001     -> aco
  10        -> <space>
  01        -> C
  0001      -> at
  110101110 -> .

When the end of the bit string is reached, all remaining characters from the letter data are appended to the result. This saves us from having to do one last 000...001 and allows us to truncate these bits from the string.

Going through the test cases:

tacocat -> 3 bytes (-4)
    24 bits: 010100000010001101111111
toohottohoot -> 5 bytes (-7)
    35 bits: 10101000111101111010000111110100111
todderasesareddot -> 7 bytes (-10)
    49 bits: 0101000111100100001000010110010000011001100101111
amanaplanacanalpanama -> 8 bytes (-13)
    59 bits: 00000101101000010111000001100000110000001011100000100011111
wasitacaroracatisaw? -> 11 bytes (-9)
    84 bits: 010111000011001101001101000000100011000011001001111111000000000000000000001110111111
Bob -> 2 bytes (-1)
    16 bits: 0000100111111101
IManAmRegalAGermanAmI -> 13 bytes (-8)
    98 bits: 00100101101000010111000001011011001000101001110000101100111010100010100101000001010100000010100101
DogeeseseeGod -> 7 bytes (-6)
    54 bits: 000100011110011100101001011001100101111010000000000101
A Santa at NASA -> 8 bytes (-7)
    63 bits: 100000110011000010111010100000011110110010000011000011001010101
Go hang a salami! I'm a lasagna hog. -> 20 bytes (-16)
   154 bits: 1000111011110100000001011100011100001100110000101100000010110101001111010011000000110001100000000111010000110011101001110011000110000000001100000111010111

ETHproductions

Posted 2018-01-29T23:15:47.580

Reputation: 47 880

Wow. I'm really impressed by this approach. I would have never thought to make a bit-packed encoding like this. (I did think of the case of packing ASCII into 7 bits, but doesn't save much space for palindromes) I'm impressed that you managed to save space on Bob as well. – Beefster – 2018-01-31T20:04:55.033

4This is a great example of the fundamentals of engineering. Taking a problem description, thinking about different ways to solve it, and making tradeoffs between requirements (ie how many bits to dedicate to various styles), etc – Robert Fraser – 2018-01-31T23:22:10.537

@Beefster Thanks :-) Bob really just fell into place--1 bit for the header, 10+3 bits for the two letters, and 2 bits for the single uppercase letter. Couldn't get it any shorter if I tried my hardest... – ETHproductions – 2018-01-31T23:33:37.743

I'm not entirely sure, but can't -~8 be golfed to +9? Or is ~ used differently here than ~x-x-1? No idea if that single saved byte would even increase the score at all though, but still.. ;) – Kevin Cruijssen – 2018-02-05T10:27:25.640

1@KevinCruijssen the problem is that the thing being added to is a string, so it has to converted to a number first. This way is a byte shorter than -0+9 – ETHproductions – 2018-02-05T12:49:22.283

1@ETHproductions Ah of course (didn't notice it was a string)! +9 would concat to the string, while -~8 would do +9 arithmetically (since - doesn't do anything for strings, so it interpret it as a number). In that case -~8 is pretty smart. :) Nice answer btw, +1 from me! Very smart storing all the information in bits like that, even saving a byte on Bob. – Kevin Cruijssen – 2018-02-05T12:53:06.427

2

Python 2: 2.765 (70 bytes saved, 641 byte program)

I changed my approach a little. It now works well on imperfect palindromes. There are no compressed strings that will be longer than the input. Perfect even-length palindromes will always compress to 50% the original size.

A=lambda x:chr(x).isalpha()
def c(s):
 r=bytearray(s);q=len(r);L=0;R=q-1;v=lambda:R+1<q and r[R+1]<15
 while L<=R:
  while not A(r[L])and L<R:L+=1
  while not A(r[R])and R:
   if v()and r[R]==32:r[R]=16+r.pop(R+1)
   R-=1
  j=r[L];k=r[R]
  if A(j)*A(k):
   if L!=R and j&31==k&31:
    r[L]+=(j!=k)*64;r[R]=1
    if v():r[R]+=r.pop(R+1)
   else:r[L]|=128;r[R]|=128
  L+=1;R-=1
 while r[-1]<16:r.pop()
 return r
def d(s):
 r='';t=[]
 for o in s:
  if 15<o<32:r+=' ';o-=16
  while 0<o<16:r+=chr(t.pop());o-=1
  if o==0:continue
  if 127<o<192:o-=64;t+=[o^32]
  elif o>192:o-=128
  elif A(o):t+=[o]
  r+=chr(o)
 while t:r+=chr(t.pop())
 return r

Results

'tacocat' <==> 'tac\xef'
4/7 (3 bytes saved)
'toohottohoot' <==> 'toohot'
6/12 (6 bytes saved)
'todderasesareddot' <==> 'todderas\xe5'
9/17 (8 bytes saved)
'amanaplanacanalpanama' <==> 'amanaplana\xe3'
11/21 (10 bytes saved)
'wasitacaroracatisaw?' <==> 'wasita\xe3ar\xef\x09?'
12/20 (8 bytes saved)
'Bob' <==> '\x82\xef'
2/3 (1 bytes saved)
'IManAmRegalAGermanAmI' <==> 'I\x8d\xa1n\x81m\x92e\xa7\xa1\xec'
11/21 (10 bytes saved)
'Dogeeseseegod' <==> '\x84ogees\xe5'
7/13 (6 bytes saved)
'A Santa at NASA' <==> 'A S\xa1\xaeta\x12\x14'
9/15 (6 bytes saved)
"Go hang a salami! I'm a lasagna hog." <==> "\x87o hang a salam\xa9!\x11'\x01\x11\x17\x13."
24/36 (12 bytes saved)

And as a bonus, it saves 6 bytes on my incorrect palindrome I had before.

'wasita\xe3ar\xef\x02\xf2\x06?' <==> 'wasitacaroraratisaw?'
6 bytes saved

Explanation

Decompression uses a stack. Codepoints from 32-127 are treated literally. If a character is a letter, a value is pushed onto the stack as well. Values 128-192 are used for case flipped letters, so the caseflipped letter (o^32 because of how ASCII is laid out) gets pushed onto the stack and the normal letter gets added to the string. Values 192-255 are used to add letters without pushing to the stack, so this is used when letters do not match and for the middle letter in odd-length palindromes. Codepoints 1-15 indicate that the stack should be popped that number of times. Codepoints 17-31 are similar, but they print a space first before popping from the stack. There is also an implicit "pop until empty" instruction at the end of an input.

The compressor works from both ends and folds in matching letters as values 1-31. It skips over non-letters. When the letters match but the case does not, it adds 64 to the left letter and increments the right letter. This allows it to save space on IManAmRegalAGermanAmI. At the middle or when the letters do not match, it ors 128 to both sides. I can't add there because I need to avoid the special case where left == right. When folding neighboring pop markers on the right side, I have to check that the neighboring one won't overflow into codepoint 16 because I need that for spaces. (This is not actually an issue for any of the test case strings)

EDIT 1: No more ungolfed version.

Beefster

Posted 2018-01-29T23:15:47.580

Reputation: 6 651

1

Python3, 1.833 (25 bytes saved, 186 byte program)

Just simple 0-order equal-probability entropy coding. No palindrome-specific optimizations.

def C(s):
    u=0
    for c in s:u=u*96+ord(c)-31
    return u.to_bytes((u.bit_length()+7)//8,'big')
def D(a):
    u,s=int.from_bytes(a,'big'),''
    while u:s,u=s+chr((u%96)+31),u//96
    return s[::-1]

user1502040

Posted 2018-01-29T23:15:47.580

Reputation: 2 196

0

Java 8, score: 1.355 (20 bytes saved / 218 (107+111) bytes)

Compress function (contains three unprintable ASCII characters):

s->{int l=s.length();return s.contains(new StringBuffer(s).reverse())?s.substring(l/2)+(l%2<1?"":""):s;}

Decompress function (contains two unprintable ASCII characters):

s->{return s.contains("")?new StringBuffer((s=s.replaceAll("","")).substring(s.length()&1^1)).reverse()+s:s;}

Explanation:

Try it online.

Only compresses perfect palindromes.

s->{                      // Method with String as both parameter and return-type
  int l=s.length();       //  Get the length of the input
  return s.contains(new StringBuffer(s).reverse())?
                          //  If the input is a palindrome:
    s.substring(l/2)      //   Only return the second halve of the String
    +(l%2<1?"":"")        //   + either one (if even) or two (if odd) unprintables 
   :                      //  Else:
    s;}                   //   Simply return the input again

s->{                      // Method with String as both parameter and return-type
  return s.contains("")?  //  If the input contains an unprintable:
    new StringBuffer((s=s.replaceAll("",""))
                          //   Remove the unprintables
                     .substring(s.length()&1^1))
                          //   And take either the full string (if even),
                          //   or minus the first character (if odd)
    .reverse()            //    And reverse that part
    +s                    //   And append the rest of the input (minus the unprintables)
   :                      //  Else:
    s;}                   //   Simply return the input again

Kevin Cruijssen

Posted 2018-01-29T23:15:47.580

Reputation: 67 575