Implement bzip2's run-length encoding

14

1

Background

After applying the BWT (as seen in Burrows, Wheeler and Back) and the MTF (as seen in Move to the printable ASCII front), the bzip2 compressor applies a rather unique form of run-length encoding.

Definition

For the purpose of this challenge, we define the transformation BRLE as follows:

Given an input string s that consists solely of ASCII characters with code points between 0x20 and 0x7A, do the following:

  1. Replace each run of equal characters by a single occurrence of the character and store number of repetitions after the first.

  2. Encode the number of repetitions after the first occurrence of the character, using bijective base-2 numeration and the symbols { and }.

    A non-negative integer n is encoded as the string bk…b0 such that n = 2ki(bk)+…+20i(b0), where i({) = 1 and i(}) = 2.

    Note that this representation is always unique. For example, the number 0 is encoded as an empty string.

  3. Insert the string of curly brackets that encodes the number of repetitions after the single occurrence of the corresponding character.

Step-by-step example

Input:  "abbcccddddeeeeeffffffggggggghhhhhhhh"
Step 1: "abcdefgh" with repetitions 0, 1, 2, 3, 4, 5, 6, 7
Step 2: "" "{" "}" "{{" "{}" "}{" "}}" "{{{"
Step 3: "ab{c}d{{e{}f}{g}}h{{{"

Task

Implement an involutive program or function that reads a single string from STDIN or as command-line or function argument and prints or returns either the BRLE or its inverse of the input string.

If the input contains no curly brackets, apply the BRLE. If the input contains curly brackets, apply its inverse.

Examples

INPUT:  CODEGOLF
OUTPUT: CODEGOLF

INPUT:  PROGRAMMING
OUTPUT: PROGRAM{ING

INPUT:  PUZ{LES
OUTPUT: PUZZLES

INPUT:  444488888888GGGGGGGGGGGGGGGGWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
OUTPUT: 4{{8{{{G{{{{W{{{{{

INPUT:  y}}}{{
OUTPUT: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

Additional rules

  • You cannot use any built-ins that compute the BRLE or its inverse of a string.

  • You can use built-ins that:

    • Compute the RLE or RLD of a string, as long as the number of repetitions is not stored in bijective base-2.

    • Perform base conversion of any kind.

  • Your code may print a trailing newline if you choose STDOUT for output.

  • Your code has to work for any input of 1000 or less ASCII characters in the range from 0x20 to 0x7A, plus curly brackets (0x7B and 0x7D).

  • If the input contains curly brackets, you may assume that it is a valid result of applying the BRLE to a string.

  • Standard code golf rules apply. The shortest submission in bytes wins.

Dennis

Posted 2015-07-13T19:57:56.783

Reputation: 196 637

Why are builtins not allowed? – MilkyWay90 – 2019-06-02T17:25:10.633

Answers

4

CJam, 50 48 bytes

l_{}`:T&1${_T#)_(@a?)+}%{(\2b)*}%@e`{(2b1>Tf=}%?

Thanks for Dennis for saving 2 bytes.

Try it online.

Explanation

l_              e# Read and duplicate input.
{}`:T           e# T = "{}"
&               e# If the input has '{ or '}:
    1$          e# Input.
    {           e# For each character:
        _T#)    e# If it is '{ or '}:
            _(  e# Return 0 for '{ or 1 for '}.
            @a  e# Otherwise, convert the character itself to an array (string).
        ?
        )+      e# If it is a number, increment and append to the previous array.
                e# If it is a string with at least 1 character, do nothing.
    }%
    {(\         e# For each character and bijective base 2 number:
        2b)*    e# Repeat the character 1 + that many times.
    }%
                e# Else:
    @           e# Input.
    e`          e# Run-length encoding.
    {(          e# For each character and length:
        2b1>    e# Convert the length to base 2 and remove the first bit.
        Tf=     e# Map 0 to '{ and 1 to '}.
    }%
?               e# End if.

jimmy23013

Posted 2015-07-13T19:57:56.783

Reputation: 34 042

3

Pyth, 48 50 bytes

J`Hs?m*hdi+1xLJtd2tczf-@zTJUz@Jzsm+ed@LJtjhd2rz8

2 bytes thanks to @Maltysen.

Demonstration. Test harness.

Explanation:

J`Hs?m*hdi+1xLJtd2tczf-@zTJUz@Jzsm+ed@LJtjhd2rz8
                                                    Implicit: z = input()
                                                    H is empty dict.
J`H                                                 J = repr(H) = "{}"
   s                                                Print the concatenation of
    ?                        @Jz                    If z and J share any chars:
                     f     Uz                       Filter range(len(z))
                      -@zTJ                         On the absence of z[T] in J.
                   cz                               Chop z at these indices.
                                                    just before each non '{}'.
                  t                                 Remove empty initial piece.
     m*hd                                           Map to d[0] *
         i       2                                  the base 2 number                             
            xLJtd                                   index in J mapped over d[:-1]
          +1                                        with a 1 prepended.
                                             rz8    Otherwise, run len. encode z
                                 m                  map over (len, char)
                                         jhd2       Convert len to binary.
                                        t           Remove leading 1  
                                     @LJ            Map to element of J.
                                  +ed               Prepend char.
                                s                   Concatenate. 

isaacg

Posted 2015-07-13T19:57:56.783

Reputation: 39 268

instead of "{}" you can use \H`, tied with CJam :) – Maltysen – 2015-07-15T09:50:23.213

@Jakube Sorry for the mixup. – isaacg – 2015-07-15T21:23:27.203

2

OCaml, 252

t is the function that does the transformation.

#load"str.cma"open Str
let rec(!)=group_beginning and
g=function|1->""|x->(g(x/2)^[|"{";"}"|].(x mod 2))and($)i s=if g i=s then i else(i+1)$s and
t s=global_substitute(regexp"\(.\)\1*\([{}]*\)")(fun s->String.make(1$matched_group 2 s)s.[!1]^g(!2- !1))s

At first I was thinking I had to check for the presence of curly braces, but it turned out to be unnecessary. The decoding clearly has no effect on strings that are already decoded, and the encoding part proved equally harmless to ones that were already encoded.

feersum

Posted 2015-07-13T19:57:56.783

Reputation: 29 566

the encoding part proved equally harmless does it? Encoding 4{{8{{{G{{{{W{{{{{ don't you get 4{{8{}G{{{W{{}? – edc65 – 2015-07-15T09:04:03.293

@edc65 No, I get the answer specified in the examples. How are you testing it? – feersum – 2015-07-15T09:13:35.183

"4{{8{{{G{{{{W{{{{{" as input is not one of the examples. Did you tried it? – edc65 – 2015-07-15T09:41:08.533

@edc65 It is the inverse of one of the examples and I tested them both ways. Yes, I tried it, both before posting and after your comment. – feersum – 2015-07-15T09:43:19.613

Ok, good. I pointed out the quoted sentence, as a "straightforward` encoding (as mine) would not be harmless at all with the given test case. Clearly your encoding part is more clever. – edc65 – 2015-07-15T09:56:49.767

1

JavaScript (ES6), 162

f=s=>
(t=s[R='replace'](/[^{}][{}]+/g,n=>n[0].repeat('0b'+n[R](/./g,c=>c!='{'|0))))==s
?s[R](/(.)\1*/g,(r,c)=>c+r.length.toString(2).slice(1)[R](/./g,c=>'{}'[c])):t

// TEST
out=x=>O.innerHTML += x + '\n';

test=s=>O.innerHTML = s+' -> '+f(s) +'\n\n' + O.innerHTML;

[['CODEGOLF','CODEGOLF']
,['PROGRAMMING','PROGRAM{ING']
,['PUZ{LES','PUZZLES']
,['444488888888GGGGGGGGGGGGGGGGWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW','4{{8{{{G{{{{W{{{{{']
,['y}}}{{','yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy']]
.forEach(v=>{
  w=f(v[0])  
  out('Test ' + (w==v[1]?'OK':'Fail')+'\nInput:    '+v[0]+'\nExpected: '+v[1]+'\nResult:   '+w)
})  
Your test: <input id=I style='width:300px'><button onclick='test(I.value)'>-></button>
<pre id=O></pre>

Some explanation

Number n to BB2 using 0 and 1: (n+1).toString(2).slice(1)

String in BB2 to number: to_number ("0b1"+string) - that is, add a leftmost 1 binary digit and convert from binary (and decrease by 1, not needed in this specific instance).

Regular expression to find any character followed by { or }: /[^{}][{}]+/g

Regular expression to find repeated characters: /(.)\1*/g

Using that regexp in replace, first param is the "repeated" char (eventually repeat just 1 time), second param is the total repeated string, whose length is the number that I need to encode in BB2 already incremented by 1

... then put all together ...

edc65

Posted 2015-07-13T19:57:56.783

Reputation: 31 086