Compressing a DNA-like string

1

1

Challenge:

Write a a program to take an array of 24 4-letter strings, and output a compressed (lossless) representation of the array. You should also provide a program to decompress your representations.

Input:

The input will be comprised of 4 letter words using each of [G,T,A,C] exactly once (e.g. GTAC, ATGC, CATG, etc). There are always exactly 24 words, so the full input is something like:

GTAC CATG TACG GACT GTAC CATG TACG GACT GTAC CATG TACG GACT GTAC CATG TACG GACT GTAC CATG TACG GACT GTAC CATG TACG GACT

So there are \$4!=24\$ possibilities for each word, and 24 words in total.

Output:

The output should be a printable ASCII string that can be decompressed back into the input string. Since this is lossless compression, each output should match to exactly one input.

Example:

>>> compress(["GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT"])
"3Ax/8TC+drkuB"

Winning Criterion

Your code should minimize the output length. The program with the shortest output wins. If it's not consistent, the longest possible output will be the score.

Edit: Based on helpful feedback comments, I would like to amend the criterion to add:

If two entries have the same results, the shortest code measured in characters wins. If an entry is length dependent, a length of 1 million words is the standard.

However, I fear the community angrily telling me I didn't say it right and downvoting me to the abyss, so I leave to community editors to see if it makes sense with code golf traditions

AwokeKnowing

Posted 2013-12-11T18:53:00.437

Reputation: 201

1I tried to reword this and clean it up a bit. I did change one thing: now, the output is scored by the maximum length instead of the average, for ease of scoring. I hope this is okay with you. If you want to do average, you should pick a specific set of inputs to test it on. – Rɪᴋᴇʀ – 2019-05-17T02:19:55.337

You also changed the output from alphanumeric to the entirety of printable ASCII, but the one existing answer didn't comply with that to begin with. – Unrelated String – 2019-05-17T02:30:04.190

4Despite the edits for clarity, this remains a challenge that seems pretty boring. An optimal solution is to map the n'th possible input onto the n'th ASCII string sorted lexicographically, so anyone can win by coding that or anything achieving the same maximum length. In fact, the existing solution does exactly this, using alphanumeric strings as the old rule required which could be easily adjusted to all printable ASCII. I think a better direction would be to make the challenge code golf and require the optimal output length. – xnor – 2019-05-17T02:31:39.790

@xnor a good comment. If optimal solutions are known, then code golf should be the standard. – AwokeKnowing – 2019-05-17T14:21:55.900

Answers

6

Python (output-length: 18 or 19, mean 18.80)

#list of the 24 possible words 
Lbase = ['GTAC','GTCA','GATC','GACT','GCTA','GCAT','TGAC',\
'TGCA','TAGC','TACG','TCGA','TCAG','AGTC','AGCT','ATGC',\
'ATCG','ACGT','ACTG','CGTA','CGAT','CTGA','CTAG','CAGT','CATG']

#base-64 alphabet (used only in the output string-representation) 
alph = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/'

#change a bijective list-representation base from b to newb 
def chg_bij_base(L, b, newb):
    n = sum([L[i]*b**i for i in range(len(L))])
    L = []
    while n > 0:
        L.append(1+(n-1)%newb)
        n = (n-1)//newb
    return L

#convert a bijective list-repr. to base-64 string-repr.
def list_to_str(Lin):
    L24 = [Lbase.index(w)+1 for w in Lin]
    L64 = chg_bij_base(L24,24,64) 
    return ''.join([alph[i-1] for i in L64])

#convert a base-64 string to the corresponding base-64 list-repr.
def str_to_list(s):
    L64 = [alph.index(c)+1 for c in s]
    L24 = chg_bij_base(L64,64,24)
    return [Lbase[i-1] for i in L24]

Test run:

Lin = ["GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT", "GTAC", "CATG", "TACG", "GACT"]

#convert input list to a base-64 string  
lts = list_to_str(Lin) 

#convert the base-64 str back to the list   
stl = str_to_list(lts)

print(len(lts))
print(lts)
print(stl == Lin)

Test output:

18
acFZIBuiiZJ+dwUn2V
True

This uses bijective numeration (which has no 0 digit) to get invertibility.

NB: When using base-64 with this method, all output strings for the above problem are either 18 or 19 characters. (A million random inputs gave an average of 18.80.) The same method works with other bases:

base         output-length
---------    -------------
56 - 58      19
59 - 69      18 or 19
70 - 73      18
74 - 88      17 or 18
89 - 96      17
...          ...
233 - 256    14

r.e.s.

Posted 2013-12-11T18:53:00.437

Reputation: 2 872

Well done. That's exactly the kind of thing I expected. What if I added more characters to the list, like !.<>[]{} , then change the references from 64 to 72? will the algorithm work the same way? (I just wanted to know that before I port it to php and js). – AwokeKnowing – 2013-12-12T19:21:56.293

There's a guy over on SO who says it's impossible to code 24^24 inputs into 18 characters using a 64-character set. Does this algorithm give considerably longer strings when the inputs are more random? The SO question is here: http://stackoverflow.com/questions/20527243/simple-compact-code-for-compressing-dna-like-strings/20534427

– AwokeKnowing – 2013-12-12T20:54:25.153

@AwokeKnowing - When using a 64-character alphabet with this method, all output strings for the above problem are either 18 or 19 characters. A million random inputs gave an average of 18.80. The same method works with larger alphabets; e.g., with size 90, a million random inputs all had output strings of length 17. I don't know how well the algorithm might port to JS or PHP. – r.e.s. – 2013-12-13T04:41:39.597

@res Well thanks. looks like people who didn't have an answer decided to blame the question :) oh well. I've accepted your answer, great job. – AwokeKnowing – 2013-12-13T16:37:28.210