DNA Encode a String

7

2

Challenge

You must write an encoder (and a separate decoder) which takes a string as input and outputs the string encoded in the style of a strand of DNA.

DNA

DNA is made up of four types of nucleotide:

  • Adenine (A)
  • Thymine (T)
  • Cytosine (C)
  • Guanine (G)

Adenine and thymine pair up together to make AT or TA. Similarly, cytosine and guanine pair up together to make CG or GC.

Let's call these pairs units. This means that your encoded string is only allowed to contain the four units AT, TA, CG and GC.

This means that:

ATTAGACG

would be invalid because it contains the unit GA, which is impossible.

Similarly,

TAGCGCCGATA

is invalid because the final A does not have a thymine to pair up with.
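The two validity rules above (only the units AT, TA, CG and GC, and no unpaired trailing nucleotide) can be sketched as a small checker. This is only an illustration and not part of the challenge; the names `VALID_UNITS` and `is_valid_dna` are made up for this example.

```python
# Hypothetical helper, not part of the challenge: checks the two rules above.
VALID_UNITS = {"AT", "TA", "CG", "GC"}

def is_valid_dna(strand: str) -> bool:
    """Return True if the strand splits cleanly into allowed units."""
    if len(strand) % 2:          # an odd length leaves one nucleotide unpaired
        return False
    return all(strand[i:i + 2] in VALID_UNITS
               for i in range(0, len(strand), 2))

print(is_valid_dna("ATTAGACG"))     # contains the unit GA
print(is_valid_dna("TAGCGCCGATA"))  # final A has no partner
print(is_valid_dna("ATTACGGC"))     # all four allowed units
```

Running a check like this on an encoder's output is a quick way to confirm it obeys the unit rules before scoring it.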

Example

You must encode the following example text and use its encoded length as part of your score:

I have a friend who's an artist and has sometimes taken a view which I don't agree with very well. He'll hold up a flower and say "look how beautiful it is," and I'll agree. Then he says "I as an artist can see how beautiful this is but you as a scientist take this all apart and it becomes a dull thing," and I think that he's kind of nutty. First of all, the beauty that he sees is available to other people and to me too, I believe. Although I may not be quite as refined aesthetically as he is ... I can appreciate the beauty of a flower. At the same time, I see much more about the flower than he sees. I could imagine the cells in there, the complicated actions inside, which also have a beauty. I mean it's not just beauty at this dimension, at one centimeter; there's also beauty at smaller dimensions, the inner structure, also the processes. The fact that the colors in the flower evolved in order to attract insects to pollinate it is interesting; it means that insects can see the color. It adds a question: does this aesthetic sense also exist in the lower forms? Why is it aesthetic? All kinds of interesting questions which the science knowledge only adds to the excitement, the mystery and the awe of a flower. It only adds. I don't understand how it subtracts.

Bounty

If you answer this challenge in the language DNA#, I will offer a 100 to 200 rep bounty.

Rules

Your encoder and decoder programs must work for any string supplied. Therefore, your encoder must produce a genuine encoding and your decoder should not simply hardcode the example text.

You must support all printable ASCII characters:

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

If you can support more characters, that is acceptable.

Scoring

Your score is the length in characters (not units) of the encoded version of the above text, plus the lengths in bytes of your decoder and your encoder.
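As a rough sketch of the scoring rule above, the score is just three lengths added together. The function name `score` and its parameters are assumptions made for illustration; the sample values are placeholders, not a real submission.

```python
# Hypothetical scoring helper: encoded length in characters plus the byte
# lengths of the encoder and decoder sources.
def score(encoded_text: str, encoder_source: bytes, decoder_source: bytes) -> int:
    return len(encoded_text) + len(encoder_source) + len(decoder_source)

# Placeholder inputs: 2576 units (5152 characters), a 21-byte encoder
# and a 23-byte decoder.
example = score("AT" * 2576, b"e" * 21, b"d" * 23)
print(example)
```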

Winning

The program with the lowest score wins.

Beta Decay

Posted 2016-08-30T08:41:30.863

Reputation: 21 478

Why is the encoder not included in the byte-count? – Leaky Nun – 2016-08-30T08:47:01.370

@LeakyNun I don't know, I guess I didn't think about both of them. It's been included now – Beta Decay – 2016-08-30T08:50:53.473

Any valid program in Brainfuck is an equivalent valid program in DNA#. – Leaky Nun – 2016-08-30T08:53:31.550

@LeakyNun I know but I still want to see it in DNA# – Beta Decay – 2016-08-30T08:54:12.693

Are only alphanumeric characters, dot, space, comma and quote valid, or is any printable ASCII character allowed in the input? – Sefa – 2016-08-30T09:00:08.373

If you're including the encoder in the score then you should remove the tag [tag:kolmogorov-complexity] – Peter Taylor – 2016-08-30T09:44:58.990

@PeterTaylor Oh, okay – Beta Decay – 2016-08-30T09:45:26.707

What's the maximum input length? – Mast – 2016-08-30T12:11:23.603

@Mast There isn't a maximum, as long as your program will handle it – Beta Decay – 2016-08-30T12:41:11.093

Is there any source for a DNA# interpreter? The linked one on esolang seems to be broken for me – Aaron – 2016-08-30T20:03:56.387

@Aaron Not that I know of, but I'll have a look around – Beta Decay – 2016-08-30T21:07:12.960

@βετѧΛєҫαγ: This challenge gives me an idea for another DNA-related one (namely, restriction mapping)! Unmodified, it would probably be too complicated for a classical code golf task, however. – Tim Čas – 2016-08-30T23:45:01.763

May the encoder and the decoder have some code in common? Or are they supposed to be separate standalone modules? – Arnauld – 2016-08-31T09:05:54.700

@Arnauld They have to be completely standalone – Beta Decay – 2016-08-31T09:07:04.017

I'm working on a simple Python interpreter for DNA#, although I've decided to make one small alteration to the language. Where's the best place to post the script? – Aaron – 2016-09-01T19:34:59.390

@Aaron Anywhere really. Github, Pastebin... – Beta Decay – 2016-09-01T19:36:50.263

Answers

6

Pyth, 5152 + 21 + 23 = 5196 bytes

Encoder

sm@c4s_B"ATCG"djC.ZQ4

Try it here

Decoder

.ZCimxc4s_B"ATCG"dcQ2 4

Try it here (Input too long to link, put output of first in quotes in the input box)

Compresses the input, then treats the compressed string as a base-256 number and converts it into an integer. That integer is then converted into a sequence of base-4 digits, and each digit chooses a unit from ['AT', 'CG', 'GC', 'TA'].
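For readers who don't read Pyth, the same idea can be sketched in Python. This is an approximation under stated assumptions, not a translation checked against the Pyth programs: it assumes zlib compression, and the names `UNITS`, `encode` and `decode` are chosen for this example. Its output length for the example text may differ from the 5152 reported above.

```python
import zlib

# Digit-to-unit order taken from the explanation above.
UNITS = ["AT", "CG", "GC", "TA"]

def encode(text: str) -> str:
    packed = zlib.compress(text.encode("utf-8"))
    value = int.from_bytes(packed, "big")      # compressed bytes as a base-256 integer
    digits = []
    while value:                               # peel off base-4 digits, lowest first
        value, digit = divmod(value, 4)
        digits.append(digit)
    return "".join(UNITS[d] for d in reversed(digits))

def decode(dna: str) -> str:
    value = 0
    for i in range(0, len(dna), 2):            # rebuild the integer from base-4 digits
        value = value * 4 + UNITS.index(dna[i:i + 2])
    packed = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return zlib.decompress(packed).decode("utf-8")

sample = "I don't understand how it subtracts."
print(decode(encode(sample)) == sample)
```

A zlib stream starts with a non-zero header byte, so no leading zero digits are lost when the bytes are read as one big integer.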

Blue

Posted 2016-08-30T08:41:30.863

Reputation: 26 661

@βετѧΛєҫαγ I'm not sure why I didn't originally – Blue – 2016-08-30T08:59:56.187

@LuisMendo added some form of explanation – Blue – 2016-08-30T11:52:35.837

Thanks. The part "Compresses the input" was what I was missing – Luis Mendo – 2016-08-30T11:57:19.453

Without compression, the score would be about twice as big – Blue – 2016-08-30T11:58:26.087

5

DNA# : 10216 + 852 + 384 = 11452 bytes

This is not exactly tested, as I haven't gotten a hold of a working interpreter... There are probably mistakes

I use a simple brute-force approach so far: each character is read in, then broken up into 2-bit blocks with AT = 00; TA = 01; CG = 10; GC = 11. The sample text is 1277 ASCII characters, and it takes 4 units (2 bits and 2 output characters each) to encode each 8-bit char: 1277 * 4 * 2 = 10216 bytes
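The 2-bit mapping described above is easier to follow in a conventional language. Below is a minimal Python sketch of that scheme only, not a port of the DNA# programs; the names are illustrative, and it assumes the top two bits of each character are emitted first, as the commented encoder further down suggests.

```python
# Unit for each 2-bit value: 00, 01, 10, 11.
UNITS = ["AT", "TA", "CG", "GC"]

def encode(text: str) -> str:
    # Four units per 8-bit character, most significant bit pair first.
    return "".join(UNITS[(ord(ch) >> shift) & 3]
                   for ch in text for shift in (6, 4, 2, 0))

def decode(dna: str) -> str:
    chars = []
    for i in range(0, len(dna), 8):            # 8 nucleotides per character
        value = 0
        for j in range(i, i + 8, 2):
            value = value * 4 + UNITS.index(dna[j:j + 2])
        chars.append(chr(value))
    return "".join(chars)

print(encode("I"))            # 73 = 01 00 10 01 -> TAATCGTA
print(decode(encode("DNA#")))
```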

Encoder

852 bytes - line form

GCGCGCTAATATATTAATTAATTAATTAATTAATTAATTAATTATACGCGCGATATTAATATGCATGCCGCGGCTACGATATGCCGCGATATATTAATTAATTAATTAATTAATTAATTAATTATACGCGCGATTAATATATTAATTAATTAATTAATTAATTAATTAATTAATTATACGCGCGATTAATTAATTAATGCATGCGCTAATATATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATATATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATGCATGCATCGGCCGGCTAATATATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATATATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATGCATGCATCGGCCGGCTAATATATTAATTAATTAATTAATATATCGATCGATCGATCGATGCATGCATCGGCCGATATGCATATATGCATATGCATGCATGCATCGGCTAATTAATATTAATATGCATGCCGCGCGATATGCCGCGTACGATGCCGCGATGCATGCTATAATATATATCGCGATATATATGCTAATCGGCCGATTAATTAATTAATTAATGCCGATATATCGCGATATTAATATGCATGCCGCGATATGCTAATCGGCCGATATGCTAATCGGCCGGCCGATGCATGCGCCGATGCATGCGCGCGCCG

224 bytes - symbol form:

,[>++++++++*=X>:=<<X[/=<X>++++++++*=X+>+++++++++*=X+++<<[>+++++++++++++++++++>-------------------<<-][>----------------->+++++++++++++++++<<-][>++++>----<<-]>.>.<<<-[+>:=<<X/=<X*=<X<<-=>>X>>[-]++++</=>X>:=<<X>[-]>[-]]<<]<<,]

and with variable names and some messy comments...

DNA# dna encoder

a = input              //get first char
[                      //for each char
    b = 64             //divisor to get first 2 bits
    c = a              //initialize as a
    [                  //while c != 0
        c /= b
        d = 65         //first print char
        e = 84         //second print char
        c              //select c
        [              //if c > 0 (3,2,1)
            d + 19
            e - 19
            c - 1
        ]
        [              //if c > 1 (3,2)
            d - 17
            e + 17
            c - 1
        ]
        [              //if c > 2 (3)
            d + 4
            e - 4
            c - 1
        ]              //c is now 0 (hopefully)
        print d
        print e
        b - 1
        [              //if b is not 1 (last 2 bits)
            b + 1      //reset b
            c = a      //trash top 2 bits (modulus algorithm)
            c /= b
            c *= b
            a -= c
            c = 0      //reset c
            c + 4
            b /= c     //b >> 2
            c = a      //copy a again for next 2 bits
            d = 0      //get a 0 value so we don't loop
        ]
        c              //select c (0 if done with char: exit loop and get next char)
    ]
    a = input          //get next char
]

Decoder

384 bytes - line form

GCGCGCGCGCTAATATATATATTAATTAATTAATTAGCTAATATATTAATTAATTAATTAATTAATTAATTAATTAATTATACGCGCGATTAATTAATTAATGCATGCATGCTATAATATATATATATCGCGGCTAATATATTAATGCATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAATTAGCCGGCTAATATATTAATGCATCGATCGATCGATCGATCGATCGGCCGGCTAATATATTAATGCATTAATTAATTAATTAGCCGATTAATTAATTAATTAATATTACGATGCCGCGATGCGCGCGCGCATATATATATCGGCCGATGCGCATATGCGCCG

99 bytes - symbol form

,,[>>++++[>+++++++++*=X+++<<<-=>>>X[>+<+++++++++++++++++++][>+<------][>+<++++]++++>*=<X<,,>>-]<.<]

var names...

DNA# dna decoder


,
,                  /* read second char into a*/
[                  /* for each set of 4 pairs */
    c = 4          /* loop iterator */
    [
        a - 84     /* if pair was AT (65,84): don't increment b */
        [
            b + 1
            a + 19 /* if pair was TA (84,65): increment b once */
        ]
        [
            b + 1
            a - 6  /* if pair was CG (67,71): increment b twice */
        ]
        [
            b + 1
            a + 4  /* if pair was GC (71,67): increment b three times */
        ]
        a = 4
        b *= a     /* shift b up 2 bits */
        a,,        /* get next char (skip one) */
        c - 1      /* decrement loop iterator: exit loop on 0 */
    ]
    b
    print          /* output b in ASCII */
    a              /* select a (next char) */
]                  /* end if next char is 0 (null terminated string) */

If anyone finds a mistake or is able to run the web interpreter feel free to tell me how wrong I am :P

Aaron

Posted 2016-08-30T08:41:30.863

Reputation: 1 213

I'll start work on making an interpreter tomorrow – Beta Decay – 2016-08-31T18:31:20.903

I'm going to give you the bounty, but could you do this in line form instead of symbol form? I feel it'd be more in the spirit of the challenge :) – Beta Decay – 2016-08-31T19:36:32.783

@βετѧΛєҫαγ haha yah, sure.. I used symbol form to develop as it was more readable (if you could call it that) and somewhat brainfu*k compatible. – Aaron – 2016-09-01T17:08:43.457

3

Python 2, 5152 + 98 + 139 = 5389 (previously 10216 + 72 + 122 = 10410, then 10216 + 72 + 110 = 10398)

Encoder:

import zlib
lambda s:''.join('ACGTTGCA'[ord(c)>>b&3::4]for c in zlib.compress(s)for b in(0,2,4,6))

Decoder:

import zlib
lambda s:zlib.decompress(''.join(chr(int(''.join([str('ACGT'.index(s[i+j]))for j in(6,4,2,0)]),4))for i in xrange(0,len(s),8)))

Neil

Posted 2016-08-30T08:41:30.863

Reputation: 95 035

Wow, that encoder and that decoder are so short – Beta Decay – 2016-08-30T19:05:24.280

@βετѧΛєҫαγ Even shorter once I'd fixed the byte count and golfed some more bytes off! – Neil – 2016-08-30T20:15:58.713

@βετѧΛєҫαγ ...although I've since figured out compression, as evidenced by getting the same score for my DNA as everyone else. – Neil – 2016-08-30T20:31:05.977

3

JavaScript (ES6), 8520 + 129 + 147 = 8796 (previously 10216 + 85 + 97 = 10398)

Encoder:

s=>s.replace(/..?.?/g,b=>"9876543210".replace(/./g,b=>'ACGT'[b=r>>b*2&3]+'TGCA'[b],r=0,[...b].map(c=>r=r*96+127-c.charCodeAt())))

Decoder:

s=>s.replace(/../g,c=>'ACGT'.search(c[0])).replace(/.{10}/g,b=>String.fromCharCode(...[9216,96,1].map(n=>parseInt(b,4)/n%96^127).filter(n=>n^127)))

Works by packing up to three printable ASCII characters into 20 nucleotides (10 units), which saves about 16.7% compared with 8 nucleotides per character. The character codes 32 to 126 are mapped to the values 95 down to 1 to shorten the decoder, as it can then use ^127 to floor the division and restore the character code. Excess nucleotides get decoded into 127, which needs to be skipped. The encoder could be 1 byte shorter and the decoder 7 bytes shorter in Firefox 30-57 (note that these are incompatible with the above):

s=>s.replace(/..?.?/g,b=>"9876543210".replace(/./g,b=>'ACGT'[b=r>>b*2&3]+'TGCA'[b],r=0,[...b].map(c=>r=r*96+c.charCodeAt()-31)))

s=>s.replace(/../g,c=>'ACGT'.search(c[0])).replace(/.{10}/g,b=>String.fromCharCode(...(for(n of[9216,96,1])if(n=parseInt(b,4)/n%96|0)n+31)))
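The base-96 packing described above can be sketched in Python for readability. This follows the first (non-Firefox) encoder and decoder as I read them, so treat it as an illustration rather than a verified port; the function names are made up here. Each group of up to three characters becomes one number below 96^3 = 884736, which fits in ten base-4 digits because 4^10 = 1048576.

```python
def encode(text: str) -> str:
    out = []
    for k in range(0, len(text), 3):           # up to three characters per group
        r = 0
        for ch in text[k:k + 3]:
            r = r * 96 + 127 - ord(ch)         # codes 32..126 become 95..1
        for shift in range(18, -1, -2):        # ten base-4 digits, highest first
            d = (r >> shift) & 3
            out.append("ACGT"[d] + "TGCA"[d])  # 0=AT, 1=CG, 2=GC, 3=TA
    return "".join(out)

def decode(dna: str) -> str:
    out = []
    for k in range(0, len(dna), 20):           # 20 nucleotides per group
        value = 0
        for u in range(k, k + 20, 2):
            value = value * 4 + "ACGT".index(dna[u])
        for n in (9216, 96, 1):                # 96**2, 96**1, 96**0
            code = value // n % 96
            if code:                           # 0 marks an unused slot in a short group
                out.append(chr(127 - code))
    return "".join(out)

print(len(encode("abc")), len(encode("abcd")))
print(decode(encode("Why is it aesthetic?")))
```

With 1277 characters there are 426 groups, and 426 * 20 = 8520, which matches the encoded length in the heading.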

Neil

Posted 2016-08-30T08:41:30.863

Reputation: 95 035

3

JavaScript (ES6), 6038 bytes (previously 6138)

These functions will encode and decode any printable ASCII characters as required by the challenge rules. They are however using a static Huffman code which is optimized for the example text. They should work pretty well on any other English text, as long as there aren't too many capital letters, digits or miscellaneous symbols. But they will perform poorly on a random input.
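To make the static Huffman scheme easier to follow, here is a Python sketch of my reading of the encoder and decoder below. It is an illustration under assumptions, not the author's code: the group string is copied from the JavaScript, group i gets codes of length i + 3, and any other character is written as twelve 1 bits followed by its 7-bit character code. The names `GROUPS`, `build_table`, `encode` and `decode` are invented for this example.

```python
GROUPS = ' e|ahinost|cdlr|.Ibfmuwy|\',gpv|"k|;?ATqx|FHW|j|:'.split("|")
UNITS = ["AT", "TA", "CG", "GC"]               # bit pairs 00, 01, 10, 11

def build_table():
    """Assign consecutive codes per group; each group is one bit longer."""
    table, n = {}, 0
    for i, group in enumerate(GROUPS):
        for ch in group:
            table[ch] = format(n, "b").zfill(i + 3)
            n += 1
        n *= 2                                 # first code of the next, longer group
    return table

def encode(text: str) -> str:
    table = build_table()
    bits = "".join(table.get(ch, "1" * 12 + format(ord(ch), "07b")) for ch in text)
    if len(bits) % 2:
        bits += "0"                            # pad to a whole number of units
    return "".join(UNITS[int(bits[k:k + 2], 2)] for k in range(0, len(bits), 2))

def decode(dna: str) -> str:
    inverse = {code: ch for ch, code in build_table().items()}
    bits = "".join(format(UNITS.index(dna[k:k + 2]), "02b")
                   for k in range(0, len(dna), 2))
    out, pos = [], 0
    while len(bits) - pos >= 3:                # shortest code is 3 bits; a pad bit is ignored
        code = ""
        while code not in inverse and len(code) < 19 and pos < len(bits):
            code += bits[pos]
            pos += 1
        # A 19-bit run that matched nothing is an escape: its low 7 bits are the character.
        out.append(inverse.get(code) or chr(int(code[-7:], 2)))
    return "".join(out)

print(encode(" "), encode("e"))
print(decode(encode("Zebra? 42!")))
```

Frequent characters such as space and "e" cost 3 bits, while anything outside the table costs 19 bits, which is why the scheme does poorly on random input.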

Edit: saved 94 bytes by rethinking the decoder logic and including all hints provided by Neil

Encoder (281 bytes, previously 305)

s=>(n=0,C={},` e|ahinost|cdlr|.Ibfmuwy|',gpv|"k|;?ATqx|FHW|j|:`.split`|`.map((g,i)=>([...g].map(c=>C[c]=('00'+(n++).toString(2)).slice(-i-3)),n*=2)),s=s.replace(/./g,c=>C[c]||(524160|c.charCodeAt()).toString(2)),s.length&1&&(s+=0),s.replace(/../g,c=>'ATCG'[c=+`0b${c}`]+'TAGC'[c]))

Decoder (315 bytes, previously 391)

s=>(n=N=0,C={},` e|ahinost|cdlr|.Ibfmuwy|',gpv|"k|;?ATqx|FHW|j|:`.split`|`.map((g,i)=>([...g].map(c=>(C[j=i+3]=(C[j]||{}))[n++]=c),n*=2)),s=s.replace(/../g,c=>''+((x='ATCG'.search(c[0]))>>1)+(x&1)),eval(`for(r='';N<s.length-1;){for(i=j=0;!(f=C[j]&&C[j][i])&&j++<19;i+=+s[N++]+i);r+=f||String.fromCharCode(i&127)}`))

DNA sequence (5442 bytes)

Below is the DNA sequence which is generated for the example text.

GCTATAATATCGCGCGTAGCGCATATCGATTAATATTACGGCGCCGTATACGATCGGCGCTAGCATTAGCTAATCGGCATTAGCCGATCGTAATATCGATGCCGATTAATGCATGCTAATGCTAATGCTAATATTAATTAGCCGGCCGATTATATAATCGTAATTAATGCATTAGCATATTACGCGTACGGCCGATATGCATCGATCGCGTAATGCGCCGGCATCGGCCGATTAATATTAGCGCATTACGATGCGCTAATATGCCGCGTATATACGCGGCATCGCGATGCTATAATTATAGCCGATTAGCGCGCATTATAATATTAATGCGCTATACGTAATCGTAATTAGCTAATGCTATAATCGCGATGCGCCGATTAGCATGCGCTACGATGCCGCGATGCCGATGCATTACGCGATATGCGCGCGCTAATGCGCCGATGCATTACGATATATCGGCATTACGATCGGCCGATGCCGTAGCGCTACGATTAATATTACGGCGCCGATCGATGCCGCGATGCCGTAATATCGATGCGCTAGCATTAATCGCGTAGCTACGATGCGCCGCGGCATTAATTAATTAGCGCTACGATTATACGATGCCGCGATTACGGCATTATAATGCCGTACGCGTACGGCTAGCGCCGTAGCATATATTACGCGCGATATGCTAATGCGCCGTAGCGCCGCGATATCGATGCGCTAGCATTACGCGGCGCCGATGCATTACGATATATCGTAGCCGCGGCATCGTAATGCCGCGATATGCGCGCTACGCGCGTATAGCATATCGCGTAATTAATCGCGTAGCTAGCATCGATGCGCCGCGGCTATAATATCGTAATCGATTAATTAGCATATCGTACGTACGCGTACGCGTACGCGATTATACGTAATTAGCATTAATCGTAATCGATTATACGATGCCGCGATTACGGCATTATAATGCCGTACGCGTACGGCTAGCGCCGTAGCATATATCGCGTATATACGCGTAATATGCTAATCGATGCTACGGCCGTACGCGATTAGCTAGCATTAGCATCGATTAATCGTAATATCGATATCGTACGGCATGCATTATAGCCGCGTACGCGTACGCGATTATAATCGTAGCGCTACGTAATTATAATCGCGGCTAATCGATTAATGCATTACGATATATCGTAGCCGGCTAATGCATGCTAATATTAATTAGCCGGCCGATTACGCGCGATTACGGCATTACGGCTAATTAGCATATTACGTAATATCGATATCGGCGCGCATGCCGATGCATATATCGCGTATATACGTAGCGCGCTATAGCCGTAGCGCCGCGATATCGATGCGCTAGCATTACGCGCGATCGCGTATATACGTAGCGCGCCGGCATTATAATCGCGCGTATAATATTATAATGCGCCGATCGTAATTAGCGCTACGGCATGCGCTAGCATTAATTACGGCCGATTAGCGCCGTACGCGCGCGGCCGGCGCTAATATTAGCGCGCCGATGCTACGTACGTACGCGATTAATTACGGCCGATTAATGCATTACGATGCGCATCGATCGCGTATAATCGATGCTACGATCGCGTAGCATGCTATAGCTACGATCGCGTATATAATCGCGATATCGCGTAATTAATCGTAATGCATCGATTACGCGTAATATCGTAGCGCATTAATTACGGCATATCGTACGGCTACGATATCGATCGCGCGATATTAATTATAATCGCGTAGCATCGATGCGCTACGTACGATGCGCTAGCCGATATCGATTAATTAGCCGGCCGATCGCGCGATATTAGCATATTAATTATATAATTAATTAGCCGTAATTACGCGCGATGCTACGATGCCGATTACGATGCGCGCATATGCCGCGATATGCGCGCTATACGATCGCGTATACGATGCCGTAGCGCTAATCGCGATGCTATAATTAGCATATCGTAGCTACGATTAGCCGATCGCGATTACGGCATTAATTAGCGCGCATGCCGTATACG
CGCGATCGATTAATCGTAATTACGTAATGCCGGCCGGCATGCCGTACGGCCGATTAATATGCATGCTAATCGCGTACGCGTACGCGGCATCGTACGATGCATTAGCTACGATTAATCGTAATATCGCGTAATATGCTAATCGATGCTAATGCTAATGCTAATATTACGCGCGATCGGCATCGATGCCGATTAATGCGCTAGCGCCGGCGCATCGTACGGCATGCATCGTATAATTAATTATAATCGCGTAATTACGGCATTATAATGCCGTACGCGGCCGGCATTAATTACGGCCGATTAATATTACGGCGCCGATCGATGCCGCGATGCCGTAGCTAATATTAGCGCCGCGCGCGATTATAATCGCGTAATTAATCGCGTAGCATATTAATTATAATGCTAGCATATTAGCGCATCGATGCTATAATTAATCGTAATCGATGCCGATGCCGTACGGCATCGCGATGCCGATCGATGCATCGTAATATCGTACGGCTAATTAGCATGCTAATATCGCGTATAATCGATGCTAGCGCATTAATTAGCTAATTAGCATCGATCGCGTATATAATTAGCATATCGCGTAATTAATCGTAATGCATGCCGCGATATGCTATAATTATACGCGATGCCGTAGCATTATAGCATATGCTAGCATATCGTAGCCGCGTACGTAGCATCGATCGCGTATAATCGATCGGCATTAGCATTACGATCGTAATATGCATGCCGATCGCGTATAATGCCGTAATGCGCCGTAATTATAATCGCGTAATTATACGCGATGCCGATGCGCTAGCCGATTACGCGGCATCGTATAATTACGGCCGATTAATCGGCTATAATGCTAATATGCGCATCGATTACGTAGCCGTATACGCGGCCGTAGCGCATCGATGCCGCGTATATACGCGGCATCGCGATTAATGCATTAATGCATATATTATATAATGCGCCGATTAATATCGATATGCTACGATCGCGTAGCATGCTATAGCTAGCCGCGATATGCTATAATTAGCATATTATAATTAGCATATGCTATATAGCCGATCGTAATATGCGCATTATAATATGCGCGCGCGCTAGCATGCATGCTAATATGCTACGATCGCGTAGCATGCTATAGCTACGATTAATCGCGATTATAATCGCGGCTAATCGATCGGCCGGCTAGCATATTATAGCCGTATACGCGATTAGCGCGCATCGATTAATCGCGATTAATATGCCGTAATTATACGATCGGCGCTAATGCTAGCATATTACGCGATGCCGTAGCGCGCATATATCGCGTATAATGCCGTAATGCGCCGATCGTAATATCGTACGATCGTACGATATTACGGCATTATAATGCCGTACGCGGCCGGCATATCGTATAATATCGTAGCCGATTAATGCATTACGATATGCCGTAATTATAGCTACGGCCGATATCGGCGCATCGGCTAATATGCGCATGCGCCGTAATTATAATCGCGTAATATGCATGCCGGCCGTAGCATCGATCGTACGCGGCATGCGCATGCTACGCGCGGCCGTAGCATCGTAGCGCATCGATTAATGCATTAATGCATATATCGCGTATAATCGATGCGCTAGCCGTACGATCGGCATTACGTACGTAATGCATGCCGCGATATGCGCGCTACGCGCGTAATTACGGCCGCGTATACGCGCGATTATAATCGCGCGTATAATATCGCGTATAATCGATCGGCTAATTACGATCGATGCATGCATCGATTACGTAGCATTATAATCGCGTAATTACGGCGCCGATCGATGCCGCGATGCCGTAATATTAGCGCCGTAATTACGATGCGCCGATTACGGCCGATTACGTAGCATTAATTACGTACGGCCGTAGCATCGATCGCGCGATATATCGTATATATATACGTATAATCGGCTATAATATTACGTAGCCGTAATGCTACGCGCGCGTAATTATATAATATATGCGCTAGCATTACGATGCATATGCATGCCGCGTA
TAATTAATATGCTATAATATTACGCGTAATATGCATGCGCTAATTAGCATCGTACGTACGCGTACGTAGCGCGCTATAGCGCCGATATATGCTATAATATGCCGATATCGCGATGCGCATCGATCGCGTATATAATCGCGATATGCATGCGCATCGTACGGCTATATAATCGATCGGCATCGATGCCGATCGTAATCGTAATTATAATCGCGTAATTATACGCGATGCATTAATTACGTAGCTAATATTACGCGGCTAATATTAATCGGCGCTAGCCGTAATATCGATATGCGCGCCGTAGCATCGTACGTACGCGTACGCGATTAGCGCGCGCGCGCCGATTATAGCCGATATGCATCGATCGCGTATATACGCGTAATATCGATTACGTACGCGTATAATGCTAATGCTATACGATTAATCGTATAGCCGTAATCGATTAATGCATTAATGCATATATATGCGCGCGCTATACGCGTACGCGATATGCATGCCGATCGCGTATAATCGATGCATTAATTAGCTAATTAGCATCGATGCTAGCCGATGCATGCGCATTAATGCGCGCCGTAATTAGCGCGCGCATCGGCGCTACGATTACGCGTAATATGCTATAATATTAATATGCATGCTAATCGCGTACGCGTACGCGGCTAGCGCCGTAATTAGCGCCGCGGCATTACGATATTAGCGCTACGGCATGCGCTAGCCGTAATTAATTACGGCCGATTACGTAGCCGCGATGCCGTAATGCATGCTAATGCATGCGCGCCGCGATTAGCGCGCATGCCGTAATGCATGCTAATGCTAATATGCGCATCGATGCCGCGTATATACGCGGCATCGCGATCGCGTATAATCGATCGTACGGCATGCATTATAGCCGGCATTAATTAGCGCTACGGCGCATTAGCTATACGATATGCTAGCGCGCTAATTAATTAATATGCGCCGATGCCGGCATATCGTATAGCCGGCGCATCGATCGCGCGATATTATAATCGCGTAATATTAGCGCGCCGGCTACGTACGCGCGATGCGCATATTATAGCCGCGGCGCATCGATCGCGTATAATCGATGCCGATGCCGGCCGTACGCGATGCCGTAGCCGGCATATCGATGCGCTAGCATTATAATCGCGTAATATCGTAGCTAATTAATTAATTACGGCCGATTAATATTACGGCGCCGATCGATGCCGCGATGCCGTAGCTAATATTACGCGGCTAATATCGATTAGCGCATTAGCTACGATTAATCGGCGCTAGCCGTAGCTAATATTACGCGCGATCGGCGCATATGCGCGCCGATCGCGATTAGCATCGGCGCTAGCATGCCGTACGTACGCGTAATTAGCCGGCCGATTATACGATGCCGCGATATGCTATAATATCGTAGCCGTAGCTACGCGCGGCATCGCGTATACGCGCGCGTAGCTAAT

Demo

The snippet below includes some demonstration code.

let e =
s=>(n=0,C={},` e|ahinost|cdlr|.Ibfmuwy|',gpv|"k|;?ATqx|FHW|j|:`.split`|`.map((g,i)=>([...g].map(c=>C[c]=('00'+(n++).toString(2)).slice(-i-3)),n*=2)),s=s.replace(/./g,c=>C[c]||(524160|c.charCodeAt()).toString(2)),s.length&1&&(s+=0),s.replace(/../g,c=>'ATCG'[c=+`0b${c}`]+'TAGC'[c]))

let d =
s=>(n=N=0,C={},` e|ahinost|cdlr|.Ibfmuwy|',gpv|"k|;?ATqx|FHW|j|:`.split`|`.map((g,i)=>([...g].map(c=>(C[j=i+3]=(C[j]||{}))[n++]=c),n*=2)),s=s.replace(/../g,c=>''+((x='ATCG'.search(c[0]))>>1)+(x&1)),eval(`for(r='';N<s.length-1;){for(i=j=0;!(f=C[j]&&C[j][i])&&j++<19;i+=+s[N++]+i);r+=f||String.fromCharCode(i&127)}`))

function encode() {
  var txt, dna;

  if(txt = document.getElementsByTagName('textarea')[0].value) {
    dna = e(txt);
    document.getElementsByTagName('textarea')[0].value = '';
    document.getElementsByTagName('textarea')[1].value = dna;
    document.getElementsByTagName('div')[0].innerHTML = 'DNA length: ' + dna.length + ' (' + (dna.length / txt.length).toFixed(2) + ' nucleotides per character)';
  }
}

function decode() {
  var txt, dna;

  if(dna = document.getElementsByTagName('textarea')[1].value) {
    txt = d(dna);
    document.getElementsByTagName('textarea')[0].value = txt;
    document.getElementsByTagName('textarea')[1].value = '';
    document.getElementsByTagName('div')[0].innerHTML = 'Text length: ' + txt.length;
  }
}
textarea {font-size:10px;font-family:Arial;width:400px;height:70px}
<textarea>I have a friend who's an artist and has sometimes taken a view which I don't agree with very well. He'll hold up a flower and say "look how beautiful it is," and I'll agree. Then he says "I as an artist can see how beautiful this is but you as a scientist take this all apart and it becomes a dull thing," and I think that he's kind of nutty. First of all, the beauty that he sees is available to other people and to me too, I believe. Although I may not be quite as refined aesthetically as he is ... I can appreciate the beauty of a flower. At the same time, I see much more about the flower than he sees. I could imagine the cells in there, the complicated actions inside, which also have a beauty. I mean it's not just beauty at this dimension, at one centimeter; there's also beauty at smaller dimensions, the inner structure, also the processes. The fact that the colors in the flower evolved in order to attract insects to pollinate it is interesting; it means that insects can see the color. It adds a question: does this aesthetic sense also exist in the lower forms? Why is it aesthetic? All kinds of interesting questions which the science knowledge only adds to the excitement, the mystery and the awe of a flower. It only adds. I don't understand how it subtracts.</textarea><br>
<button onclick="encode()">Text -> DNA</button><button onclick="decode()">DNA -> Text</button><br><textarea></textarea><div></div>

Arnauld

Posted 2016-08-30T08:41:30.863

Reputation: 111 334

match...map...join is the same as replace, no? – Neil – 2016-09-01T08:31:39.807

+`0b${c}` is shorter than parseInt(c,2), although 'ATCG'[c=+`0b${c}`]+'TAGC'[c] might be shorter still. – Neil – 2016-09-01T08:35:14.473

'ATCG'.search(c[0]) is shorter than 'ATTACGGC'.indexOf(c)/2. Also I think S.indexOf(c,n)==n would work instead of substr. – Neil – 2016-09-01T08:40:42.413

Oh, one final thing I overlooked: c.charCodeAt() suffices without the 0. – Neil – 2016-09-01T09:39:02.687

One, um, post-final thing: use `s for your encoding string, saves having to quote the " character. – Neil – 2016-09-01T09:54:54.450

@Neil - Thanks! I've also rewritten the decoder to get rid of Object.keys(), substr(), parseInt(), indexOf(), etc. – Arnauld – 2016-09-01T11:04:39.250

The [...s].map(...).join`` might also be better as a replace, but I didn't count it to check. – Neil – 2016-09-01T11:09:29.147

2

Python 3, 15324 + 122 + 211 = 15657 bytes

This is a very wasteful method which converts each character code to a three-digit decimal number and assigns each of those three digits a two-unit code.

Encoder, 122 bytes

u='AAAATTTCCGTTTTAAAGGCATCGTCGCGGTAGCAGCGCC'
o=''
for i in o.join("%03d"%ord(i)for i in input()):o+=u[int(i)::10]
print(o)

See the encoded version of the example text here

Decoder, 211 bytes

u=['ATAT','ATTA','ATCG','ATGC','TATA','TACG','TAGC','CGCG','CGGC','GCGC']
l=input()
I=u.index
s=''
for j in[l[i:i+12]for i in range(0,len(l),12)]:s+=chr(int(str(I(j[:4]))+str(I(j[4:8]))+str(I(j[-4:]))))
print(s)

Beta Decay

Posted 2016-08-30T08:41:30.863

Reputation: 21 478

2

Ruby, 97 + 96 + 5152 = 5345 bytes

Encoder program requires the flags -nrzlib.

Zlib::Deflate.deflate($_).bytes{|b|('%04s'%b.to_s(4)).chars{|c|$><<"ATTACGGC"[2*c.to_i,2]}}

Decoder program requires the flags -przlib.

gsub(/../){"ATTACGGC".index($&)/2}
gsub(/..../){$&.to_i(4).chr}
$_=Zlib::Inflate.inflate$_

Value Ink

Posted 2016-08-30T08:41:30.863

Reputation: 10 608

1

Python 2, 12770 + 204 + 271 = 13245

Encoder:

from itertools import*
D=[''.join(p)for p in permutations('AT TA CG GC'.split())]
e=lambda x:''.join('AT'+D[c-32]if c<56else'TA'+D[c-56]if c<80else'CG'+D[c-80]if c<104else'GC'+D[c-104]for c in map(ord,x))

Decoder:

from itertools import*
D=[''.join(p)for p in permutations('AT TA CG GC'.split())]
n=D.index
d=lambda x:''.join(chr(n(c[2:])+32)if'AT'==c[:2]else chr(n(c[2:])+56)if'TA'==c[:2]else chr(n(c[2:])+80)if'CG'==c[:2]else chr(n(c[2:])+104)for c in map(''.join,zip(*[iter(x)]*10)))

Every character is encoded to a 10-character (5-unit) string, with the first unit indicating which quarter of the printable character range it belongs to, and the last four units indicating which character in that quarter is represented.
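The quarter-plus-permutation layout can also be written out less tersely. The following Python 3 sketch mirrors the golfed code above as I understand it; `ORDERINGS`, `encode` and `decode` are names chosen for this illustration. The four units have 24 orderings, so four quarters cover 96 values, enough for the 95 printable ASCII characters.

```python
from itertools import permutations

UNITS = ["AT", "TA", "CG", "GC"]
# All 24 orderings of the four units, each joined into an 8-character string.
ORDERINGS = ["".join(p) for p in permutations(UNITS)]

def encode(text: str) -> str:
    out = []
    for ch in text:
        quarter, offset = divmod(ord(ch) - 32, 24)   # quarters start at 32, 56, 80, 104
        out.append(UNITS[quarter] + ORDERINGS[offset])
    return "".join(out)

def decode(dna: str) -> str:
    out = []
    for i in range(0, len(dna), 10):                 # 10 nucleotides per character
        chunk = dna[i:i + 10]
        out.append(chr(32 + 24 * UNITS.index(chunk[:2]) + ORDERINGS.index(chunk[2:])))
    return "".join(out)

print(encode(" "))
print(decode(encode("It only adds.")))
```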

Try it here.

atlasologist

Posted 2016-08-30T08:41:30.863

Reputation: 2 945

1

PHP, 5106 + 297 + 161 = 5564

encoder

$t=str_replace("=","",base64_encode((gzdeflate($_GET["e"]))));$c="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";$e=["AT","TA","CG","GC"];$n="";foreach(str_split($t)as$b){$p=str_pad(base_convert(strpos($c,$b),10,4),3,"0",STR_PAD_LEFT);$n.=$e[$p[0]].$e[$p[1]].$e[$p[2]];}echo $n;

decoder

$e=["AT"=>0,"TA"=>1,"CG"=>2,"GC"=>3];$a=str_split(strtr($_GET["d"],$e),3);$o="";foreach($a as $v)$o.=$c[base_convert($v,4,10)];echo gzinflate(base64_decode($o));

Jörg Hülsermann

Posted 2016-08-30T08:41:30.863

Reputation: 13 026

1

Node.js (5.6.0), 5104 + 125 + 122 = 5351

I was originally working in JavaScript, but Node.js provides compression much more easily. Thanks Neil for helping trim out 5 bytes on the encoder!

Encoder (125 bytes):

s=>[...require('zlib').deflateRawSync(s)].map(a=>('000'+a.toString(4)).slice(-4).replace(/./g,b=>'ATCG'[b]+'TAGC'[b])).join``

Decoder (122 bytes):

s=>require('zlib').inflateRawSync(Buffer(s.replace(/../g,a=>'ATCG'.search(a[0])).match(/..../g).map(a=>parseInt(a,4))))+''
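For comparison, the raw-deflate approach used by the two functions above can be sketched in Python. This is an assumption-laden illustration rather than a port: it uses a negative `wbits` value to request a raw deflate stream from Python's `zlib`, and the helper names are invented here. A raw stream omits the 2-byte zlib header and the 4-byte Adler-32 checksum, which is consistent with this answer's 5104 being 48 characters (6 bytes at 8 nucleotides each) shorter than the 5152 of the zlib-based answers.

```python
import zlib

def deflate_raw(data: bytes) -> bytes:
    # wbits=-15 asks zlib for a raw deflate stream (no header, no checksum).
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

def encode(text: str) -> str:
    out = []
    for byte in deflate_raw(text.encode("utf-8")):
        for shift in (6, 4, 2, 0):                   # four base-4 digits per byte
            d = (byte >> shift) & 3
            out.append("ATCG"[d] + "TAGC"[d])        # 0=AT, 1=TA, 2=CG, 3=GC
    return "".join(out)

def decode(dna: str) -> str:
    digits = ["ATCG".index(dna[i]) for i in range(0, len(dna), 2)]
    packed = bytes(digits[i] * 64 + digits[i + 1] * 16 + digits[i + 2] * 4 + digits[i + 3]
                   for i in range(0, len(digits), 4))
    return zlib.decompress(packed, -15).decode("utf-8")

sample = "the mystery and the awe of a flower"
print(decode(encode(sample)) == sample)
print(len(zlib.compress(sample.encode("utf-8"), 9)) - len(deflate_raw(sample.encode("utf-8"))))
```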

DNA Sequence (5104 bytes):

TATATATATATATAATGCATCGTACGATGCCGGCTACGGCATGCATATATATGCATGCGCGCTAATTATATATACGATCGATCGGCCGTAGCATGCATATCGTAGCGCATCGATATATTACGTAGCTAGCATATGCCGCGATATCGTATAGCCGCGATCGCGGCCGTAATATGCGCGCTAATATCGGCATGCTACGATCGATGCATGCCGGCATTAATTACGCGTATACGTATAGCTAATTAATTATAATCGCGCGATTAGCCGGCGCGCGCTAGCGCTAATTAGCGCCGCGTATATAGCCGATCGTATACGTAATCGGCATTAATATCGATCGTAGCATCGGCTATACGGCTAATCGATGCGCCGTATAATTAATTATAATGCCGGCCGATATGCATGCATATTATACGGCCGCGCGTATAATGCATCGATTAGCCGTACGCGTAATTAGCTATACGTAATCGTACGTAGCCGCGTATACGTAATATATCGCGTAATTATAATCGATCGGCTACGCGCGGCGCCGATATGCTAGCATCGATCGTAATATGCTAATCGTATACGTATATACGTACGCGATCGGCATGCATGCATCGCGGCTATACGTACGGCCGATATGCATTAGCATATCGTAGCTAATCGATGCGCCGGCGCATATCGATCGATGCGCTAATATGCCGATTAATGCCGTAGCTATACGCGATCGTATAATGCGCGCATCGATGCATGCCGTAATTACGCGGCTACGATTACGCGTACGATGCTATAGCTACGCGTAATTAATTAGCTAGCATCGGCCGATTACGGCCGCGGCTACGTAGCATGCTAATCGCGTAATTATAGCGCCGGCGCATGCTAGCATGCTAATCGCGATGCTAATTAATCGTAATTACGCGCGCGCGTAATCGGCCGCGATGCCGCGTATAATTACGTAATGCCGTATAGCTAATCGATGCCGCGTACGCGGCGCTACGATGCCGCGATATCGCGTACGATCGTAGCTAGCATATTAATATTATAGCCGGCCGCGTAGCATGCATCGCGATTATATACGCGCGGCGCATGCTATAATTAATTAATGCTAATTAATCGCGATGCCGCGCGGCCGATTACGCGCGTAGCCGTATACGTAGCATCGGCGCATCGGCCGCGTAGCTATAATGCGCCGGCTAATCGGCTACGGCATATTAATGCCGCGATCGATCGTACGGCGCTATAATGCCGGCCGCGTAGCCGCGTATAATTAATGCCGATTAATCGCGGCGCTATATAATATATTATAGCGCGCATTATATAGCGCCGATGCTAGCGCTATACGGCGCTACGTAATTATAATATTATATAGCATGCTAGCCGCGATTATAGCATCGATGCTAATGCTACGCGCGCGTACGCGGCTACGATTAATTAGCTAATCGATATATGCTACGATTATATACGGCCGTAATTACGCGCGTAGCCGCGTAATGCCGTAGCATATCGTACGATATATTATACGGCGCTAATTACGCGATCGCGCGATTACGATCGTAGCATTAGCCGTAATTAATATATTATAGCATTAGCATTACGCGATCGCGGCATCGCGCGTAATGCTATAGCATCGCGGCTATAGCCGTAGCTATATAATGCATGCCGGCGCTAGCTAGCGCATATATTACGGCATATGCATTAGCGCATGCCGATATGCTACGATATCGGCTAGCCGATCGTATACGCGATCGGCCGCGTATAATCGCGCGGCGCTACGTATAGCTACGCGCGTAATTATAGCGCCGGCATTACGCGATATATGCCGTAGCATGCCGGCCGCGATTAGCCGCGTAATATATATTACGCGTAGCTAGCTAATTACGGCTACGGCTATAGCCGGCCGATTATAATTAATATTACGGCCGCGGCTATAGCATATATATGCGCATGCATTACGATTATATAGCATGCATATATCGATATCGTAATCGATTATAATATTATACGATCGCGCGATTACGATTATA
TACGTATAGCATATTAGCTACGTATAGCATTATAGCATATATGCTAGCTAGCCGGCTAATTAATATATGCCGCGATATATATGCGCATTAGCATGCCGATCGCGTAGCGCATCGATTAGCATATCGGCATGCGCGCCGTAATGCTAATGCTATATAATCGCGCGGCGCGCCGGCCGGCGCCGGCATATGCATATATATGCCGGCATGCATGCCGATTATATAGCGCGCCGCGTATATAGCCGCGATGCCGGCATTAGCATCGGCATCGCGGCTAGCGCATCGATCGTACGTACGTATAATTAGCATATCGCGCGCGGCATGCCGCGATTACGTAGCTAGCATTAATATATATCGATTAGCGCATGCTAATATATGCCGATATATATATTATAGCCGATCGGCGCATTATAGCCGATTAATTAATCGCGATTACGTACGGCATTATATACGTATACGCGCGTATAGCGCATATATTAATATCGATATGCGCGCATTATATATAGCCGATTACGTAATGCCGGCGCCGCGATGCATTAATCGATGCTAATATATTATACGTAATATCGGCTACGCGTATAGCATATATTACGCGATTACGATCGGCGCGCGCGCATGCTAGCATATCGATGCGCGCCGTAGCCGATCGCGCGGCATTAGCATGCGCTATAATCGATATATATTAGCCGATCGTATACGGCTACGCGCGTAATGCATTAGCCGTAGCATGCCGATTATAGCATATCGCGGCGCATATCGCGGCTACGGCATTAATCGATTATAATCGATGCATTAGCGCCGTAATGCCGATCGATTAGCATCGCGGCTATACGATCGGCTATAGCATTAATATATATTATACGCGGCATATATTAGCGCATTAGCTACGTAATATGCGCCGATTATACGGCCGGCATGCGCTACGGCATATTAGCCGCGTAATCGTACGCGATATTACGCGCGTAATGCCGTAGCGCATCGGCATATATCGCGATGCTACGTAATGCATGCATGCTAATCGATGCCGCGATCGATGCGCTAGCATCGTATAATATCGGCTAATCGTATACGTATATATAGCGCGCTATACGTAATGCCGGCTATATACGGCCGGCGCATGCGCATATGCATCGGCTAATTACGCGATCGTACGATCGATATATGCTACGGCGCGCATATCGGCTAGCCGCGCGCGGCCGGCTAATGCCGCGCGATATCGGCGCTATACGGCCGTACGATGCCGCGCGATATCGATGCTATAATTACGCGTAATCGATATGCGCTAGCGCCGATCGTATAGCTACGTAGCGCATTACGGCATGCTACGCGGCATGCATTAATCGGCCGATCGGCGCCGTAGCTACGATATCGTACGGCGCCGGCGCATTAGCTAGCGCGCGCGCCGCGGCATGCTATAATATTACGTACGCGTACGTAATATTAGCATGCATTAATTAGCCGGCATTATAGCTACGCGTACGCGATGCATTATACGTAGCGCTAGCATCGCGCGCGCGTATATAATGCCGCGTATACGATTAGCTATAGCCGGCCGCGCGGCTAGCATGCCGATCGATGCTAATCGGCCGTAATTACGATATTATATATATATATAGCTAGCGCATGCCGTACGGCCGTACGGCATATATTAATTAATTATAGCGCTAGCTACGATTATAGCTATATAGCTACGTAATCGCGGCATCGTAATCGGCGCTAGCCGCGATGCTAGCCGATCGCGGCCGGCCGCGTAATATGCCGGCGCCGTAATGCCGATCGTATATACGGCATCGCGGCCGCGATGCATGCCGATATCGTACGCGATATCGCGGCTATAGCCGATATCGTATACGCGTAATGCGCATGCTACGATATGCCGCGCGTATATAGCCGATCGGCCGCGTAATATTATAATCGGCATGCATATGCGCCGCGCGTACGGCGCTAATATCGCGCGATTAATTAATATATATTAGCATCGCGATGCGCATTATAATGCTACG
CGGCGCCGGCGCCGATCGTATATAATGCCGGCCGCGATATATTAGCTATACGCGCGCGCGGCGCGCTACGCGTATAGCGCGCATATCGATATTAGCATGCTAATCGTAGCATGCATATCGTATACGCGCGATGCGCCGCGATGCATCGGCGCGCGCATGCCGGCGCGCGCCGCGCGGCATTATACGTAATCGTAGCGCGCGCCGATATGCTAATTAGCGCCGCGTAGCATGCTACGGCCGCGGCCGGCCGGCGCGCCGCGGCGCATATCGCGGCATCGCGGCTAATATCGCGATGCCGCGCGGCGCCGATCGTAGCTAGCTAGCGCGCATCGATATATATGCGCGCATGCCGCGTAGCTAGCGCATCGTAATTATAGCATCGCGCGCGGCATCGCGGCGCGCGCTAATTACGTAGCGCTACGCGATTACGCGATTAATGCGCGCCGGCATCGGCATTAGCGCATGCATTATATATATACGCGCGTACGGCGCTACGATATCGCGGCTATAATATGCTAGCGCCGCGATATATCGGCGCATCGTAATATGCGCTAATATATTAGCGCTAATATGCATCGGCATCGATCGGCTACGTATAATATGCGCGCGCATATATCGATATTAGCCGCGTACGTATACGTACGTATATAATGCTAATGCGCATTATAGCCGATTACGTACGATATGCGCGCATTAATCGTATATAGCTATACGATCGCGGCTATACGATATGCTAGCCGTATAGCGCTAGCATATATTAATATCGGCATATGCTAGCGCCGGCCGCGGCTATATACGGCATGCGCGCTACGGCGCTACGCGCGGCGCGCCGTATAATTATACGCGTAGCCGGCGCATATATCGGCCGTAGCCGGCATATGCGCATCGGCGCTAGCTATAGCTAATGCGCTAGCTACGTATACGTAGCCGTAATGCGCCGATGCGCATTAGCATGCTAGCCGCGCGTATACGTACGTATACGATCGCGCGGCGCTATACGCGGCGCCGTACGATGCGCGCCGATATGCATATCGATATATCGGCATGCGCTATACGTAGCATGCATGCATGCTATACGTACGATGCGCGCGCGCGCATATATTA

Mwr247

Posted 2016-08-30T08:41:30.863

Reputation: 3 494

I think 'ATCG'[d]+'TAGC'[d] saves you some bytes. – Neil – 2016-09-01T09:50:49.140

@Neil I knew there must be some way to trim that part down, but I couldn't figure out how. Thank you! – Mwr247 – 2016-09-01T14:08:13.147