RNA to Protein Translation

18

5

RNA, like DNA, is a molecule found in cells encoding genetic information. It consists of nucleotides, which are represented by the bases adenine (A), cytosine (C), guanine (G) and uracil (U).* A codon is a sequence of three nucleotides.

Proteins are large molecules which perform a vast array of functions, such as keratin which is found in hair and nails and hemoglobin which carries oxygen in blood cells. They are made up of amino acids, which are encoded as codons in RNA molecules. Sometimes different codons may encode for the same amino acid. Each amino acid is commonly represented by a single letter, for example H stands for histidine.

Given a sequence of ACGU, can you translate it into the corresponding protein string?

* DNA consists of ACGT, where the T is thymine. During DNA to RNA transcription, thymine is replaced by uracil.


Input

Input will be a single string consisting of only the characters ACGU in upper case. You may write either a function or a full program for this challenge.

Output

You may choose to output via either printing or returning a string (the latter choice is only available in the case of a function).

Translation should begin at a start codon (AUG, represented as M) and end at a stop codon (one of UAA, UAG or UGA, represented as *). There are four cases where input may be invalid:

  • The input does not begin with a start codon
  • The input does not end with a stop codon
  • The input's length isn't a multiple of 3
  • The input contains a stop codon somewhere other than at the end

In all of these cases, Error should be outputted. Note that, unlike stop codons, start codons may appear after the start of the string.

Otherwise, you should convert each codon into its respective amino acid via the following RNA codon table:

* UAA UAG UGA
A GCU GCC GCA GCG
C UGU UGC
D GAU GAC
E GAA GAG
F UUU UUC
G GGU GGC GGA GGG
H CAU CAC
I AUU AUC AUA
K AAA AAG
L UUA UUG CUU CUC CUA CUG
M AUG
N AAU AAC
P CCU CCC CCA CCG
Q CAA CAG
R CGU CGC CGA CGG AGA AGG
S UCU UCC UCA UCG AGU AGC
T ACU ACC ACA ACG
V GUU GUC GUA GUG
W UGG
Y UAU UAC

...and output the translated string.

Examples

Invalid cases:

<empty string> -> Error
AUG -> Error
UAA -> Error
AUGCUAG -> Error
AAAAAAA -> Error
GGGCACUAG -> Error
AUGAACGGA -> Error
AUGUAGUGA -> Error
AUGUUUGUUCCGUCGAAAUACCUAUGAACACGCUAA -> Error

Valid cases:

AUGUGA -> M*
AUGAGGUGUAGCUGA -> MRCS*
AUGGGUGAGAAUGAAACGAUUUGCAGUUAA -> MGENETICS*
AUGCCAGUCGCACGAUUAGUUCACACGCUCUUGUAA -> MPVARLVHTLL*
AUGCUGCGGUCCUCGCAUCUAGCGUUGUGGUUAGGGUGUGUAACUUCGAGAACAGUGAGUCCCGUACCAGGUAGCAUAAUGCGAGCAAUGUCGUACGAUUCAUAG -> MLRSSHLALWLGCVTSRTVSPVPGSIMRAMSYDS*
AUGAAAAACAAGAAUACAACCACGACUAGAAGCAGGAGUAUAAUCAUGAUUCAACACCAGCAUCCACCCCCGCCUCGACGCCGGCGUCUACUCCUGCUUGAAGACGAGGAUGCAGCCGCGGCUGGAGGCGGGGGUGUAGUCGUGGUUUACUAUUCAUCCUCGUCUUGCUGGUGUUUAUUCUUGUUUUAA -> MKNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVYYSSSSCWCLFLF*

Edit: Added more test cases

Scoring

This is code golf, so the code in the fewest bytes wins.

Note: I'm no expert in molecular biology, so feel free to correct me if I've misstated anything :)

Sp3000

Posted 2014-11-30T13:26:07.213

Reputation: 58 729

1A proper translator should be able to find the open reading frame in any string, not just those that start with AUG! – canadianer – 2014-12-01T06:25:41.350

@canadianer Ahaha yeah I initially considered that, but I didn't want to make the question too complicated by bringing in open reading frames (or even translating multiple proteins from a single string) :) – Sp3000 – 2014-12-01T06:51:23.993

The empty string would be a useful test case, because it will break some approaches for testing that the decoded sequence starts with M and ends with *. – Peter Taylor – 2014-12-01T11:44:02.703

@PeterTaylor Added, along with a few more short test cases :) – Sp3000 – 2014-12-01T11:56:45.037

1If you wanted to be a real pain, you could use DNA instead of RNA, so you have backwards reading frames too. – user137 – 2014-12-02T18:45:36.133

Answers

6

CJam (97 93 92 91 bytes)

q"GACU"f#3/{4b"GGEDAAVVRSKNTTMIRRQHPPLLWC*YSSLF"{_s"MW""I*"er}%=}%s_'*/(1<"M"=*Qa=\"Error"?

This is a port of my GolfScript solution with a slightly tweaked hash function because to my surprise one thing which CJam hasn't borrowed from GolfScript is treating strings as arrays of integers.

6 bytes saved thanks to suggestions by Optimizer (including two bytes from something which I thought I'd tried and didn't work - huh).

Peter Taylor

Posted 2014-11-30T13:26:07.213

Reputation: 41 901

1q"GACU"f#3/{4b"GGEDAAVVRSKNTTMIRRQHPPLLWC*YSSLF"{_s"MW""I*"er}%=}%s_'*/(1<"M"=*Q="Error"@? - 90 – Optimizer – 2014-12-01T18:45:11.763

@Optimizer, some of that seems to be an improvement. However, it gives a runtime error, and the comparison to Q rather than [Q] is just incorrect. – Peter Taylor – 2014-12-02T11:06:53.097

>

  • you have not copied the code correctly, when the code spans multiple lines in comments, it gets a weird unicode character at the line break. You will have to manually type the code urself then. 2. See the logic, it has been altered to correctly work, thus [Q] to Q change is correct.
  • < – Optimizer – 2014-12-02T11:11:16.930

    @Optimizer, try test case AUGUAGUGA – Peter Taylor – 2014-12-02T11:18:47.683

    1Ah okay. Still [Q] -> Qa – Optimizer – 2014-12-02T11:27:34.627

    You can always convert a string to an array with :i – Ypnypn – 2014-12-02T14:29:26.490

    @Ypnypn, thanks. It turns out that {i)7&2/}% is one char worse than "GACU"f#, but I'll bear it in mind for future golfing. – Peter Taylor – 2014-12-02T14:43:49.920

    10

    JavaScript (ES6) 167 177 chars encoded in UTF8 as 167 177 bytes

    ... so I hope everyone is happy.

    Edit In fact, no need for a special case for last block too short. If the last 2 (or 1) chars are not mapped, the result string does not end with '*' and that gives error anyway.

    F=s=>/^M[^*]*\*$/.test(s=s.replace(/.../g,x=>
    "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
    [[for(c of(r=0,x))r=r*4+"ACGU".search(c)]|r]))?s:'Error'
    

    Explained

    Each char in a triplet can have 4 values, so there are exactly 4^3 == 64 triplets. The C function map each triplet to a number between 0 and 63. No error check needed as input chars are only ACGU.

    C=s=>[for(c of(r=0,s))r=r*4+"ACGU".search(c)]|r
    

    Each triplet maps to an aminoacid identified by a single char. We can encode this in a 64 character string. To obtain the string, start with the Codon Map:

    zz=["* UAA UAG UGA","A GCU GCC GCA GCG","C UGU UGC","D GAU GAC","E GAA GAG"
    ,"F UUU UUC","G GGU GGC GGA GGG","H CAU CAC","I AUU AUC AUA","K AAA AAG"
    ,"L UUA UUG CUU CUC CUA CUG","M AUG","N AAU AAC","P CCU CCC CCA CCG","Q CAA CAG"
    ,"R CGU CGC CGA CGG AGA AGG","S UCU UCC UCA UCG AGU AGC","T ACU ACC ACA ACG"
    ,"V GUU GUC GUA GUG","W UGG","Y UAU UAC"]
    a=[],zz.map(v=>v.slice(2).split(' ').map(x=>a[C(x)]=v[0])),a.join('')
    

    ...obtaining "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"

    So we can scan the input string and use the same logic of the C function to obtain the code 0..63, and from the code the aminoacid char. The replace function wil split the input string in 3 chars blocks, eventually leaving 1 or 2 chars not managed (that will give an invalid result string, not ending in '*').

    At last, check if the encoded string is valid using a regexp: it must start with 'M', must not contain '*' and must end with '*'

    Test In FireBug/FireFox console

    ;['AUGCUAG','GGGCACUAG','AUGAACGGA','AUGUAGUGA','AAAAAAA',
    'AUGUUUGUUCCGUCGAAAUACCUAUGAACACGCUAA',
    'AUGAGGUGUAGCUGA','AUGCCAGUCGCACGAUUAGUUCACACGCUCUUGUAA',
    'AUGCUGCGGUCCUCGCAUCUAGCGUUGUGGUUAGGGUGUGUAACUUCGAGAACAGUGAGUCCCGUACCAGGUAGCAUAAUGCGAGCAAUGUCGUACGAUUCAUAG']
    .forEach(c=>console.log(c,'->',F(c)))
    

    Output

    AUGCUAG -> Error
    GGGCACUAG -> Error
    AUGAACGGA -> Error
    AUGUAGUGA -> Error
    AAAAAAA -> Error
    AUGUUUGUUCCGUCGAAAUACCUAUGAACACGCUAA -> Error
    AUGAGGUGUAGCUGA -> MRCS*
    AUGCCAGUCGCACGAUUAGUUCACACGCUCUUGUAA -> MPVARLVHTLL*
    AUGCUGCGGUCCUCGCAUCUAGCGUUGUGGUUAGGGUGUGUAACUUCGAGAACAGUGAGUCCCGUACCAGGUAGCAUAAUGCGAGCAAUGUCGUACGAUUCAUAG -> MLRSSHLALWLGCVTSRTVSPVPGSIMRAMSYDS*
    

    edc65

    Posted 2014-11-30T13:26:07.213

    Reputation: 31 086

    Nice idea! Was just thinking of doing this. You beat me to it! – Optimizer – 2014-11-30T17:45:04.220

    8

    C,190 bytes (function)

    f(char*x){int a=0,i=0,j=0,s=1;for(;x[i];i%3||(s-=(x[j++]=a-37?a-9?"KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP"[a/2]:77:87)==42,x[j]=a=0))a=a*4+(-x[i++]/2&3);puts(*x-77||i%3||s||x[j-1]-42?"Error":x);}
    

    199 194 bytes (program)

    a,i,j;char x[999];main(s){for(gets(x);x[i];i%3||(s-=(x[j++]=a-37?a-9?"KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP"[a/2]:77:87)==42,x[j]=a=0))a=a*4+(-x[i++]/2&3);puts((*x-77||i%3||s||x[j-1]-42)?"Error":x);}
    

    Saved a few bytes by improving hash formula.

    Here's a fun test case:

    AUGUAUCAUGAGCUCCUUCAGUGGCAAAGACUUGACUGA --> MYHELLQWQRLD* 
    

    Explanation

    The triplet of letters is converted to a base 4 number. Each letter is hashed as follows.

    x[i]       ASCII code       Hashed to (-x[i]/2&3) 
    A        64+ 1  1000001            00   
    G        64+ 7  1000111            01
    U        64+21  1010101            10   
    C        64+ 3  1000011            11
    

    This gives a number in the range 0..63. The idea now is to use a lookup table similar to the ones used by edc65 and Optimizer. However, the hash is designed so that G and A are next to each other and U and C are next to each other.

    Looking at the table at https://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table, we see that with the letters ordered in this way, generally the last bit can be ignored. Only a 32-character lookup table is needed, except in two special cases.

    See below first two letters, and corresponding amino acids (where 3rd letter is G/A, and where 3rd letter is U/C). Corrections for the two special cases that do not fit the 32-character table are hardcoded.

         A/G U/C          A/G U/C            A/G U/C         A/G U/C  
    AAX> K   N       AGX> R   S         AUX> I   I      ACX> T   T
    GAX> E   D       GGX> G   G         GUX> V   V      GCX> A   A
    UAX> *   Y       UGX> *   C         UUX> L   F      UCX> S   S
    CAX> Q   H       CGX> R   R         CUX> L   L      CCX> P   P
    
    Corrections for special cases (where last bit cannot be ignored)
    AUG 001001=9 -->  M
    UGG 100101=37-->  W
    

    Commented code

    In golfed version the i%3 code is in the increment position of the for bracket, but it is moved into a more readable position in the commented code.

    a,i,j;char x[999];                                                             //Array x used for storing both input and output. i=input pointer, j=output pointer.
    main(s){                                                                       //s is commandline string count. if no arguments, will be set to zero. Will be used to count stops.
      for(gets(x);x[i];)                                                           //Get a string, loop until end of string (zero byte) found
        a=a*4+(-x[i++]/2&3),                                                       //Hash character x[i] to a number 0-3. leftshift any value already in a and add the new value. Increment i.
        i%3||(                                                                     //if i divisible by 3,
          s-=(x[j++]=a-37?a-9?"KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP"[a/2]:77:87)==42,  //lookup the correct value in the table. for special cases a=24 and a=32 map to 'M' and 'W' (ASCII 77 and 87). If character is '*' (ASCII42) decrement s.   
          x[j]=a=0                                                                 //reset a to 0. clear x[j] to terminate output string.                                                     
        );   
      puts((*x-77||i%3||s||x[j-1]-42)?"Error":x);                                  //if first character not M or i not divisible by 3 or number of stops not 1 or last character not * print "Error" else print amino acid chain.
    }
    

    Level River St

    Posted 2014-11-30T13:26:07.213

    Reputation: 22 049

    If only there was an O! I added a test case for MGENETICS* though, because it's the most thematic word I could make :P – Sp3000 – 2014-12-01T04:04:48.037

    6

    CJam, 317 121 104 bytes

    q3/{{"ACGU"#}%4b"KN T RS IIMI QH P R L ED A G V *Y S *CWC LF"S/{_,4\/*}%s=}%_('M=\)'*=\'*/,1=**\"Error"?
    

    This can still be golfed further.

    Updated the mapping mechanism to the one used in edc65's answer. Even though I came up with this on my own, he beat me to it :)

    UPDATE : Shortened the codon table map by observing the pattern in it.

    Try it online here

    Optimizer

    Posted 2014-11-30T13:26:07.213

    Reputation: 25 836

    This breaks if the input is the empty string. – Peter Taylor – 2014-12-01T16:47:59.807

    @PeterTaylor A rule which was added on your suggestion after the answer was posted ;) . I will update the code soon. – Optimizer – 2014-12-01T16:49:43.937

    1It wasn't a rule that was added, it was a test case which was already implicitly required by the rules. – Peter Taylor – 2014-12-01T16:50:22.640

    3

    GolfScript (103 bytes)

    {)7&2/}%3/{4base'GGEDAAVVRSKNTTMIRRQHPPLLWC*YSSLF'{..'MW'?'I*'@),+=}%=}%''+.'*'/(1<'M'=*['']=*'Error'or
    

    Online demo (NB doesn't include the two largest test cases because it needs to run in 15 seconds).

    Dissection

    As Steve Verrill pointed out in the sandbox, the lookup table can be reduced to 32 elements plus two special cases. It turns out that the special cases both involve characters (M and W respectively) which only occur once, and with the right mapping of the characters to base 4 digits it's possible to build the full 64-element lookup table from the 32 elements by doing a duplicate-and-tr:

    'GGEDAAVVRSKNTTMIRRQHPPLLWC*YSSLF'  # 32-element lookup table
    {                                   # Map over the 32 elements...
      .                                 #   Duplicate the element
      .'MW'?'I*'@),+=                   #   Apply tr/MW/I*/ to the duplicate
    }%
    

    Then once we've done the decode, the validation allows for many approaches. The shortest I've found is

    .'*'/       # Duplicate and split the copy around '*' retaining empty strings
    (1<'M'=*    # Pull out the first string from the split (guarantee to exist even if input is
                # the empty string); if it starts with 'M' leave the rest of the split intact;
                # otherwise reduce it to the empty array
    ['']=       # Check whether we have ['']. If so, the split produced [prefix ''] where
                # prefix begins with 'M'. Otherwise we want an error.
    *           # If we have an error case, reduce the original decoded string to ''
    'Error'or   # Standard fallback mechanism
    

    Peter Taylor

    Posted 2014-11-30T13:26:07.213

    Reputation: 41 901

    1 byte. Challenge accepted! – Optimizer – 2014-12-01T15:47:03.443

    @Optimizer, a straight translation to CJam will save a good few bytes because it has a lot of relevant built-ins. – Peter Taylor – 2014-12-01T15:56:22.783

    My hashing function is 57 bytes long, while yours is 52. So I can only see at most 5 bytes saving ... – Optimizer – 2014-12-01T15:58:32.893

    I'm glad my comment in the sandbox was useful. I hoped it might be possible to use the fact that M was one of the special cases to test for a valid start, but it hasn't worked out that way. There are still 8 pairs of identical letters in that string. I wonder if they can be compressed as lowercase letters: g-->GG a-->AA etc. If decompression can be achieved in under 8 characters it would be worthwhile. – Level River St – 2014-12-02T20:03:57.803

    1

    Python, 473 bytes

    t={'U(A[AG]|GA)':'*','GC.':'A','UG[UC]':'C','GA[UC]':'D','GA[AG]':'E','UU[UC]':'F','GG.':'G','CA[UC]':'H','AU[UCA]':'I','AA[AG]':'K','(UU[AG]|CU.)':'L','AUG':'M','AA[UC]':'N','CC.':'P','CA[AG]':'Q','(CG.|AG[AG])':'R','(UC.|AG[UC])':'S','AC.':'T','GU.':'V','UGG':'W','UA[UC]':'Y'}
    import re
    i=raw_input()
    a=''
    for x in[i[y:y+3]for y in range(0,len(i),3)]:
     a+=[t[u]for u in t.keys()if re.match(u, x)][0]
    print["Error",a][all((a[0]+a[-1]=="M*",len(i)%3==0,not"*"in a[1:-1]))]
    

    James Williams

    Posted 2014-11-30T13:26:07.213

    Reputation: 1 735

    1

    Python 2, 370 358 354 bytes

    This is a very straight forward approach using no compression, just trying to pack the information quite densly:

    s=lambda x:x and[x[:3]]+s(x[3:])or[]
    def f(I):O=''.join(d*any(0==x.find(p)for p in e)for x in s(I)for d,e in zip('*ACDEFGHIKLMNPQRSTVWY',map(s,'UAAUAGUGA,GC,UGUUGC,GAUGAC,GAAGAG,UUUUUC,GG,CAUCAC,AUUAUCAUA,AAAAAG,UUAUUGCU,AUG,AAUAAC,CC,CAACAG,AGAAGGCG,AGUAGCUC,AC,GU,UGG,UAUUAC'.split(','))));return['Error',O][len(I)%3==0==len(O)-O.find('*')-(O[0]=='M')]
    

    Edit: Shaved off a few characters following xnor's suggestion.

    Emil

    Posted 2014-11-30T13:26:07.213

    Reputation: 1 438

    I believe you can write s shorter recursively as s=lambda x:x and[x[:3]]+s(x[3:]). – xnor – 2014-11-30T19:25:25.930

    @xnor Great, I didn't think of that. It does not quite work like that, because at the end of the recursion it will output an empty string not an empty list. But with four more character I can make it work. Thanks! – Emil – 2014-11-30T19:41:22.873

    1

    Scala (317 chars)

    def g(c:Char)="ACGU"indexOf c;def h(s:String,i:Int)=g(s(i))*16+g(s(i+1))*4+g(s(i+2));def p(a:Int)=a!=48&&a!=50&&a!=56;def f(s:String)=if(s.length%3!=0||h(s,0)!=14||p(h(s,s.length-3)))"Error"else{var r="";for(i<-0 to s.length-3 by 3)r+="KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"charAt h(s,i);r}
    

    The main function is f. Of course, a better choice would be to return an Option[String].

    bb94

    Posted 2014-11-30T13:26:07.213

    Reputation: 1 831

    0

    JavaScript (ES6), 143 bytes

    s=>/^M\w*\*$/.test(s=s.replace(/.../g,c=>"PVLVLHDVLGRGRAPGR*KYNVL*KTSTSGRTSILIFYNMLR*SCTSRWEQDHIFEQPAPASC.A"[parseInt(c,36)%128%65]))?s:'Error'
    

    Try it online!

    Arnauld

    Posted 2014-11-30T13:26:07.213

    Reputation: 111 334