RNA, like DNA, is a molecule found in cells encoding genetic information. It consists of nucleotides, which are represented by the bases adenine (A), cytosine (C), guanine (G) and uracil (U).* A codon is a sequence of three nucleotides.
Proteins are large molecules that perform a vast array of functions; examples include keratin, which is found in hair and nails, and hemoglobin, which carries oxygen in blood cells. They are made up of amino acids, which are encoded as codons in RNA molecules. Different codons may encode the same amino acid. Each amino acid is commonly represented by a single letter; for example, H stands for histidine.
Given a sequence of ACGU, can you translate it into the corresponding protein string?
* DNA consists of ACGT, where the T is thymine. During DNA to RNA transcription, thymine is replaced by uracil.
Input
Input will be a single string consisting of only the characters ACGU in upper case. You may write either a function or a full program for this challenge.
Output
You may choose to output via either printing or returning a string (the latter choice is only available in the case of a function).
Translation should begin at a start codon (AUG, represented as M) and end at a stop codon (one of UAA, UAG or UGA, represented as *). There are four cases where input may be invalid:
- The input does not begin with a start codon
- The input does not end with a stop codon
- The input's length isn't a multiple of 3
- The input contains a stop codon somewhere other than at the end
In all of these cases, Error should be output. Note that, unlike stop codons, start codons may appear after the start of the string.
Otherwise, you should convert each codon into its respective amino acid via the following RNA codon table:
* UAA UAG UGA
A GCU GCC GCA GCG
C UGU UGC
D GAU GAC
E GAA GAG
F UUU UUC
G GGU GGC GGA GGG
H CAU CAC
I AUU AUC AUA
K AAA AAG
L UUA UUG CUU CUC CUA CUG
M AUG
N AAU AAC
P CCU CCC CCA CCG
Q CAA CAG
R CGU CGC CGA CGG AGA AGG
S UCU UCC UCA UCG AGU AGC
T ACU ACC ACA ACG
V GUU GUC GUA GUG
W UGG
Y UAU UAC
...and output the translated string.
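For reference, the rules above can be sketched as an ungolfed Python function. This is only one possible reading of the spec, not a competitive answer: the names `TABLE`, `CODON_TO_AMINO` and `translate` are made up for illustration, and the validity check relies on the observation that a translated string is valid exactly when it starts with M, ends with * and contains no other *.

```python
# Codon table copied from the challenge: amino-acid letter, then its codons.
TABLE = """\
* UAA UAG UGA
A GCU GCC GCA GCG
C UGU UGC
D GAU GAC
E GAA GAG
F UUU UUC
G GGU GGC GGA GGG
H CAU CAC
I AUU AUC AUA
K AAA AAG
L UUA UUG CUU CUC CUA CUG
M AUG
N AAU AAC
P CCU CCC CCA CCG
Q CAA CAG
R CGU CGC CGA CGG AGA AGG
S UCU UCC UCA UCG AGU AGC
T ACU ACC ACA ACG
V GUU GUC GUA GUG
W UGG
Y UAU UAC"""

# Invert the table into a codon -> letter lookup (hypothetical helper name).
CODON_TO_AMINO = {
    codon: amino
    for amino, *codons in (line.split() for line in TABLE.splitlines())
    for codon in codons
}


def translate(rna: str) -> str:
    """Return the protein string for rna, or "Error" if the input is invalid.

    Assumes rna contains only the characters A, C, G and U, as the
    challenge guarantees.
    """
    # Invalid case: length is not a multiple of 3.
    if len(rna) % 3 != 0:
        return "Error"

    # Split into codons and translate each one.
    codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]
    protein = "".join(CODON_TO_AMINO[codon] for codon in codons)

    # Remaining invalid cases: no start codon at the front, no stop codon
    # at the end, or a stop codon anywhere before the end. The empty
    # string fails the first check.
    if not protein.startswith("M") or not protein.endswith("*"):
        return "Error"
    if protein.count("*") != 1:
        return "Error"
    return protein
```

Translating first and validating the result afterwards keeps the checks short; a start codon in the middle of the string simply becomes another M, which the rules allow.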
Examples
Invalid cases:
<empty string> -> Error
AUG -> Error
UAA -> Error
AUGCUAG -> Error
AAAAAAA -> Error
GGGCACUAG -> Error
AUGAACGGA -> Error
AUGUAGUGA -> Error
AUGUUUGUUCCGUCGAAAUACCUAUGAACACGCUAA -> Error
Valid cases:
AUGUGA -> M*
AUGAGGUGUAGCUGA -> MRCS*
AUGGGUGAGAAUGAAACGAUUUGCAGUUAA -> MGENETICS*
AUGCCAGUCGCACGAUUAGUUCACACGCUCUUGUAA -> MPVARLVHTLL*
AUGCUGCGGUCCUCGCAUCUAGCGUUGUGGUUAGGGUGUGUAACUUCGAGAACAGUGAGUCCCGUACCAGGUAGCAUAAUGCGAGCAAUGUCGUACGAUUCAUAG -> MLRSSHLALWLGCVTSRTVSPVPGSIMRAMSYDS*
AUGAAAAACAAGAAUACAACCACGACUAGAAGCAGGAGUAUAAUCAUGAUUCAACACCAGCAUCCACCCCCGCCUCGACGCCGGCGUCUACUCCUGCUUGAAGACGAGGAUGCAGCCGCGGCUGGAGGCGGGGGUGUAGUCGUGGUUUACUAUUCAUCCUCGUCUUGCUGGUGUUUAUUCUUGUUUUAA -> MKNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVYYSSSSCWCLFLF*
Edit: Added more test cases
Scoring
This is code golf, so the code in the fewest bytes wins.
Note: I'm no expert in molecular biology, so feel free to correct me if I've misstated anything :)
A proper translator should be able to find the open reading frame in any string, not just those that start with AUG! – canadianer – 2014-12-01T06:25:41.350
@canadianer Ahaha yeah I initially considered that, but I didn't want to make the question too complicated by bringing in open reading frames (or even translating multiple proteins from a single string) :) – Sp3000 – 2014-12-01T06:51:23.993
The empty string would be a useful test case, because it will break some approaches for testing that the decoded sequence starts with M and ends with *. – Peter Taylor – 2014-12-01T11:44:02.703
@PeterTaylor Added, along with a few more short test cases :) – Sp3000 – 2014-12-01T11:56:45.037
If you wanted to be a real pain, you could use DNA instead of RNA, so you have backwards reading frames too. – user137 – 2014-12-02T18:45:36.133